Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

promql: make lookback and matrix selections left-open and right-closed #13904

Merged
merged 1 commit into from
Jun 20, 2024

Conversation

KofClubs
Copy link
Contributor

@KofClubs KofClubs commented Apr 8, 2024

Fixes #13213

@KofClubs
Copy link
Contributor Author

KofClubs commented Apr 8, 2024

This Issue/PR is under discussion and is just an example.

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this. This might be the only change needed. Best to see by adding test cases (or observing which existing tests change their result in which way), which should mostly affect the framework tests in https://github.com/prometheus/prometheus/tree/main/promql/testdata .

@KofClubs
Copy link
Contributor Author

Best to see by adding test cases (or observing which existing tests change their result in which way), which should mostly affect the framework tests in https://github.com/prometheus/prometheus/tree/main/promql/testdata .

I am working on unit tests. I ran all the unit tests and tracked 26 errors that shall be fixed:

go test github.com/prometheus/prometheus/...

@KofClubs KofClubs requested a review from dgl as a code owner April 12, 2024 15:04
@beorn7
Copy link
Member

beorn7 commented May 2, 2024

Sorry for the delay. I'm back from vacation and will try to get to this ASAP. If anyone qualified beats me to it, even better.

@beorn7
Copy link
Member

beorn7 commented May 2, 2024

In the meantime, @KofClubs could you fix the DCO? It should be easy. See the details of the failed DCO test.

@vpranckaitis
Copy link
Contributor

Does this also adjust the lookback behavior? If I understand correctly, with lookback set to 5m, queries some_metric and last_over_time(some_metric[5m]) should be interchangeable (as long as there are no staleness markers), shouldn't they?

@beorn7
Copy link
Member

beorn7 commented May 8, 2024

@vpranckaitis that's a good point. The lookback should be "left-open", too. Maybe it is already. We have to check.

@beorn7
Copy link
Member

beorn7 commented May 8, 2024

I still haven't found time for a detailed review. I will try later this week or early next week. My apologies.

@KofClubs
Copy link
Contributor Author

In the meantime, @KofClubs could you fix the DCO? It should be easy. See the details of the failed DCO test.

OK. I shall fixup all commits to one after review and sign it off.

Sorry, I've been extremely busy recently. I shall work on this PR now, to resolve conflicts firstly.

@KofClubs KofClubs force-pushed the matrix-select-lorc branch 2 times, most recently from 52ac3ea to e940058 Compare May 13, 2024 16:47
@KofClubs
Copy link
Contributor Author

With conflicts resolved and commits signed off, I think this PR is ready to be reviewed:)

@KofClubs KofClubs changed the title [WIP] promql: make matrix selections left-open and right-closed promql: make matrix selections left-open and right-closed May 14, 2024
Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general. A few subtleties in the comments.

I will look into the question of the lookback delta being inclusive ASAP.

@beorn7
Copy link
Member

beorn7 commented May 15, 2024

About the lookback delta:

There is even a test for it, see

load 10s
metric 0
# Series with single point goes stale after 5 minutes.
eval instant at 0s metric
{__name__="metric"} 0
eval instant at 150s metric
{__name__="metric"} 0
eval instant at 300s metric
{__name__="metric"} 0
eval instant at 301s metric

Which means that the lookback is "inclusive" right now, but it needs to be "exclusive", so essentially, in that test, the at 300s should also yield an empty result, is does the at 301s now. This should be another simple change and should be included in this PR.

@beorn7
Copy link
Member

beorn7 commented May 15, 2024

And yet another question: How to merge this?

To merge this into main before the v3 release, we need to hide the behavior behind a feature flag. However, that doesn't play well with tests. So we could have the behavior changes in a separate PR, where they are behind a feature flag, and not tested at all. And then we have this PR with all the tests, which will go into a v3 branch, which doesn't exist yet, but maybe we should create one soon.

Another option would be to only launch this feature with v3 and don't have it behind a feature flag in v2 at all.

WDYT?

@KofClubs
Copy link
Contributor Author

About the lookback delta:

Get it. I think there are some aspects worth exploring.

Motivation

Apart from making behavior consistent, are there any other motivations for making lookback into (now-5m, now]?
It seems that the opening and closing of the left bound (now-5m) does not affect the correctness of the lookback behavior.

From Storage Perspective

I don't know if there is any bias in my understanding. If lookback becomes left-open and right-closed, then from the perspective of storage (which can be Prometheus itself, or another TSDB replaced by users), when retrieving data, the time interval shall become left-open and right-closed: (start, end]. Does Prometheus engine need to ask storage to ignore samples falling on start?

If there are any misunderstandings above, please feel free to point them out:)

@KofClubs
Copy link
Contributor Author

WDYT?

Both OK. While, I think this change has a large impact. Considering the change in user habits and behaviors, adding a flag seems to be necessary.
I think left-open and right-closed selection shall be enabled by default in Prometheus 3.0. Unless the user turns it on, like --matrix-selector-version=2, left-closed and right-closed selection shall never work. And no unit tests for v2-like behaviors shall be implemented in v3.

@beorn7
Copy link
Member

beorn7 commented May 16, 2024

WRT lookback delta: I think the one and only argument is the one made by @vpranckaitis already. "Correct" is whatever behavior we define as such.

The implementation details in the storage shouldn't matter much. The easiest way for now is probably to simply reduce the timestamp by 1ms. And it should be included in this PR to keep things consistent.

@beorn7
Copy link
Member

beorn7 commented May 16, 2024

About flags and configurability:

I think it is uncontroversial that v3 should only support the new behavior (the one implemented in this PR), with no way of bringing back the old behavior. So far, everyone that thought about it has agreed that the "left closed" behavior was a mistake and should have never happened in the first place.

The only question is how to deal with it in v2. And I see two feasible options:

  1. Do a hard switch with v3, so v2 always has the old behavior and v3 always has the new behavior. Fullstop.
  2. Make v2 configurable with a feature flag, so people can try out the new behavior without migrating to v3.

The problem with (2) is that it is hard to cover in tests (unless we add feature flag configuration to the testing framework), but that's maybe not a big problem because it will be tested in the v3 branch. On the other hand, the change in behavior is almost irrelevant in practice and mostly matters it tests (where you have perfectly aligned artificial data), so maybe a feature flag isn't even necessary and (1) is the way to go.

@beorn7
Copy link
Member

beorn7 commented May 16, 2024

I unresolved two conversation above where you went into the opposite direction than intended.

@KofClubs
Copy link
Contributor Author

I unresolved two conversation above where you went into the opposite direction than intended.

Thanks. Both resolved. Review them please:)

@KofClubs
Copy link
Contributor Author

(1) is the way to go.

Merging this update to v3 directly sounds good:)

As there are significant functional differences between v2 and v3, are there plans to continue maintaining v2 after v3 is released? When shall the life cycle of v2 end?

Sorry for not knowing about the planning of Prometheus 3.0 fully, so I would like to ask these questions sincerely:)

@KofClubs KofClubs force-pushed the matrix-select-lorc branch 2 times, most recently from e19a42d to 381f8d5 Compare May 24, 2024 09:34
@KofClubs
Copy link
Contributor Author

Several unit tests to be changed.

I shall continue to work on them.

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are three unresolved comments, with my now more detailed suggestions how to resolve them.

Comment on lines 4474 to 3493
offset := len(c.histograms)
newTs := ts + int64(time.Minute/time.Millisecond)*int64(offset-1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still unresolved.

@@ -77,7 +71,7 @@ load 5m
# Tests for increase().
eval instant at 50m increase(http_requests[50m])
{path="/foo"} 100
{path="/bar"} 90
{path="/bar"} 88.88888888888889
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I played with it. I still think we should change the test so that it yields round numbers, but that requires some more changes. Here is my suggestion:

# Tests for increase().
load 5m
	http_requests{path="/foo"}	0+10x11
	http_requests{path="/bar"}	0+10x5 0+10x5
	http_requests{path="/dings"}   10+10x11
	http_requests{path="/bumms"}    1+10x11

eval instant at 55m increase(http_requests[55m])
	{path="/foo"}   110
	{path="/bar"}    99
	{path="/dings"} 110
	{path="/bumms"} 110

@@ -115,11 +109,10 @@ load 5m

# Counter resets at in the middle of range are handled correctly by rate().
eval instant at 50m rate(testcounter_reset_middle[50m])
{} 0.03
{} 0.02962962962962963
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, we need to change a few more things. Here my suggestion:

# Tests for rate().
load 5m
	testcounter_reset_middle	10+10x4 0+10x5
	testcounter_reset_end    	10+10x9 0 10

# Counter resets at in the middle of range are handled correctly by rate().
eval instant at 50m rate(testcounter_reset_middle[55m])
	{} 0.03

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments about the documentation.

@beorn7
Copy link
Member

beorn7 commented Jun 13, 2024

Thank you very much. We are almost done here. Just the few comments above, all with suggestions what to change.

I have proposed to the v3 coordinators to start a release branch soon. Let's see when it happens.

@KofClubs KofClubs force-pushed the matrix-select-lorc branch from 4ac3fb9 to f91ae4e Compare June 14, 2024 08:49
@KofClubs
Copy link
Contributor Author

@beorn7 All resolved. Thank you!

@beorn7
Copy link
Member

beorn7 commented Jun 18, 2024

Looks good. Thank you very much.

(I'm not formally approving this PR right now, because we don't want to merge it into main. We need the v3 release branch to exist before merging.)

@beorn7 beorn7 changed the base branch from main to release-3.0 June 20, 2024 13:16
@beorn7
Copy link
Member

beorn7 commented Jun 20, 2024

We have a v3 branch: https://github.com/prometheus/prometheus/tree/release-3.0

I have already changed the merge target for this PR.

However, there are merge conflicts again (have been there before I changed the merge target).

@KofClubs could you rebase on top of the current state of the new release-3.0 branch and address the merge conflicts?

Signed-off-by: Zhang Zhanpeng <[email protected]>
Signed-off-by: beorn7 <[email protected]>
Co-authored-by: beorn7 <[email protected]>
@KofClubs KofClubs force-pushed the matrix-select-lorc branch from f91ae4e to debbdb8 Compare June 20, 2024 14:05
@KofClubs
Copy link
Contributor Author

@beorn7 Done. Thank you:)

@beorn7
Copy link
Member

beorn7 commented Jun 20, 2024

Great stuff. Will merge on green.

@beorn7 beorn7 merged commit d7089a3 into prometheus:release-3.0 Jun 20, 2024
30 checks passed
jan--f added a commit to jan--f/prometheus that referenced this pull request Jul 1, 2024
 Conflicts:
  promql/engine_test.go
    Resolved by picking main changes but adjusting total_samples for
    query "max_over_time(metricWith1HistogramEvery10Seconds[60s])[20s:5s]"
    to 312. Via prometheus#13662 this
    histogram now stores 13 values per timestamp, but via
    prometheus#13904 the range query
    is now left-open.
  promql/promqltest/testdata/functions.test
    Resolved by picking changes in main. See also
    prometheus#13662.
  promql/promqltest/testdata/histograms.test
    Resolved by picking changes in main. See also
    prometheus#13662, but adjust some
    range selectors (`s/5m/10m/`) to account for
    prometheus#13904
jan--f added a commit to jan--f/prometheus that referenced this pull request Jul 3, 2024
 Conflicts:
  promql/engine_test.go
    Resolved by picking main changes but adjusting total_samples for
    query "max_over_time(metricWith1HistogramEvery10Seconds[60s])[20s:5s]"
    to 312. Via prometheus#13662 this
    histogram now stores 13 values per timestamp, but via
    prometheus#13904 the range query
    is now left-open.
  promql/promqltest/testdata/functions.test
    Resolved by picking changes in main. See also
    prometheus#13662, but adjust some
    range selectors (`s/1m/2m/`) to account for
    prometheus#13904.
  promql/promqltest/testdata/histograms.test
    Resolved by picking changes in main. See also
    prometheus#13662, but adjust some
    range selectors (`s/5m/10m/`) to account for
    prometheus#13904.

Signed-off-by: Jan Fajerski <[email protected]>
jan--f added a commit to jan--f/prometheus that referenced this pull request Jul 18, 2024
Signed-off-by: Jan Fajerski <[email protected]>

Conflicts:
	VERSION
          pick 3.0.0
	promql/promqltest/testdata/histograms.test
          pick changes from c39776c,
          but adjust 5m range selectors to 10m to account for
          prometheus#13904.
Fixes:
        promql/promqltest/testdata/functions.test
	promql/promqltest/testdata/staleness.test
          Tests added in prometheus#9138
          need to be adjusted to account for
          prometheus#13904.
jan--f added a commit to jan--f/prometheus that referenced this pull request Aug 21, 2024
Some test queries need their interval adjusted to account for
prometheus#13904. Otherwise the
queries don't return enough samples.

Signed-off-by: Jan Fajerski <[email protected]>
jan--f added a commit to jan--f/prometheus that referenced this pull request Aug 21, 2024
Some test queries need their interval adjusted to account for
prometheus#13904. Otherwise the
queries don't return enough samples.
promql/engine_test.go:TestHistogramCopyFromIteratorRegression needed the
same, but also the result needed a fix since `increase` interpolates
over the full range.

Signed-off-by: Jan Fajerski <[email protected]>
jan--f added a commit to jan--f/prometheus that referenced this pull request Sep 2, 2024
In order to fix new tests for changes added in
prometheus#13904.

Signed-off-by: Jan Fajerski <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

promql: Make range selections left-open and right-closed
4 participants