Conversation

@zenador (Contributor) commented Mar 19, 2024

Addresses #13508

Currently histogram_quantile is failing on the newly converted histograms, but the other basic histogram functions work as expected. (fixed now)

Needed to update the behaviour of histogram_stddev and histogram_stdvar to handle negative bounds properly (previously it was already bugged even for exponential histograms with negative observations). (now in #13824)

@zenador zenador marked this pull request as draft March 19, 2024 16:07
@krajorama krajorama self-requested a review March 20, 2024 11:52
promql/test.go Outdated
vec = append(vec, Sample{Metric: series.Metric, T: point.T, H: point.H})
break
}
if strings.Contains(iq.expr, "_bucket") {
Member

I'd rather we didn't do this, we should make the copy by hand to be explicit. Unit tests should be very clear, explicit and define expectations.

Contributor Author
@zenador Mar 20, 2024

Can we discuss this more at the next sync? We are already implicitly loading all these classic histograms as nhcb, so it feels inconsistent to load them implicitly but run the testing explicitly as separate extra tests. I thought the idea was to automatically extend test coverage for classic histograms to the converted nhcb, to ensure that, as far as our tests are concerned, the conversion introduces no regressions and classic/nhcb return exactly the same results. If we make the testing explicit, we will have to duplicate a lot of the unit test code, and it will be very easy to miss test cases: for example, if someone adds new test cases for classic histograms but overlooks adding the nhcb version, there may be cases where they yield different results but we won't know about it. Doing it automatically ensures we don't miss out on any tests that we have.

I agree though that it may come as a surprise, so if we decide to keep this, we should clearly document it somewhere. Alternatively, we could introduce a keyword in the tests to mark that this test should be repeated on the nhcb version, and add it to all applicable tests, so it will be explicit and people adding new tests will be less likely to miss it.

There is a con of doing this implicitly though, as there are rare cases where the classic histograms and nhcb yield different results, as seen in the tests removed in the latest commit:

  • The More realistic with rates. segment: this is a bug and I'll fix it as soon as I figure out what's going on (fixed, will elaborate in follow up comment)
  • nonmonotonic_bucket series can't be converted into a valid nhcb at all, so we will not be able to test on both (and this non-monotonic problem should be gone with nhcb). The keyword approach I mentioned above should be ideal for this. Also I'm not sure why we test this, naturally we can't guarantee correctness if there are missing scrapes and inconsistent buckets.
  • mixed_bucket series in the test is currently giving the wrong result. To check this, remove the series with the redefined buckets in the loading segment and run it. It will return 0.2 instead of 0.15, and 0.2 is what the nhcb version returns. This bug should probably be fixed separately (completely out of scope for nhcb), though I'm not sure if it's worth supporting this edge case - we could just say we don't guarantee correctness if there are duplicate buckets, unless there's a very good reason why we need to support it, e.g. it happens often for reasons that are hard to fix

Member

Technically we have 5 choices as far as I see:

  • explicit tests (easy, but error prone)
  • assert that all tests have both versions
  • able to mark tests to include in checks with nhcb
  • able to mark tests to exclude in checks with nhcb
  • do what we do now and fix all cases (not possible)

Do we have access to all eval statements? We could avoid having to mark exceptions by just defining the nhcb version explicitly.

Also, technically you'd have to remove the le label from the nhcb expressions, as that's the official conversion to a native query.

Contributor Author
@zenador Mar 21, 2024

The 2nd one seems a bit complicated to implement. The 3rd is slightly more troublesome than the 4th (we'd have to update all the test cases), but it is more explicit; I don't really mind either the 3rd or the 4th.

We could avoid having to mark exceptions by just defining the nhcb version explicitly.

Sorry, what do you mean by this?

Also technically you'd have to remove le label from the nhcb expressions as that's the official conversion to native query.

Yes it works well even without doing it, but I've updated the last commit to remove it anyway.

Contributor Author
@zenador Mar 22, 2024

# More realistic with rates.
eval instant at 50m histogram_quantile(0.2, rate(testhistogram_bucket[5m]))
	{start="positive"} 0.048
	{start="negative"} -0.2

eval instant at 50m histogram_quantile(0.5, rate(testhistogram_bucket[5m]))
	{start="positive"} 0.15
	{start="negative"} -0.15

eval instant at 50m histogram_quantile(0.8, rate(testhistogram_bucket[5m]))
	{start="positive"} 0.72
	{start="negative"} 0.3

The above (originally included) tests were initially failing only for the nhcb version, but the bug is now fixed. Note that even without the fix, the following tests were working for both classic and nhcb:

# More realistic with rates.
eval instant at 50m histogram_quantile(0.2, rate(testhistogram_bucket{start="positive"}[5m]))
	{start="positive"} 0.048

eval instant at 50m histogram_quantile(0.2, rate(testhistogram_bucket{start="negative"}[5m]))
	{start="negative"} -0.2

eval instant at 50m histogram_quantile(0.5, rate(testhistogram_bucket{start="positive"}[5m]))
	{start="positive"} 0.15

eval instant at 50m histogram_quantile(0.5, rate(testhistogram_bucket{start="negative"}[5m]))
	{start="negative"} -0.15

eval instant at 50m histogram_quantile(0.8, rate(testhistogram_bucket{start="positive"}[5m]))
	{start="positive"} 0.72

eval instant at 50m histogram_quantile(0.8, rate(testhistogram_bucket{start="negative"}[5m]))
	{start="negative"} 0.3

Meaning that when you mix the positive and negative series (which have different bucket bounds) together, something gets corrupted somewhere. This isn't a problem for the tests on other metrics, as their series have consistent bucket bounds, and it isn't a problem for the other tests on this metric, as those have no rate (which might involve histograms being copied between series).

Since this also isn't a problem for classic histograms or exponential native histograms, the problem is with our new customValues field, which we shallow copied when we added it to the model instead of doing a deep copy. I added the fix commit to the earlier chunks PR because I'd rather fix this newly introduced bug before we merge nhcb to main. We can revisit these optimisations later, once the nhcb implementation is more complete and we have more tests.

Evidence of the corruption (when printing the result of rate after it is passed into histogram_quantile):

the correct bounds with separate queries:

s: {start="positive"} => {count:0.04, sum:0, [-Inf,0.1]:0.016666666666666666, (0.1,0.2]:0.006666666666666667, (0.2,1]:0.013333333333333334, (1,+Inf]:0.0033333333333333335} @[3000000]
s: {start="negative"} => {count:0.01, sum:0, [-Inf,-0.2]:0.0033333333333333335, (-0.2,-0.1]:0.0033333333333333335, (0.3,+Inf]:0.0033333333333333335} @[3000000]

the wrongly duplicated bounds with combined query:

s: {start="negative"} => {count:0.01, sum:0, [-Inf,0.1]:0.0033333333333333335, (0.1,0.2]:0.0033333333333333335, (1,+Inf]:0.0033333333333333335} @[3000000] <- copied wrongly from other series
s: {start="positive"} => {count:0.04, sum:0, [-Inf,0.1]:0.016666666666666666, (0.1,0.2]:0.006666666666666667, (0.2,1]:0.013333333333333334, (1,+Inf]:0.0033333333333333335} @[3000000]

the wrongly duplicated bounds with combined query, alternate version (the order seems non-deterministic, so sometimes you get the earlier one and sometimes it's this one):

s: {start="positive"} => {count:0.04, sum:0, [-Inf,-0.2]:0.016666666666666666, (-0.2,-0.1]:0.006666666666666667, (-0.1,0.3]:0.013333333333333334, (0.3,+Inf]:0.0033333333333333335} @[3000000] <- copied wrongly from other series
s: {start="negative"} => {count:0.01, sum:0, [-Inf,-0.2]:0.0033333333333333335, (-0.2,-0.1]:0.0033333333333333335, (0.3,+Inf]:0.0033333333333333335} @[3000000]
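The bug class described above, where a shallow copy leaves two series sharing one custom-bounds slice, can be sketched like this. The `histogram` type and `customValues` field here are simplified stand-ins for illustration, not Prometheus's actual model types:

```go
package main

import "fmt"

// histogram is a simplified stand-in for a native histogram model type
// that carries custom bucket upper bounds.
type histogram struct {
	customValues []float64 // custom bucket upper bounds
}

// shallowCopy shares the underlying slice with the original: mutating
// the copy's bounds corrupts the original series too.
func (h *histogram) shallowCopy() *histogram {
	return &histogram{customValues: h.customValues}
}

// deepCopy allocates fresh backing storage for the bounds, so each
// series owns its own slice.
func (h *histogram) deepCopy() *histogram {
	bounds := make([]float64, len(h.customValues))
	copy(bounds, h.customValues)
	return &histogram{customValues: bounds}
}

func main() {
	orig := &histogram{customValues: []float64{0.1, 0.2, 1}}
	s := orig.shallowCopy()
	s.customValues[0] = -0.2          // meant for another series...
	fmt.Println(orig.customValues[0]) // ...but orig is corrupted: -0.2

	orig = &histogram{customValues: []float64{0.1, 0.2, 1}}
	d := orig.deepCopy()
	d.customValues[0] = -0.2
	fmt.Println(orig.customValues[0]) // unaffected: 0.1
}
```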

Contributor Author

Moved this part to #13827

Member

About the mixed_bucket test case: This needs to be studied carefully. I think the test is actually not very suitable, because all the buckets are the same, so we are essentially exercising edge cases of the interpolation algorithm (and 0.15 and 0.2 might both be mathematically correct answers). We should have a test that has different buckets defined in different formats, and then maybe even one where the same buckets are defined with different formats (like now) but have different values. (I'm not even sure what the behavior should be then. This "float as a string" thing in classic histograms is really a big mess. Maybe it is supposed to sum them all up? That would even be relevant to how we convert classic histograms into NHCB.)

About the nonmonotonic_bucket test: This tests how PromQL deals with this case of invalid input. It's important to have those tests.

In both cases, I see two ways we could deal with them for NHCB:

  1. We could state that we only convert classic histograms into NHCB if they don't have those shenanigans, i.e. they must not have duplicate buckets in different formats, and they must be monotonic; otherwise we'll just decline to convert them. I think this is a reasonable requirement, as the conversion is meant to happen "close to the source" of an instrumented target, where those problems shouldn't occur.
  2. We somehow emulate the work-arounds done for classic histograms during the conversion. For one, fixing non-monotonicity (but how to issue a warning then?). And then mirroring the behavior for differently formatted le labels, in particular if the values collide (for which I don't even know the intended behavior off the top of my head).

I would currently tend towards (1).
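Option (1) boils down to a validity check before conversion. A minimal sketch of the monotonicity part, with a hypothetical `isMonotonic` helper (illustration only, not Prometheus's actual conversion code):

```go
package main

import "fmt"

// isMonotonic reports whether cumulative classic-histogram bucket counts
// never decrease as the upper bounds increase. Under option (1), a
// classic histogram failing this check would simply not be converted
// to NHCB.
func isMonotonic(cumulativeCounts []float64) bool {
	for i := 1; i < len(cumulativeCounts); i++ {
		if cumulativeCounts[i] < cumulativeCounts[i-1] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isMonotonic([]float64{1, 2, 2, 5})) // valid cumulative counts
	fmt.Println(isMonotonic([]float64{1, 3, 2}))    // non-monotonic: decline to convert
}
```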

Member

Thinking more about the test setup, I'm more and more inclined to make everything explicit.

For one, that means that loading a classic histogram needs some keyword to activate conversion into NHCB, mostly because there are cases where the test deliberately contains data that cannot be converted to NHCB (see above), unless we decide to tweak the conversion algorithm so that it can deal with those cases (which maybe we should not, see also above). (I think this is in agreement with @zenador's assessment so far.)

However, I also think we should give up on running the classic histogram tests automatically and implicitly with converted NHCB. Here are the reasons I see:

  • There are queries that deliberately create a different result, e.g. rate(histo_bucket) creates a classic histogram as output, while rate(histo) creates a native histogram. There are more subtle cases like the counter reset detection, which is deliberately broken in classic histograms once negative observations are in the game but works correctly after NHCB conversion. (I don't know if we have a test for that right now, but we could add one some day, or maybe even should.)
  • Even for queries that are supposed to create the same output, the conversion isn't trivial. histogram_quantile(0.5, histo_bucket) is easy. histogram_quantile(0.5, sum by (le) (histo_bucket)) is a bit harder. You also have to take into account the equally valid histogram_quantile(0.5, sum(histo_bucket) by (le)) (and there might be more labels in any order, and you have to find the le label and remove it). Then there are queries involving foo_count and foo_sum, which would need to be converted to histogram_count(foo) and histogram_sum(foo). Etc. In short, I think implementing a correct test converter is a very deep rabbit hole.
  • And now add the problem that there are cases where the conversion to NHCB doesn't work by design (see above, where we concluded we need to mark the conversion explicitly anyway).

In summary, I think we would need to invest a lot of effort chasing an ultimately impossible goal if we want to automatically run all classic histogram tests with NHCB, too. In many cases, we need a human anyway to decide whether the test should run with NHCB at all, and if so, what the query should look like.

Contributor Author

maybe even one where the same buckets are defined with different formats (like now) but they have different values. (I'm not even sure what the behavior should be then.

Yeah, it seems like in that case we could arbitrarily select one and the behaviour would be non-deterministic, so perhaps it's better not to test that and just say we provide no guarantees in such edge cases.

I would currently tend towards (1).

I agree, it is better to only convert valid histograms. If they are problematic, it is better to flag it earlier with an error, so the user can look into fixing it, than to automatically correct it on conversion based on assumptions that may be wrong.

Anyway, both loading and evaluating with native histograms are now opt-in: you need the keyword suffix _with_nhcb for both. So we don't add those suffixes for nonmonotonic_bucket; for mixed_bucket we load it, use the keyword for tests with the same results, skip the keyword for the failing test, and add a separate test for the different result of the nhcb version, so that when this disparity gets looked into in future, the tests can be updated to reflect the change in behaviour.

The conversion in the eval query is very simple, and it works well for the current tests that we have, but for more complicated cases like the one you mentioned, it's best that we don't use the keyword and define a separate test instead. I agree that making the test converter completely correct would be more trouble than it's worth, so we'll just have a simple one for convenience to cover most of the simple cases, and fall back to manually defining it when necessary.
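As a rough illustration of what such a simple converter could look like (a hypothetical sketch, not the PR's actual implementation), the easy cases reduce to dropping the _bucket suffix so the query targets the converted native histogram; aggregations over le, or _count/_sum queries, would fall back to manually defined tests:

```go
package main

import (
	"fmt"
	"strings"
)

// convertQuery is a deliberately naive sketch of the _with_nhcb query
// rewrite: it only strips the _bucket suffix from metric names, which
// covers simple expressions like histogram_quantile over a rate.
// Anything involving sum by (le), _count, or _sum is out of scope here.
func convertQuery(expr string) string {
	return strings.ReplaceAll(expr, "_bucket", "")
}

func main() {
	classic := `histogram_quantile(0.2, rate(testhistogram_bucket[5m]))`
	fmt.Println(convertQuery(classic))
	// histogram_quantile(0.2, rate(testhistogram[5m]))
}
```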

@zenador zenador force-pushed the nhcb-tests branch 8 times, most recently from c19836b to 2e78f2b Compare March 22, 2024 12:25
@zenador zenador force-pushed the nhcb-tests branch 3 times, most recently from 0c6e102 to fad5a66 Compare March 22, 2024 13:59
@zenador zenador changed the title Run unit tests for native histograms with custom buckets converted from classic histograms Add basic unit tests for native histograms with custom buckets converted from classic histograms Mar 22, 2024
@zenador zenador marked this pull request as ready for review March 22, 2024 14:20
@bboreham bboreham changed the title Add basic unit tests for native histograms with custom buckets converted from classic histograms [nhcb branch] Add basic unit tests for native histograms with custom buckets converted from classic histograms Mar 25, 2024
promql/test.go Outdated
for _, histWrapper := range histMap {
sort.Float64s(histWrapper.upperBounds)
upperBounds := make([]float64, 0, len(histWrapper.upperBounds))
prevLe := -372.4869 // assuming that the lowest bound is never this value
Member

Suggested change
prevLe := -372.4869 // assuming that the lowest bound is never this value
prevLe := math.NaN()

Contributor Author

I ended up making this -Inf instead, since that should also not be a valid boundary.
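The resulting pattern, sorting the bounds and de-duplicating with -Inf as the initial sentinel (since -Inf can never be a valid upper bound), might look roughly like this (a sketch of the pattern under discussion, not the exact PR code):

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// dedupeBounds sorts bucket upper bounds and drops duplicates. The
// "previous le" starts at -Inf, which is guaranteed not to be a valid
// boundary, avoiding the arbitrary -372.4869 sentinel.
func dedupeBounds(bounds []float64) []float64 {
	sort.Float64s(bounds)
	out := make([]float64, 0, len(bounds))
	prevLe := math.Inf(-1)
	for _, le := range bounds {
		if le != prevLe {
			out = append(out, le)
			prevLe = le
		}
	}
	return out
}

func main() {
	fmt.Println(dedupeBounds([]float64{0.2, 0.1, 0.2, 1})) // [0.1 0.2 1]
}
```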

Signed-off-by: George Krajcsovits <[email protected]>