Skip to content

Histogram CT Zero ingestion#14694

Merged
ArthurSens merged 2 commits intomainfrom
ct-histogram
Sep 26, 2024
Merged

Histogram CT Zero ingestion#14694
ArthurSens merged 2 commits intomainfrom
ct-histogram

Conversation

@ArthurSens
Copy link
Copy Markdown
Member

@ArthurSens ArthurSens commented Aug 19, 2024

Fixes #13384

Fundamentally it follows the same logic used in AppendCTZeroSample(), used to append zeros for usual samples, but for histograms, we're using Go's zero value for histogram.Histogram{} and histogram.FloatHistogram{}

The tricky thing here is that the Histogram format has CounterResetHints, which the query engine uses to detect resets. From my understanding, however, this is a no-op during the scrape and the storage layer handles the addition of counter-reset hints. This PR is only tackling the scrape part of things.

There is one test in head_appender_test that queries the appended data and counter-reset hints are verified as well

@ArthurSens ArthurSens force-pushed the ct-histogram branch 2 times, most recently from e0ee07d to fa1832b Compare August 19, 2024 12:52
Comment thread tsdb/head_append.go
@ArthurSens
Copy link
Copy Markdown
Member Author

Test failure message
panic: test timed out after 30s
running tests:
	TestManagerCTZeroIngestionHistogram (30s)
	TestManagerCTZeroIngestionHistogram/disabled_with_CT_on_histogram (30s)

goroutine 45 [running]:
testing.(*M).startAlarm.func1()
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:2366 +0x1dc
created by time.goFunc
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/time/sleep.go:177 +0x44

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000293380, {0x101136029, 0x23}, 0x102bfabd8)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1750 +0x604
testing.runTests.func1(0xc000293380)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:2161 +0x84
testing.tRunner(0xc000293380, 0xc000805948)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1689 +0x184
testing.runTests(0xc000296978, {0x1033ee5a0, 0x51, 0x51}, {0x30?, 0x8?, 0x1033f7fc0?})
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:2159 +0x6e4
testing.(*M).Run(0xc0002b0b40)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:2027 +0xb78
go.uber.org/goleak.VerifyTestMain({0x102c04cc0, 0xc0002b0b40}, {0xc000845e48, 0x3, 0x3})
	/Users/arthursens/go/pkg/mod/go.uber.org/[email protected]/testmain.go:53 +0x48
github.com/prometheus/prometheus/util/testutil.TolerantVerifyLeak(0xc0002b0b40)
	/Users/arthursens/Documents/projetos/prometheus/util/testutil/testing.go:34 +0x2e4
github.com/prometheus/prometheus/scrape.TestMain(...)
	/Users/arthursens/Documents/projetos/prometheus/scrape/scrape_test.go:61
main.main()
	_testmain.go:217 +0x29c

goroutine 7 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc000226680)
	/Users/arthursens/go/pkg/mod/[email protected]/stats/view/worker.go:292 +0x128
created by go.opencensus.io/stats/view.init.0 in goroutine 1
	/Users/arthursens/go/pkg/mod/[email protected]/stats/view/worker.go:34 +0xf4

goroutine 56 [chan receive]:
testing.(*T).Run(0xc000293520, {0x1011291b2, 0x1d}, 0xc0002a71a0)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1750 +0x604
github.com/prometheus/prometheus/scrape.TestManagerCTZeroIngestionHistogram(0xc000293520)
	/Users/arthursens/Documents/projetos/prometheus/scrape/manager_test.go:940 +0x1ec
testing.tRunner(0xc000293520, 0x102bfabd8)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1689 +0x184
created by testing.(*T).Run in goroutine 1
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1742 +0x5e8

goroutine 57 [select]:
github.com/prometheus/prometheus/util/runutil.Retry(0x5f5e100, 0xc00003e780, 0xc00004fde0)
	/Users/arthursens/Documents/projetos/prometheus/util/runutil/runutil.go:31 +0x128
github.com/prometheus/prometheus/scrape.TestManagerCTZeroIngestionHistogram.func7(0xc0002936c0)
	/Users/arthursens/Documents/projetos/prometheus/scrape/manager_test.go:1008 +0xa64
testing.tRunner(0xc0002936c0, 0xc0002a71a0)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1689 +0x184
created by testing.(*T).Run in goroutine 56
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1742 +0x5e8

goroutine 83 [IO wait]:
internal/poll.runtime_pollWait(0x10c0a9940, 0x72)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/runtime/netpoll.go:345 +0xa0
internal/poll.(*pollDesc).wait(0xc0004a3720, 0x72, 0x0)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/internal/poll/fd_poll_runtime.go:84 +0xb8
internal/poll.(*pollDesc).waitRead(...)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc0004a3700)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/internal/poll/fd_unix.go:611 +0x3a8
net.(*netFD).accept(0xc0004a3700)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/net/fd_unix.go:172 +0x38
net.(*TCPListener).accept(0xc0000e9be0)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/net/tcpsock_posix.go:159 +0x3c
net.(*TCPListener).Accept(0xc0000e9be0)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/net/tcpsock.go:327 +0x68
net/http.(*Server).Serve(0xc0005de000, {0x102c0ae80, 0xc0000e9be0})
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/net/http/server.go:3260 +0x488
net/http/httptest.(*Server).goServe.func1()
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/net/http/httptest/server.go:310 +0xb4
created by net/http/httptest.(*Server).goServe in goroutine 57
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/net/http/httptest/server.go:308 +0x9c

goroutine 87 [select]:
net/http.(*persistConn).writeLoop(0xc00049b440)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/net/http/transport.go:2458 +0x120
created by net/http.(*Transport).dialConn in goroutine 41
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/net/http/transport.go:1800 +0x1cd0

goroutine 85 [select]:
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run(0xc0005c6c60, 0x0)
	/Users/arthursens/Documents/projetos/prometheus/scrape/scrape.go:1270 +0x4c0
created by github.com/prometheus/prometheus/scrape.(*scrapePool).sync in goroutine 84
	/Users/arthursens/Documents/projetos/prometheus/scrape/scrape.go:544 +0x104c

goroutine 86 [IO wait]:
internal/poll.runtime_pollWait(0x10c0a9848, 0x72)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/runtime/netpoll.go:345 +0xa0
internal/poll.(*pollDesc).wait(0xc000548020, 0x72, 0x0)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/internal/poll/fd_poll_runtime.go:84 +0xb8
internal/poll.(*pollDesc).waitRead(...)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000548000, {0xc0005ed000, 0x1000, 0x1000})
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/internal/poll/fd_unix.go:164 +0x320
net.(*netFD).Read(0xc000548000, {0xc0005ed000, 0x1000, 0x1000})
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/net/fd_posix.go:55 +0x48
net.(*conn).Read(0xc000298578, {0xc0005ed000, 0x1000, 0x1000})
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/net/net.go:185 +0x88
net/http.(*persistConn).Read(0xc00049b440, {0xc0005ed000, 0x1000, 0x1000})
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/net/http/transport.go:1977 +0xc4
bufio.(*Reader).fill(0xc000054f00)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/bufio/bufio.go:110 +0x21c
bufio.(*Reader).Peek(0xc000054f00, 0x1)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/bufio/bufio.go:148 +0xb4
net/http.(*persistConn).readLoop(0xc00049b440)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/net/http/transport.go:2141 +0x264
created by net/http.(*Transport).dialConn in goroutine 41
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/net/http/transport.go:1799 +0x1c64

goroutine 43 [IO wait]:
internal/poll.runtime_pollWait(0x10c0a9750, 0x72)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/runtime/netpoll.go:345 +0xa0
internal/poll.(*pollDesc).wait(0xc0005480a0, 0x72, 0x0)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/internal/poll/fd_poll_runtime.go:84 +0xb8
internal/poll.(*pollDesc).waitRead(...)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000548080, {0xc000556000, 0x1000, 0x1000})
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/internal/poll/fd_unix.go:164 +0x320
net.(*netFD).Read(0xc000548080, {0xc000556000, 0x1000, 0x1000})
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/net/fd_posix.go:55 +0x48
net.(*conn).Read(0xc000522050, {0xc000556000, 0x1000, 0x1000})
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/net/net.go:185 +0x88
net/http.(*connReader).Read(0xc000694510, {0xc000556000, 0x1000, 0x1000})
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/net/http/server.go:789 +0x1b0
bufio.(*Reader).fill(0xc000119080)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/bufio/bufio.go:110 +0x21c
bufio.(*Reader).Peek(0xc000119080, 0x4)
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/bufio/bufio.go:148 +0xb4
net/http.(*conn).serve(0xc0005182d0, {0x102c0d938, 0xc000694210})
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/net/http/server.go:2079 +0x10e4
created by net/http.(*Server).Serve in goroutine 83
	/Users/arthursens/go/pkg/mod/golang.org/[email protected]/src/net/http/server.go:3290 +0x678
FAIL	github.com/prometheus/prometheus/scrape	30.728s
FAIL

Here is the failure message I'm struggling to understand 🤔

@ArthurSens
Copy link
Copy Markdown
Member Author

Alright, only 1 test is failing now: TestHeadAppender_AppendHistogramCTZeroSample.

What this test does is ingest Histogram samples with created timestamps in several different scenarios. Some of those scenarios should reject the sample and others should ingested. After that, we query the ingested samples and verify that a histogram zero sample and the normal histogram sample are retrieved as expected.

The test is failing right now because the histogram zero samples are being modified after ingestion with a non-zero CounterResetHint. @beorn7 @krajorama, I'm not familiar with this part. Is it expected that histograms are modified during ingestion? Should I ignore this field when comparing the ingested/queried histogram?

@ArthurSens ArthurSens marked this pull request as ready for review August 23, 2024 15:13
@Maniktherana
Copy link
Copy Markdown
Contributor

Maniktherana commented Aug 23, 2024

Alright, only 1 test is failing now: TestHeadAppender_AppendHistogramCTZeroSample.

What this test does is ingest Histogram samples with created timestamps in several different scenarios. Some of those scenarios should reject the sample and others should ingested. After that, we query the ingested samples and verify that a histogram zero sample and the normal histogram sample are retrieved as expected.

The test is failing right now because the histogram zero samples are being modified after ingestion with a non-zero CounterResetHint. @beorn7 @krajorama, I'm not familiar with this part. Is it expected that histograms are modified during ingestion? Should I ignore this field when comparing the ingested/queried histogram?

relevant: @bwplotka's comment #13384 (comment)

@krajorama krajorama self-requested a review August 27, 2024 14:49
@krajorama
Copy link
Copy Markdown
Member

krajorama commented Aug 27, 2024

Alright, only 1 test is failing now: TestHeadAppender_AppendHistogramCTZeroSample.

What this test does is ingest Histogram samples with created timestamps in several different scenarios. Some of those scenarios should reject the sample and others should ingested. After that, we query the ingested samples and verify that a histogram zero sample and the normal histogram sample are retrieved as expected.

The test is failing right now because the histogram zero samples are being modified after ingestion with a non-zero CounterResetHint. @beorn7 @krajorama, I'm not familiar with this part. Is it expected that histograms are modified during ingestion? Should I ignore this field when comparing the ingested/queried histogram?

Yes, the appended histogram as well as the resulting histogram in TSDB can be modified by AppendHistogram.
The counter reset hint is set to help the PromQL engine in counter reset detection. If the count reset hint is left 0, then the PromQL engine runs the detection every time , which is super costly for native histograms, so the Appender tries to set the counter reset to either 1 or 2 (detected reset, detected no reset) to help PromQL. So reading back different hint is totally expected. Also if I understand correctly, the zero histogram is getting written into TSDB as an extra sample. That will induce a counter reset, so to see Counter resets is totally expected.

There's two strategies in tests around this:

  1. If we are not interested in the counter resets, in which case you'd set the counter reset hint to 0 in the actual samples before making the comparison with the expected samples.
  2. If we are interested in the counter resets, in which case you have to know/calculate what the expected hint should be.

If AppendHistogramCTZeroSample was just simply calling AppendHistogram I would say go with 1. , but it's more complicated and might get even more complicated due to out-of-order (#14546 ) so I think we need to go for 2 and encode the expected counter resets.

Copy link
Copy Markdown
Member

@krajorama krajorama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first pass feedback on tests

Comment thread tsdb/head_test.go Outdated
Comment thread tsdb/head_test.go Outdated
@ArthurSens
Copy link
Copy Markdown
Member Author

Appender tries to set the counter reset to either 1 or 2 (detected reset, detected no reset) to help PromQL.

Also if I understand correctly, the zero histogram is getting written into TSDB as an extra sample. That will induce a counter reset, so to see Counter resets is totally expected.

Ah ok that makes sense then. Yes, we should always see a CounterResetHint==2 when ingesting a zero sample histogram.

But that leaves me with another question: If we're scraping a histogram for the first time and this histogram has set CT during exposition. Should the zero sample also contain ResetHint==2? Theoretically there was no reset since it's the first sample ever ingested for the series

@krajorama
Copy link
Copy Markdown
Member

Appender tries to set the counter reset to either 1 or 2 (detected reset, detected no reset) to help PromQL.

Also if I understand correctly, the zero histogram is getting written into TSDB as an extra sample. That will induce a counter reset, so to see Counter resets is totally expected.

Ah ok that makes sense then. Yes, we should always see a CounterResetHint==2 when ingesting a zero sample histogram.

But that leaves me with another question: If we're scraping a histogram for the first time and this histogram has set CT during exposition. Should the zero sample also contain ResetHint==2? Theoretically there was no reset since it's the first sample ever ingested for the series

Yep, you got it. If there's no previous sample, we encode 0 (unknown reset) because later when the chunk is read back there might be some previous sample loaded from disk for example and we cannot say counter reset or no reset with any confidence, so we encode "unknown".

Comment thread tsdb/head_append.go
Comment thread tsdb/head_append.go
@ArthurSens
Copy link
Copy Markdown
Member Author

Converted to draft while I work on the failing test in head_appender

@ArthurSens
Copy link
Copy Markdown
Member Author

This PR will conflict with #14546 once merged. I don't mind resolving the conflicts here, so it's up to you which PR gets merged first @krajorama!

@ArthurSens ArthurSens marked this pull request as draft September 17, 2024 10:21
@ArthurSens
Copy link
Copy Markdown
Member Author

Converting to draft as I work through the conflicts

@krajorama
Copy link
Copy Markdown
Member

Converting to draft as I work through the conflicts

Many thanks for taking the hit, the OOO NH PR was months in the making.

@ArthurSens ArthurSens marked this pull request as ready for review September 18, 2024 07:04
@ArthurSens
Copy link
Copy Markdown
Member Author

Many thanks for taking the hit, the OOO NH PR was months in the making.

No problems :)
PR should be ready for review now

Comment thread tsdb/head_append.go
Comment on lines +768 to +785
s := a.head.series.getByID(chunks.HeadSeriesRef(ref))
if s == nil {
// Ensure no empty labels have gotten through.
lset = lset.WithoutEmpty()
if lset.IsEmpty() {
return 0, fmt.Errorf("empty labelset: %w", ErrInvalidSample)
}

if l, dup := lset.HasDuplicateLabelNames(); dup {
return 0, fmt.Errorf(`label name "%s" is not unique: %w`, l, ErrInvalidSample)
}

var created bool
var err error
s, created, err = a.head.getOrCreate(lset.Hash(), lset)
if err != nil {
return 0, err
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part can be abstracted into a separate function, just like we do when appending floats, and we reduce duplicated code with AppendHistogram. I'd like to do it in a separate PR though

Copy link
Copy Markdown
Member

@jesusvazquez jesusvazquez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Left one question to fully understand.

Comment thread scrape/manager_test.go
// Start fake HTTP target to that allow one scrape only.
server := httptest.NewServer(
http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
fail := true // TODO(bwplotka): Kill or use?
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the TODO?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not, but @Maniktherana is reworking this test case in #14738. I'm hoping to rebase this PR if that one is merged first, or just open a new pull request updating this one if his PR takes too long. Would that be ok?

Comment thread tsdb/head_append.go Outdated
case h != nil:
zeroHistogram := &histogram.Histogram{}
s.Lock()
_, _, err := s.appendableHistogram(ct, zeroHistogram, a.headMaxt, a.minValidTime, a.oooTimeWindow, false) // OOO is not allowed for CTZeroSamples.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why OOO is not allowed for CT Zero Samples?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's because after the first sample, every CT is OOO. I'm not sure if WBL accepts samples with duplicated timestamps, but it doesn't sound efficient to append the same 0 sample with the same timestamp on every scrape 🤔

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's something that we can probably improve and eventually allow OOO CT.

I guess I could improve the comment here with a better explanation of why OOO CT is not allowed TODAY, if my comment doesn't make sense in the future then people might feel comfortable in changing the codebase 🤔

Comment thread tsdb/head_append.go Outdated
Comment thread tsdb/head_append.go Outdated
Comment thread tsdb/head_test.go Outdated
Comment thread tsdb/head_test.go Outdated
@ArthurSens ArthurSens force-pushed the ct-histogram branch 2 times, most recently from 4825787 to 3342f5c Compare September 23, 2024 20:34
@ArthurSens
Copy link
Copy Markdown
Member Author

ArthurSens commented Sep 25, 2024

It should be ready for another review :)

Comment thread tsdb/head_append.go Outdated
if ct > a.maxt {
a.maxt = ct
}
return 0, nil
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistency with float handling:

Suggested change
return 0, nil
return storage.SeriesRef(s.ref), nil

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! It's not only inconsistent, it would be incorrect to always return 0! The reference is used in following Append calls

Comment thread tsdb/head_test.go
require.NoError(t, err)
}
}
require.NoError(t, a.Commit())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it's kind of weird that we don't test with multiple Commits() , thus the path to return out of order is never exercised. As far as I can tell that was the case for floats anyway, so maybe another PR?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Yeah, we haven't tested OOO well enough, that was a good call out. If that's ok I'd be happy to extend the tests here to cover OOO in another PR

@ArthurSens
Copy link
Copy Markdown
Member Author

Just did a quick experiment with an app exposing native histograms, I can see the first one being ingested as a 0 🥳

image

Copy link
Copy Markdown
Member

@krajorama krajorama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ArthurSens
Copy link
Copy Markdown
Member Author

Taking the liberty to merge!

@ArthurSens ArthurSens merged commit d5f65cf into main Sep 26, 2024
@ArthurSens ArthurSens deleted the ct-histogram branch September 26, 2024 15:48
jan--f added a commit to jan--f/prometheus that referenced this pull request Oct 29, 2024
julienduchesne pushed a commit to julienduchesne/prometheus that referenced this pull request Dec 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tsdb: need histogram support for created-timestamp handling

5 participants