Proposed changes to Append CT PR.#1
Proposed changes to Append CT PR.#1ArthurSens merged 3 commits intoArthurSens:scrape-synthetic-zerofrom
Conversation
Changes: * Changed textparse Parser interface for consistency and robustness. * Changed CT interface to be more explicit and handle validation. * Simplified test, change scrapeManager to allow testability. * Added TODOs. Signed-off-by: bwplotka <[email protected]>
Signed-off-by: bwplotka <[email protected]>
ArthurSens
left a comment
There was a problem hiding this comment.
Thanks for raising this and challenging many decisions I've made!
I liked how you changed many places, e.g. feature-flag and method names, to make a point that we're ingesting zeros.
One thing that I couldn't understand was the metadata test, what are we trying to accomplish over there? 🤔
| // CreatedTimestamp returns nil as it's not implemented yet. | ||
| // TODO(bwplotka): https://github.com/prometheus/prometheus/issues/12980 | ||
| func (p *PromParser) CreatedTimestamp() *int64 { | ||
| return nil | ||
| } |
There was a problem hiding this comment.
I had the impression we didn't want to implement CT for Prometheus text. The issue mentioned here also doesn't mention prometheus text parsing 🤔
| func (p *ProtobufParser) CreatedTimestamp(ct *types.Timestamp) bool { | ||
| var foundCT *types.Timestamp | ||
| // CreatedTimestamp returns CT or nil if CT is not present or | ||
| // invalid (as timestamp e.g. negative value) on counters, summaries or histograms. |
There was a problem hiding this comment.
| // invalid (as timestamp e.g. negative value) on counters, summaries or histograms. | |
| // invalid (e.g. a timestamp with negative value) on counters, summaries or histograms. |
Not sure if that's what you meant here, but it reads weird 🤔
| if err != nil { | ||
| // Errors means ct == nil or invalid timestamp, which we silently ignore. | ||
| return nil |
There was a problem hiding this comment.
Hmmmm, ideally we would want to at least log client-side errors. Would it make sense to return an error as well?
There was a problem hiding this comment.
We could also change the proto definition to return int64 like callum suggested and not worry with conversion errors
There was a problem hiding this comment.
Return where? So by making int64 we don't magically remove error surface. We simply not check things. So why not doing the same here and simply not check things? (:
Also how do you propose changing field to int64 without braking changes? Do you propose adding another field (as we should in proto)? What would you do with other timestamp fields that have google Timestamp type? (:
There was a problem hiding this comment.
I understand that adding a new message is the right way of doing backward/forward compatibility, but since the proto change was released only in Prometheus 2.48 and seeing that proto changes weren't announced in the changelog, I don't expect that replacing the field will break anyone (even though this is a breaking change)
tsdb/agent/db.go
Outdated
| func (a *appender) AppendCreatedTimestamp(ref storage.SeriesRef, l labels.Labels, t int64) (storage.SeriesRef, error) { | ||
| func (a *appender) AppendCTZeroSample(ref storage.SeriesRef, l labels.Labels, t int64, ct int64) (storage.SeriesRef, error) { | ||
| // TODO(bwplotka): Implement | ||
| panic("to implement") |
There was a problem hiding this comment.
panic? 😱
Don't Prometheus server and agent share the same scrape codebase? Won't this cause agents to panic on every scrape? 🤔
There was a problem hiding this comment.
It will, but only for this PR. I planned to do extra work in this PR, but now I think we should postpone, so fixed in next commit to this PR (pushing in a 10m)
scrape/manager_test.go
Outdated
| // Disable regular scrapes. | ||
| ScrapeInterval: model.Duration(9999 * time.Minute), | ||
| ScrapeTimeout: model.Duration(5 * time.Second), | ||
| // Ensure proto is chosen. |
There was a problem hiding this comment.
Good point adding a comment here! Maybe we could extend by saying that we're choosing proto because at the time it is the only protocol that parses CT? We can remove this option once we add more protocols
| })) | ||
|
|
||
| // Start fake HTTP target to scrape returning a single metric. | ||
| once := sync.Once{} |
scrape/scrape.go
Outdated
| if ctMs := p.CreatedTimestamp(); sl.enableCTZeroIngestion && ctMs != nil { | ||
| ref, err = app.AppendCTZeroSample(ref, lset, t, *ctMs) | ||
| if err != nil && !errors.Is(err, storage.ErrOutOfOrderCT) { // OOO is a common case, ignoring completely for now. | ||
| // CT is an experimental feature. For now, we don't need to fail the | ||
| // scrape on errors updating the created timestamp, log debug. | ||
| level.Debug(sl.l).Log("msg", "Error when updating metadata in scrape loop", "series", string(met), "ct", *ctMs, "t", t, "err", err) |
There was a problem hiding this comment.
but the log message looks like a copy-paste mistake? We're not updating metadata yet, are we?
Signed-off-by: bwplotka <[email protected]>
Changes: