-
Notifications
You must be signed in to change notification settings - Fork 10.3k
[PRW 2.0 spec] Sample Write Confirmation in Response #14359
Description
Proposal
cc @cstyan @alexgreenbank @npazosmendez
Problem
In the PRW 2.0 spec, the request Content-Type tells Receiver what proto to use to parse the message.
This works great if everyone implements the spec properly. However, reality kicks in and bugs happens. After updating our PRW 2.0 Prometheus branch
we noticed a common issue:
🐛 Receivers (especially 1.0) don't check Content-Type header. This is literally what Prometheus receive handler is doing pre-2.0 changes
The same in Thanos receiving.
It's was understandable as in 1.0 those headers were hardcoded so receivers also didn't care to check those.
The consequence is that newer clients could decide to send a write request encoded as RW2.0 proto to a Receiver that only reads 1.0 proto. In this case we would expecting the Receiver to fail with at least 400 (ideally 415).
Currently the above will cause the "broken" 1.0 Receiver to unmarshal v2 payload with v1 proto. Thanks to 2.0-rc.1 change in spec, this at least does not corrupt the writes (or give underminitic error or empty payload), but always returns empty message.
Unfortunately this will result likely in "no write" and 2xx, while client might think the data was written.
Specifically in Thanos this would result in debug log and 200
The same with Prometheus.
This problem also exists in other broken Receiver cases (wrong proto is used), broken Senders (sending v2 payload, but content type v1), but also if we would add more field to e.g. TimeSeries proto in future (e.g. samplesv2) (the same problem exists in 1.0 already).
Potential Solution
Use Response headers to send number of samples, exemplars, histogram samples ingested.
This is actually very close to #14323, just reversed and using HTTP response header (simpler).
One could establish response headers like
X-Prometheus-Remote-Write-Stats-Series 1
X-Prometheus-Remote-Write-Stats-Samples 12144
X-Prometheus-Remote-Write-Stats-Histograms 1123
X-Prometheus-Remote-Write-Stats-Exemplars 123
Those has benefit of:
- Confirming what was written
- Solving remote_write client metrics: a partially failed batch is counted as all samples failed #14323 cc @rfratto
- Giving server side stats
- Still having room for response
This could MUST or SHOULD. WDYT?
Alternatives
Fail on empty request (e.g empty symbol or no timeseries)
One way would be to add MUST/SHOULD to spec to fail on empty request. This would help with new Receivers accidently parsing wrong protos/payloads. However there are 2 cons:
- This does not fixes 1.0 Receivers.
- Empty request was used in the past as a nice "gateway" for probing Receivers.
- This actually makes proto forward/backward compatibility in future tricky, because old Receivers (e.g. 2.0) would fail if no TimeSeries are sent, but one day we might have more fields than timeseries and want to put it in 2.x (stretch?)
Do Nothing
In some way, this only happens on "broken" situation and once Sender ensures Receiver checks content-type properly and uses correct proto (tests/manual checks) it would be perhaps fine?
🐛