[meta] PROM-35 Remote Write (PRW2.0) Stability #16944
Description
Context
This is an umbrella issue collecting things we plan to implement for the stable Prometheus Remote Write 2.0 (PRW2.0) protocol and Prometheus implementation; also described by the PROM-35 proposal. Supersedes #1310 and #14360.
Anyone is welcome to comment or DM me, @cstyan or @saswatamcode to add/update/remove things from this list!
Acceptance Criteria
This issue will be closed when we are able to mark PRW2.0 as stable and can start the official process of planned deprecation and switching the default PRW method to V2 instead of V1. That "stability" depends, in general, on:
a) The acceptable state (functionality and performance) of the Prometheus receive and export (write) Remote Write 2 implementation.
b) Some adoption for (at least) receiving in the ecosystem, by systems like Cortex, Thanos, Mimir and/or vendors.
We expect close to zero production traffic until (a) and (b) are met. The majority of the traffic will also not switch until we mark the protocol stable, change the defaults, or at least announce the v1 deprecation.
Tasks
List of tasks towards stability.
Protocol
- Decide what to do with "Timestamp on exemplar is MUST" @cstyan
- Decide on delta type improvement (specifically CT/ST per sample addition) @fionaliao @carrieedwards
- A New Relic dev asked in the past what to do with the WriteResponse when the parsing is async. We might want to extend the protocol to cover this case (e.g. a "Written" header with an "N/A" value or so).
- Finish proposal (remote write 2.0 - formal proposal #13887) @bwplotka
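For the async parsing item above, here is a minimal sender-side sketch of what handling an "N/A" value could look like. The three `X-Prometheus-Remote-Write-*-Written` header names are the ones from the PRW 2.0 spec; the "N/A" sentinel is only the idea floated in this list and is not part of the spec today, and `parse_written` is a made-up helper name for illustration. Reporting a missing header as "unknown" is also an assumption of this sketch, not spec behaviour.

```python
from typing import Dict, Mapping, Optional

# Response headers defined by the PRW 2.0 spec for written-element counts.
WRITTEN_HEADERS = {
    "samples": "X-Prometheus-Remote-Write-Samples-Written",
    "histograms": "X-Prometheus-Remote-Write-Histograms-Written",
    "exemplars": "X-Prometheus-Remote-Write-Exemplars-Written",
}

# Hypothetical sentinel for receivers that parse asynchronously.
# NOT part of the spec; only the option discussed in this issue.
NOT_AVAILABLE = "N/A"


def parse_written(headers: Mapping[str, str]) -> Dict[str, Optional[int]]:
    """Return written counts per element kind; None means 'unknown'."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v.strip() for k, v in headers.items()}
    result: Dict[str, Optional[int]] = {}
    for kind, name in WRITTEN_HEADERS.items():
        raw = lowered.get(name.lower())
        if raw is None or raw == NOT_AVAILABLE:
            # Assumption of this sketch: missing and "N/A" both map to unknown.
            result[kind] = None
        else:
            result[kind] = int(raw)
    return result
```

A sender could then skip its "written vs sent" accounting for any kind that comes back as `None`, instead of treating it as zero written.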
Prometheus Implementation
- Ensure type-and-unit feature works with RW ([meta] PROM-39 type-and-unit-labels stability #16610).
- CT/STs are working with RW (ST: Prometheus propagates ST in Remote-Write 2.0 #14220) @bwplotka
- Document behaviour (remote write 2.0 - decide whether addition of metadata should count towards max samples in write request #14407)
- Prometheus uses the official RW client on receive (thanks to @pipiland2612!)
- Prometheus uses the official RW client on the sender side @saswatamcode.
- This has been attempted, but the special handling for age-limited retries and the lack of a sent-bytes metric are blocking. Something to consider adding to the lib so that it can be used in Prometheus, but not a blocker for stability. [RW] Use remoteapi.Write function to sendSamplesWithBackoff #17373
- Ensure new RW features work in agent mode (remote write 2.0 - test for agent mode with metadata-wal-records and type-and-unit features #13483).
- PRW1 metadata parity ([RW2] metadata.send feature parity with v1 #17191) (in progress [RW2.0] Enable fast metadata look up for Remote Write 2.0 #17436)
- Known bugs fixed (Remote Write 2.0 returns 500 when Native Histograms are sent but the feature is disabled #17181, Remote Write V2 and converted classic histograms to NHCB #17075)
- Make the exemplar implementation follow the spec (bug(sending RW2): RW2 sends disconnected exemplars without samples #17857).
Best-effort small tech debt tasks, might be worth descoping from 2.0:
- Address TODOs in Remote Write Prometheus code and tests (remote write 2.0 - ensure unit tests are all properly updated #13479).
- DRY code (remote write 2.0 - refactor append functions, build write request, etc. to reduce code duplication #13481, remote write 2.0 - DRY the queue manager code #14409).
Adoption / Documentation / Integrations
- Add 2.0 support to compliance test; make it easy to test write and receive implementations (remote_write_sender: Add support for Remote Write 2.0 compliance#101).
- Official benchmark numbers for the protocol itself (Remote Write 2.0: Benchmarks vs 1.0 and OTLP #14253) (FYI prombench already benchmarks the implementation).
- Ship the blog post on compliance. cc @pipiland2612 ([Remote Write] Write blog post about new compliance tests and library advancements. #17603)
- Ship the blog post on RW stability? Blocked on benchmarks; we have some drafts already (Remote Write 2.0: Write a blog post (when stable) #14254).
- Gather latest/contribute/help with PRW 2.0 in Thanos cc @saswatamcode
- Gather latest/contribute/help with PRW 2.0 in Cortex cc @alanprot
- Gather latest/contribute/help with PRW 2.0 in Mimir cc @krajorama
- Gather latest adoption state from vendors/providers (I've heard some rumours about Google, Grafana Cloud and New Relic; to double check).
Other, general Remote Write work
In the past, in #6333, we tried to capture general Remote Write implementation tasks. I don't think it makes sense to track those in a meta issue that has no specific milestone definition other than completing a list of tasks; also, the list is pretty old and out of date.
I went through #6333 items, triaged them (#8809, #6733, #6934, #7218, #8779) and ensured the relevant tasks are on GH under component/remote storage label.
Out of scope for this ticket/2.0
- Metadata in WAL related tickets, likely closable, especially less relevant given metadata-wal-records being problematic and discouraged (IMO)
- Metadata vs max samples limit (remote write 2.0 - ensure the metadata is not counted against max samples in a write request #13480).
- metadata + cache (remote write 2.0 - ensure metadata cache in queue manager is cleaned up #13482)
- Consider new compression (remote write 2.0 - implement additional compressions #13366)
NOTE: this is likely to be moved out of 2.0 scope. We haven't heard strong feedback that this is needed or that it is trivial to do. The VictoriaMetrics team voted for zstd at some point, but Grafana engineers complained about CPU use on the receiving side. We can always add it in 2.1; let's ask around.