Remote write: Make HTTP 429 errors recoverable only when the user opts-in #8474
Description
We have to make the 429-retry mechanism in remote write opt-in; I'd prefer to expose it as a remote-write configuration flag (per remote-write URL).
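To make the proposal concrete, here is a rough sketch of what a per-URL opt-in could look like. The type `remoteWriteConfig`, the field `RetryOn429`, and the function `isRecoverable` are hypothetical names for illustration only, not existing Prometheus code, and the example URL is a placeholder. It assumes 5xx responses stay recoverable as they are today and that other 4xx responses are not retried.

```go
package main

import (
	"fmt"
	"net/http"
)

// remoteWriteConfig is a hypothetical per-URL remote-write config.
// RetryOn429 is an assumed flag name; it defaults to false (opt-in).
type remoteWriteConfig struct {
	URL        string
	RetryOn429 bool
}

// isRecoverable reports whether a failed send should be retried.
// Assumption: 5xx is always retried, 429 only when the user opted in,
// and every other status is treated as non-recoverable.
func isRecoverable(statusCode int, cfg remoteWriteConfig) bool {
	if statusCode/100 == 5 {
		return true
	}
	if statusCode == http.StatusTooManyRequests {
		return cfg.RetryOn429
	}
	return false
}

func main() {
	def := remoteWriteConfig{URL: "https://example.invalid/api/v1/push"}
	optIn := remoteWriteConfig{URL: def.URL, RetryOn429: true}

	fmt.Println(isRecoverable(http.StatusTooManyRequests, def))   // default: not retried
	fmt.Println(isRecoverable(http.StatusTooManyRequests, optIn)) // opted in: retried
	fmt.Println(isRecoverable(http.StatusServiceUnavailable, def))
	fmt.Println(isRecoverable(http.StatusBadRequest, optIn))
}
```

With the flag off by default, a backend that returns 429 for a limit that will never clear on its own (such as an active series limit) would not block the queue; users who know their backend only rate-limits temporarily could turn retries on for that URL.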
This is indeed a breaking change. From what I can see, we never drop data here? This means Prometheus will just keep falling behind if it is being constantly rate-limited? This assumes that Prometheus doesn't send data to a remote system that basically limits it constantly.
This is not true: we limit people in Grafana Cloud pretty regularly and they're completely fine with it. Further, in Cortex, we return 429 not just for the samples-per-second limit but for other cases too. For example, we have a limit on active series, and if a user sends more than their active series limit, we return a 429. In this case, Prometheus would never be able to proceed, because it would constantly try to send the sample that creates a new series and it would fail forever.
Now I am not sure if Cortex has the right behavior with 429 for other limits, it seemed pretty right when we built it ;) I still think it has the right behavior, but given Cortex has a major chunk of remote-write use-cases, we shouldn't roll out this change in Prometheus until we fix it in Cortex. WDYT @roidelapluie @csmarchbanks?
Originally posted by @gouthamve in #8237 (comment)