Remote write: Make HTTP 429 errors recoverable only when the user opts-in #8474

@roidelapluie

We have to make the 429-retry mechanism in remote write opt-in. I'd prefer it to be a remote-write configuration flag (set per remote-write URL).
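As a rough illustration of what such a per-URL opt-in could look like, here is a minimal Go sketch. The type `RemoteWriteConfig`, the field `RetryOnRateLimit`, and the function `isRecoverable` are hypothetical names invented for this example and are not taken from the Prometheus code base. The sketch also assumes that 5xx responses remain retryable and that other 4xx responses are not.

```go
package main

import "net/http"

// RemoteWriteConfig is a hypothetical per-URL remote-write configuration.
// RetryOnRateLimit is the assumed opt-in flag; it defaults to false.
type RemoteWriteConfig struct {
	URL              string
	RetryOnRateLimit bool
}

// isRecoverable reports whether a failed send to cfg.URL should be retried.
// Assumption: 5xx is always retried, 429 is retried only on opt-in,
// and every other status is treated as a permanent failure.
func isRecoverable(cfg RemoteWriteConfig, statusCode int) bool {
	switch {
	case statusCode/100 == 5:
		return true
	case statusCode == http.StatusTooManyRequests:
		return cfg.RetryOnRateLimit
	default:
		return false
	}
}
```

With a zero-value config, `isRecoverable(cfg, 429)` returns `false`, so rate-limited samples would be dropped unless the user sets the flag for that URL.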

This is indeed a breaking change. From what I can see, we never drop data here? This means Prometheus will just keep falling behind if it is being constantly rate-limited? This assumes that Prometheus doesn't send data to a remote system that basically limits it constantly.

This is not true; we limit people in GrafanaCloud pretty regularly and they're completely fine with it. Further, in Cortex, we return 429 not just for a samples-per-second limit but for other cases too. For example, we have a limit on active series, and if a user sends more than their active series limit, we return a 429. In this case, Prometheus would never be able to proceed, because it would keep trying to send the sample that creates a new series and it would fail forever.
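The active-series case can be sketched as a toy simulation in Go. Everything here is hypothetical: `sample`, `send`, and `drain` are made-up names, the backend is faked in-process, and `maxAttempts` stands in for "retry forever" so that the example terminates. It only illustrates how one permanently rejected sample blocks everything queued behind it when 429 is retried.

```go
package main

// sample is a hypothetical stand-in for one remote-write sample.
type sample struct {
	series    string
	newSeries bool // true if accepting it would create a new active series
}

// send simulates a backend that is already at its active-series limit:
// samples for existing series are accepted, new series always get a 429.
func send(s sample) int {
	if s.newSeries {
		return 429
	}
	return 200
}

// drain sends the queue in order, trying each sample at most maxAttempts times.
// It returns how many samples were delivered and whether the queue got stuck
// behind a sample that kept being rejected.
func drain(queue []sample, retryOn429 bool, maxAttempts int) (delivered int, stuck bool) {
	for _, s := range queue {
		sent := false
		for attempt := 0; attempt < maxAttempts; attempt++ {
			if send(s) == 200 {
				delivered++
				sent = true
				break
			}
			if !retryOn429 {
				break // treat 429 as non-recoverable: drop the sample and move on
			}
		}
		if !sent && retryOn429 {
			// Retrying never succeeds, so nothing behind this sample is sent.
			return delivered, true
		}
	}
	return delivered, false
}
```

For a queue of an existing series, a new series, and another existing series, `drain` with `retryOn429` set to `true` delivers only the first sample and reports the queue as stuck, while with `false` it drops the middle sample and delivers the other two.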

Now I am not sure if Cortex has the right behavior with 429 for other limits, it seemed pretty right when we built it ;) I still think it has the right behavior, but given Cortex has a major chunk of remote-write use-cases, we shouldn't roll out this change in Prometheus until we fix it in Cortex. WDYT @roidelapluie @csmarchbanks?

Originally posted by @gouthamve in #8237 (comment)
