
Remote write loses samples when the target is unavailable #14087

@roastiek

Description

What did you do?

I'm using remote write to send samples to a receiver that can be temporarily down while it is being updated.

What did you expect to see?

All samples should eventually be written once the receiver is running again.

What did you see instead? Under which circumstances?

After upgrading from 2.50.1 to 2.51.2, the receiver started missing samples for the periods when it was not running. I checked this by querying some metric samples in both Prometheus and the receiver. I can see a drop in prometheus_remote_storage_samples_total, but nothing in the prometheus_remote_storage_samples_dropped_total or prometheus_remote_storage_samples_failed_total metrics. So the samples were probably never sent at all and were simply skipped.

I suspect the changes made in #13583, specifically the tail bool parameter that is shared between

func (w *Watcher) watch(segmentNum int, tail bool) error {
and
func (w *Watcher) readSegment(r *LiveReader, segmentNum int, tail bool) error {
In the first function it is used to tail the WAL, and in the second it is used to skip reading the samples of a checkpoint. Before the change, it stayed true once set. Now, however, it can revert to false when sample processing is paused for some time and then resumed.

Reverting back to 2.50.1 fixed this.

System information

No response

Prometheus version

2.51.2

Prometheus configuration file

No response

Alertmanager version

No response

Alertmanager configuration file

No response

Logs

No response
