storage/remote: compute highestTimestamp and dataIn at QueueManager level #17065

machine424 merged 2 commits into prometheus:main from
Conversation
Still needs a double check and some extra tests, but wdyt @cstyan?
Great idea! 👍 thanks for doing this. I think there are a few things we need to verify, but we can likely get rid of those other two values you've pointed out. What we want users to be able to do is differentiate between "Prometheus is scraping samples and writing them to the WAL, but the WAL reader for remote write (per queue) is stuck for some reason" vs. "this queue manager is seeing samples but is deciding to drop them all". I think we can still get that information in other ways, but we should verify. As far as the reduction in complexity for
Force-pushed 466cffc to 19fbc90
Thanks Callum for taking a look. The most likely failure cases I can think of are the watcher failing, or the queue being full (because pending samples can't be sent). Both of these are already covered by existing metrics that expose such issues. So yes, you're right: we'll kind of swap some false positives (because we don't take relabelling into account) for some possible false negatives, but I think the new "shadow areas" can be clarified by the existing metrics. That said, we can keep the metrics at the storage level for now; 4 series per Prometheus instance isn't a big deal. We can always reconsider this in the future.
cc @cstyan if you can take another look.
Force-pushed 8ae3d9d to 9779d29
Yeah, I think you're right; it's just been a long time since I last looked at the data we have here in any detail. I think the top level "most recent scrape timestamp" via

I think we also have a proxy for "samples are being scraped and written to/read from the WAL, we're just choosing to drop all of them (via relabel) for queue X, and that's why you're not seeing data in your remote storage" with the metrics in the watcher for records read plus the queues' dropped samples metric. The dropped samples metric even has a label to differentiate between unexpectedly dropped samples and samples dropped due to relabelling. I think we're okay here 🤔
Yes, because what gets sent is what the watcher sends to the queues; that's what we need to be monitoring. And to be even more prudent (even though there are no stability guarantees for metrics), I'll only mark them as deprecated, propose the alternatives, and postpone the removal to another PR. |
…evel

Because of relabelling, an endpoint can only select a subset of the series that go through WriteStorage. Having a highestTimestamp at WriteStorage level yields wrong values if the corresponding sample won't even make it to a remote queue. Currently PrometheusRemoteWriteBehind is based on that, and would fire if an endpoint is only interested in a subset of series that take time to appear.

A "prometheus_remote_storage_queue_highest_timestamp_seconds" metric that only takes into account samples in the queue is introduced, and used in PrometheusRemoteWriteBehind and in the dashboards in documentation/prometheus-mixin.

The same applies to samplesIn/dataIn: QueueManager knows better when to update those, namely when data is enqueued. That makes dataDropped unnecessary, and thus helps simplify the logic in QueueManager.calculateDesiredShards().

Signed-off-by: machine424 <[email protected]>
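For illustration, with a queue-level highest timestamp that carries the same per-queue labels as the existing `prometheus_remote_storage_queue_highest_sent_timestamp_seconds`, the alert expression could be restated roughly along these lines. This is only a sketch: the threshold, `for` duration, and exact expression shipped in documentation/prometheus-mixin may differ, and the `ignoring(remote_name, url) group_right` matching the old storage-level metric required should no longer be needed since both sides now share the queue labels.

```yaml
# Sketch of a queue-level PrometheusRemoteWriteBehind rule (illustrative values).
- alert: PrometheusRemoteWriteBehind
  expr: |
    (
      max_over_time(prometheus_remote_storage_queue_highest_timestamp_seconds[5m])
    -
      max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m])
    ) > 120
  for: 15m
  labels:
    severity: critical
```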
…ams}_in_total and prometheus_remote_storage_highest_timestamp_in_seconds

Signed-off-by: machine424 <[email protected]>
Force-pushed 792147c to ba14bc4
cc @cstyan
Partial cherry-pick of prometheus#17065 to make it easier to backport to older versions. The new metric is "prometheus_remote_storage_queue_highest_timestamp_seconds".
Because of relabelling, an endpoint can only select a subset of the series that go through WriteStorage.
Having a highestTimestamp at WriteStorage level yields wrong values if the corresponding sample won't even make it to a remote queue.
Currently PrometheusRemoteWriteBehind is based on that, and would fire if an endpoint is only interested in a subset of series that take time to appear.
A "prometheus_remote_storage_queue_highest_timestamp_seconds" metric that only takes into account samples in the queue is introduced, and used in PrometheusRemoteWriteBehind and in the dashboards in documentation/prometheus-mixin.
The same applies to samplesIn/dataIn: QueueManager knows better when to update those, namely when data is enqueued.
That makes dataDropped unnecessary, and thus helps simplify the logic in QueueManager.calculateDesiredShards().
Which issue(s) does the PR fix:
Does this PR introduce a user-facing change?