[release-3.7] fix: Remote-write: revert changes in the queue resharding logic #17412
[release-3.7] fix: Remote-write: revert changes in the queue resharding logic #17412bwplotka merged 4 commits intoprometheus:release-3.7from
Conversation
|
I'll need to fix that DCO |
…,histograms}_in_total and prometheus_remote_storage_highest_timestamp_in_seconds" This reverts commit ba14bc4. Signed-off-by: machine424 <[email protected]>
…anager level" This reverts commit 184c7eb. Signed-off-by: machine424 <[email protected]>
Signed-off-by: machine424 <[email protected]>
|
I'm really sorry to be a pain, but how we know if we reverted properly OR if we reverted correct thing (OR that somebody will not re-revert things back) if we have no regression test? Is it hard to write one? Can we write skipped one for now if the risk of flakiness is high? |
|
I thought the existing tests + manual testing + reviews should be enough to validate the revert. As mentioned, I'll try to pick this later on main. |
|
I think I can put together a higher level test. |
…uce the sharding scale up deadlock Signed-off-by: machine424 <[email protected]>
|
Test added, ran locally On main c36e966, the test fails On this branch, the test passes |
|
LGTM but I think we also want to revert the mixin alerts/dashboard changes from #17065 as well to use the original |
Now that the resharding has no deadlock, the new |
|
Thanks, taking a look |
| } | ||
|
|
||
| // TestRemoteWrite_ReshardingWithoutDeadlock ensures that resharding (scaling up) doesn't block when the shards are full. | ||
| // See: https://github.com/prometheus/prometheus/issues/17384. |
There was a problem hiding this comment.
| // See: https://github.com/prometheus/prometheus/issues/17384. | |
| // Regression test for https://github.com/prometheus/prometheus/issues/17384. |
bwplotka
left a comment
There was a problem hiding this comment.
Cool, I think it makes sense. New metrics and new alerts/dashboards to keep make sense, you don't revert them you just don't rely on those metrics for the desired sharding calculation 👍🏽
Can you fix commit/title message to be more clear? "reverting changes in queue sharding" is bit vague. We are reverting thing partially so let's be clear - I think what we do here is "fix resharding by reverting highestTimestamp move to queue_manager" or sth?
Thank for quick help on this! LGTM otherwise!
| } | ||
|
|
||
| checkInitialDesiredShardsOnce.Do(func() { | ||
| s, err := getMetricValue(t, bytes.NewReader(metrics), model.MetricTypeGauge, "prometheus_remote_storage_shards_desired") |
There was a problem hiding this comment.
This can introduce flakiness (resharding faster than this check) - I guess it's fine for now, no trivial way to improve it quickly. One way would be to start with default config that is fine with 1 shard, then scale capacity down to 1.
Would it be useful to add a quick // TODO about this idea?
There was a problem hiding this comment.
If you could share a run showing the flakiness I can think about making the test more reliable on main.
|
Thanks for taking a look. the revert is not only about highestTimestamp, thus the generic but correct PR title. (Details can be found in the description or by checking the changes direclty) If you think the PR isn’t ready yet, don’t hesitate to build on it. |
We're trying to ensure this will not cause more harm then helps. If we would simply revert whole commit, we wouldn't need to be picky. I think it's ready, you addressed my comments! |
|
I squashed when merging, sorry if this was important to preserve history. |
looking at the commits, one can see that I did perform a complete revert in the first two commits, then brought back the metrics only changes in a third commit and added the test in the forth one. |
|
Sure, but the effect is the same, eventually a partial revert was done in this set of commits, no? As a result, I had to deeply understand the change and challenge some details instead of just trusting that old thing was working (which is proven by 3.6 working fine). What am I missing? (: You are welcome to push back on my comments (which you did), my review is not perfect and I understand your sentiment to "just merge quickly". Apologies if you feel it's too detailed and time consuming but in my opinion either you do a proper, high quality, detailed review (and as a result a good change) or you don't review. Otherwise just more work, bugs, frustrations happens. Let me know what you think I should improve in my review style, I am keen to learn. Anyway, you did help amazingly to fix this AND addressed my comments (especially thanks for the test, it's game changing), so I'm happy. |
|
I could definitely notice multiple commits and review each separate commit, I guess it could be useful (GH does not make it easy though) and I could avoid squash merging 🙃 |
Which issue(s) does the PR fix:
Fixes #17384
Does this PR introduce a user-facing change?
I think we can keep the metrics deprecations as they still stand (we don't guarantee any stability on metrics anyway...).
I'll revisit this later as I still think that we can move the logic closer to the queues and instead of
all - droppedwe can consider"needed to be enqueued"/appended".I just reverted to the logic in 3.6, thus I didn't try to fix and run the test in https://github.com/prometheus/prometheus/pull/17407/files#diff-e55ce92fccc71019472af9eab3dd7c85fce2d2c6d520052bcc23b29ce7b962fbR2654, I'll revisit it later as well.