
MemPostings.Delete(): make pauses to unlock and let the readers read #15242

Merged
jesusvazquez merged 3 commits into prometheus:main from colega:mempostings-delete-with-pauses on Nov 5, 2024

Conversation

@colega (Contributor) commented Oct 29, 2024

This introduces back some unlocking that was removed in #13286 but in a more balanced way, as suggested by @pracucci.

For TSDBs with a lot of churn, Delete() can take a couple of seconds, and while it's holding the mutex, reads and writes are blocked waiting for that mutex, increasing the number of connections handled and memory usage.

This implementation pauses every 4K labels processed (note that also compared to #13286 we're not processing all the label-values anymore, but only the affected ones, because of #14307), makes sure that it's possible to get the read lock, and waits for a few milliseconds more.

@colega force-pushed the mempostings-delete-with-pauses branch from 0fee8a8 to b11d1df on October 29, 2024 16:55
@pracucci (Contributor) left a comment

Nice job. LGTM.

@codesome (Member) previously approved these changes Oct 29, 2024 and left a comment

LGTM. But can we test it somewhere before merging to ~verify if this is doing the intended thing?

@machine424 (Member) commented Oct 30, 2024

I've always wanted to give go_sync_mutex_wait_total_seconds_total a try.

Using main as a base (adding #15239 on top, then reverting it and adding #15242 on top), I got:

[screenshot: go_sync_mutex_wait_total_seconds_total comparison, 2024-10-30]

I see this patch reduces lock contention, but for this setup I'm also seeing the benefit of the concurrency that was introduced in #15239 and then reverted.

The 3 instances were fed with:

./avalanche \
  --gauge-metric-count=98 \
  --counter-metric-count=111 \
  --histogram-metric-count=2 \
  --histogram-metric-bucket-count=8 \
  --native-histogram-metric-count=0 \
  --summary-metric-count=9 \
  --summary-metric-objective-count=4 \
  --series-count=50 \
  --value-interval=300 \
  --series-interval=300 \
  --metric-interval=300 \
  --port=9001

and queried with:

while true; do
  curl -G http://localhost:8080/api/v1/query --data-urlencode 'query={__name__=~".+"}' &
  curl -G http://localhost:8080/api/v1/query --data-urlencode 'query={__name__=~".+"}' &

  curl -G http://localhost:8888/api/v1/query --data-urlencode 'query={__name__=~".+"}' &
  curl -G http://localhost:8888/api/v1/query --data-urlencode 'query={__name__=~".+"}' &

  curl -G http://localhost:9091/api/v1/query --data-urlencode 'query={__name__=~".+"}' &
  curl -G http://localhost:9091/api/v1/query --data-urlencode 'query={__name__=~".+"}' &

  sleep 1
done

The Prometheus instances were run with GOMEMLIMIT=5GiB GOMAXPROCS=8.

@colega (Contributor, Author) commented Nov 1, 2024

Thank you for testing this, @machine424. We also tested this in our Mimir environment (which runs the same TSDB), and after experimenting a little, I pushed another change that pauses more often: every 512 deleted series.

Note that the previous change looked well under normal load, but in scenarios where there's a lot of series churn (we have instances with more than 750K series churning) the parallel approach is still not enough, as we'd still block all reads for seconds.

I'll try to gather some graphs of our test on Monday, but so far I'd say this is looking good.

colega added a commit to colega/prometheus that referenced this pull request Nov 2, 2024
The change introduced in prometheus#14307 was great for the performance of
MemPostings.Delete(): we know which labels we have to touch, so we just
grab the lock once and keep processing them until we're done.

While this is the best for MemPostings.Delete(), it's much worse for
other users of that mutex: readers and new series writes. These now have
to wait until we've deleted all the affected postings, which are
potentially millions. Our operation isn't so urgent, but we're not
letting other users grab the mutex.

While prometheus#15242 proposes a solution for this by performing small pauses
from time to time and letting other callers take the mutex, it still
doesn't address the elephant in the room: we're doing too much work
while holding the write mutex, and that's exclusive time for us. We can
spread it out over time, decreasing the impact, but the overall
exclusive time stays the same, and we should try to address that.

What's the long operation we're doing while holding the write mutex?
We're filtering the postings list for each label by building a new one,
looking up in a hashmap whether each of its elements is deleted. These
lists can have tens or hundreds of thousands of elements each.

Why don't we build that list while holding just an RLock()? Well,
there's a small issue with that: a new series might be added to the list
after we've built the filtered replacement postings list, but before we
take the write mutex. The good news is that postings are always appended
to the list in order, and we are the only ones who delete things from
it, so by just checking whether the list grew, we can know for sure
whether we're missing something.

Moreover we don't even need to rebuild the entire list if something was
added, we just need to add the extra elements that the list has now
compared to our snapshot.

Finally, one last thing to address: with this approach we'd be taking
the write mutex once per affected label value again, which is a lot; it
caused overly long compactions under heavy read pressure, and we
mitigated that in the past. We also can't just build all the replacement
lists and then swap them in one go: we would be referencing too much
memory. It is true that while we're swapping, some reader may still
reference the old slice, but not at an arbitrary scale: for a 2M series
TSDB, labels which reference all series (like the typical env=prod) have
16MB of postings each (2M * 8B).

So we process labels in batches, with at most 128 labels per batch (so
ideally we take the write mutex once per 128 labels) and a maximum of
10*len(allpostings) postings per batch (in our example that would be an
extra 160MB allocated temporarily, which should be negligible in a 2M
series instance).

Signed-off-by: Oleg Zaytsev <[email protected]>
@colega (Contributor, Author) commented Nov 2, 2024

While this improves the situation, we've seen that it's not a perfect solution and we still see some impacts on the reads while doing the compaction. I would like us to consider a different approach: #15310

Edit: that one still needs some more love, so I think we can proceed with merging this one, while I polish the other one.

@colega (Contributor, Author) commented Nov 4, 2024

As promised, some screenshots. We've deployed the new image around Oct 31st at 13 UTC:

This is the go_sync_mutex_wait_total_seconds_total metric:

[screenshot: go_sync_mutex_wait_total_seconds_total]

And the p99 reads latency has improved as well:

[screenshot: p99 read latency]

@pracucci (Contributor) left a comment

Thanks Oleg for this work! It really helped a lot (as the screenshots show).

@jesusvazquez (Member) left a comment

LGTM. It's a straightforward change that has brought good results for downstream projects using the TSDB.

I see that both Thanos and Mimir have already taken this commit in so I see no reason to hold it further.

@jesusvazquez jesusvazquez merged commit b1e4052 into prometheus:main Nov 5, 2024
@colega (Contributor, Author) commented Nov 5, 2024

@machine424:

I've always wanted to give go_sync_mutex_wait_total_seconds_total a try.

Let's export that metric in Prometheus too (I was looking at it in Mimir, where it has been exported for a year).

#15339

juliusv pushed a commit that referenced this pull request Nov 5, 2024
narqo added a commit to grafana/mimir that referenced this pull request Nov 13, 2024
@verejoel commented
Is there a chance we could backport this into 2.55? That way we could also patch this into Thanos quicker without having to upgrade to v3 of Prometheus.

verejoel pushed a commit to verejoel/prometheus that referenced this pull request Nov 27, 2024
verejoel pushed a commit to open-ch/prometheus that referenced this pull request Dec 17, 2024
yuchen-db pushed a commit to yuchen-db/prometheus that referenced this pull request Apr 30, 2025
6 participants