[PERF] TSDB: MemPostings: keep a map of label values slices #15426

Conversation

@colega (Contributor) commented Nov 20, 2024

While investigating lock contention on MemPostings, we saw that lots of locking is happening in LabelValues and PostingsForLabelsMatching, both copying the label values slices while holding the mutex.

This adds an extra map that holds an append-only label values slice for each one of the label names. Since the slice is append-only, it can be copied without holding the mutex.
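A minimal sketch of the mechanism (the field names match the diff reviewed below; the LabelValues body here is illustrative, not the PR's exact code, and assumes the sync and storage imports):

type MemPostings struct {
	mtx sync.RWMutex
	// m: label name -> label value -> series refs.
	m map[string]map[string][]storage.SeriesRef
	// lvs: label name -> append-only slice of that name's values.
	// Appends happen while holding mtx for writing; existing elements
	// are never modified, so a reader may copy them without the lock.
	lvs map[string][]string
}

func (p *MemPostings) LabelValues(name string) []string {
	p.mtx.RLock()
	values := p.lvs[name] // take only the slice header under the lock
	p.mtx.RUnlock()

	// values[0:len(values)] can no longer change, so copying it
	// doesn't need the mutex.
	res := make([]string, len(values))
	copy(res, values)
	return res
}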

Deploying this to our dev and typical production-like environments, we observed no or only negligible changes: no memory increase and no change in CPU usage.

However, when deployed to a TSDB that creates ~60 series per second (with spikes of 300) and serves thousands of rule queries per second, we saw a 4x reduction in p99 query time, a 10x reduction in average append time, and a 10-50x reduction in mutex wait time.

Unfortunately, I can't share all the details about this environment; I'll share as much as I can, redacting the sensitive parts.

The following graphs correspond to a Mimir environment where zone-a runs this PR while zone-b and zone-c run main.

Sum of mutex wait time per zone: [graph]

Average append latency: [graph]

Average read latency (includes all read operations): [graph]

I wish I could provide meaningful numbers from a Prometheus instance, but we don't run one big enough to show a difference.

@colega requested a review from jesusvazquez as a code owner, November 20, 2024 08:52
@jesusvazquez (Member):

/prombench main

@prombot (Contributor) commented Nov 20, 2024

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-15426 and main

After successful deployment, the benchmarking results can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

@bboreham (Member) left a comment:

Looks good; a couple of thoughts on style.

Comment on lines 353 to 357
p.mtx.Unlock()
p.mtx.RLock()
p.mtx.RUnlock() //nolint:staticcheck // SA2001: this is an intentionally empty critical section.
time.Sleep(time.Millisecond)
p.mtx.Lock()
@bboreham (Member):

To improve readability, extract this to a function and take most of the comments above with it.

@colega (Contributor, Author):
I tried extracting this:

// unlockWaitAndLockAgain unlocks the already write-locked p.mtx and waits a little bit before locking it again,
// letting the RLock()-waiting goroutines get the lock.
func (p *MemPostings) unlockWaitAndLockAgain() {
	p.mtx.Unlock()
	// While it's tempting to just do a `time.Sleep(time.Millisecond)` here,
	// it wouldn't ensure that readers were actually able to get the read lock,
	// because if there are writes waiting on the same mutex, readers won't be able to get it.
	// So we just grab one RLock ourselves.
	p.mtx.RLock()
	// We shouldn't wait here, because we would be blocking a potential write for no reason.
	// Note that if there's a writer waiting for us to unlock, no reader will be able to get the read lock.
	p.mtx.RUnlock() //nolint:staticcheck // SA2001: this is an intentionally empty critical section.
	// Now we can wait a little bit just to increase the chance of a reader getting the lock.
	time.Sleep(time.Millisecond)
	p.mtx.Lock()
}

But it's not inlined. I'm afraid that paying for a function call just to unlock here might defeat the purpose, wdyt?
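For context, a hypothetical call site for this helper (the PR's actual caller may differ):

// Hypothetical: a long write-locked operation that periodically yields
// the lock so that blocked readers can make progress.
func (p *MemPostings) processInChunks(keys []string) {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	for i, k := range keys {
		_ = k // ... mutate the postings for k under the write lock ...
		if (i+1)%1024 == 0 {
			p.unlockWaitAndLockAgain() // let readers in between chunks
		}
	}
}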

@bboreham (Member):
At least the Sleep must be much more expensive than a function call, since it hits the scheduler.
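A rough way to sanity-check that claim with go test (the names here are made up for illustration; numbers vary by platform):

package main

import (
	"testing"
	"time"
)

//go:noinline
func emptyCall() {}

// A non-inlined call costs on the order of nanoseconds...
func BenchmarkNoinlineCall(b *testing.B) {
	for i := 0; i < b.N; i++ {
		emptyCall()
	}
}

// ...while a millisecond sleep goes through the scheduler and timer
// machinery, making it several orders of magnitude more expensive.
func BenchmarkMillisecondSleep(b *testing.B) {
	for i := 0; i < b.N; i++ {
		time.Sleep(time.Millisecond)
	}
}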

@colega (Contributor, Author):

I was worried about the function call before the unlock. We discussed offline and agreed that Unlock is going to be expensive anyway if there's someone waiting, so I pushed: d44ea7e

@@ -57,13 +59,15 @@ var ensureOrderBatchPool = sync.Pool{
type MemPostings struct {
	mtx sync.RWMutex
	m   map[string]map[string][]storage.SeriesRef
	lvs map[string][]string
@bboreham (Member):

Suggest commenting m and lvs here, along with the rules on locking around lvs.

@colega (Contributor, Author):
Done in e61c8c9

@colega (Contributor, Author) commented Nov 21, 2024

Pushed one more commit, reducing the time the mutex is held in MemPostings.Symbols() (I've seen a profile of goroutines waiting on it during compaction).
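The commit itself isn't quoted here; one plausible shape for such a change, sketched under the same append-only-lvs assumption (not necessarily what the commit does, and assuming the sort package):

// Hypothetical sketch: copy only the lvs slice headers under the read
// lock, then build and sort the symbol set without holding it.
func (p *MemPostings) symbolsSketch() []string {
	p.mtx.RLock()
	lvs := make(map[string][]string, len(p.lvs))
	for name, values := range p.lvs {
		lvs[name] = values // headers only; elements are append-only
	}
	p.mtx.RUnlock()

	seen := map[string]struct{}{}
	for name, values := range lvs {
		seen[name] = struct{}{}
		for _, v := range values {
			seen[v] = struct{}{}
		}
	}
	res := make([]string, 0, len(seen))
	for s := range seen {
		res = append(res, s)
	}
	sort.Strings(res) // sorting happens outside the critical section
	return res
}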

@jesusvazquez (Member):

/prombench cancel

@prombot (Contributor) commented Nov 22, 2024

Benchmark cancel is in progress.

@colega force-pushed the mempostings-with-a-map-of-label-values-slices branch from e9676c0 to c73ca37 on November 22, 2024 10:37
@jesusvazquez (Member):

/prombench main

@prombot (Contributor) commented Nov 22, 2024

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-15426 and main

After successful deployment, the benchmarking results can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

@bboreham (Member):

We restarted Prombench because CPU was noticeably higher.
This time it is less noticeable, but for instance in the last 15 mins it is ~5% higher.


I cannot see anything in Parca to say why.

@prombot (Contributor) commented Nov 25, 2024

Benchmark tests are running for 3 days! If this is intended ignore this message otherwise you can cancel it by commenting: /prombench cancel

@bboreham (Member):

We didn't find anything; let's call it close enough.

@bboreham (Member):

/prombench cancel

@prombot (Contributor) commented Nov 25, 2024

Benchmark cancel is in progress.

@bboreham changed the title from "MemPostings: keep a map of label values slices" to "[PERF] TSDB: MemPostings: keep a map of label values slices" on Nov 25, 2024
@jesusvazquez merged commit cd1f8ac into prometheus:main on Nov 29, 2024
28 checks passed
colega added six commits to grafana/mimir that referenced this pull request Nov 29, 2024
colega added a commit to grafana/mimir that referenced this pull request Dec 2, 2024

Updates mimir-prometheus weekly-r319 to grafana/mimir-prometheus#779

This includes the cherry-pick of prometheus/prometheus#15426

Signed-off-by: Oleg Zaytsev <[email protected]>
julienduchesne pushed a commit to julienduchesne/prometheus that referenced this pull request Dec 13, 2024
julienduchesne pushed a commit to julienduchesne/prometheus that referenced this pull request Dec 13, 2024

[weekly-r319] MemPostings: keep a map of label values slices (prometheus#15426)
andrejbranch pushed two commits to andrejbranch/prometheus that referenced this pull request Feb 27, 2025