[PERF] TSDB: MemPostings: keep a map of label values slices #15426
Conversation
/prombench main
Looks good; couple of thoughts on style.
tsdb/index/postings.go
Outdated
	p.mtx.Unlock()
	p.mtx.RLock()
	p.mtx.RUnlock() //nolint:staticcheck // SA2001: this is an intentionally empty critical section.
	time.Sleep(time.Millisecond)
	p.mtx.Lock()
To improve readability, extract to function. Take most of the comments above with it.
I tried extracting this:
// unlockWaitAndLockAgain will unlock the already write-locked p.mtx and then wait a little bit before locking it again,
// letting the RLock()-waiting goroutines get the lock.
func (p *MemPostings) unlockWaitAndLockAgain() {
	p.mtx.Unlock()
	// While it's tempting to just do a `time.Sleep(time.Millisecond)` here,
	// it wouldn't ensure that readers actually were able to get the read lock,
	// because if there are writes waiting on the same mutex, readers won't be able to get it.
	// So we just grab one RLock ourselves.
	p.mtx.RLock()
	// We shouldn't wait here, because we would be blocking a potential write for no reason.
	// Note that if there's a writer waiting for us to unlock, no reader will be able to get the read lock.
	p.mtx.RUnlock() //nolint:staticcheck // SA2001: this is an intentionally empty critical section.
	// Now we can wait a little bit just to increase the chance of a reader getting the lock.
	time.Sleep(time.Millisecond)
	p.mtx.Lock()
}
But it's not inlined. I'm afraid that paying for a function call just to unlock here might defeat the purpose, wdyt?
At least the Sleep must be much more expensive than a function call, since it hits the scheduler.
I was worried about the function call before the unlock, discussed offline, agreed that Unlock is going to be expensive anyway if there's someone waiting, so pushed: d44ea7e
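For context, here is a hedged sketch of the kind of caller such a helper serves; the method name and batch logic below are illustrative assumptions (not the actual Prometheus code), assuming it lives next to `unlockWaitAndLockAgain` in `tsdb/index/postings.go`:

```go
// Hypothetical caller: a long write-locked loop that periodically yields the
// lock so that RLock() callers are not starved for the whole duration.
func (p *MemPostings) mutateInBatchesSketch(work []func()) {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	for i, mutate := range work {
		mutate() // each closure mutates p.m / p.lvs under the write lock
		if (i+1)%1024 == 0 {
			// Give queued readers a chance to run before taking the lock back.
			p.unlockWaitAndLockAgain()
		}
	}
}
```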
tsdb/index/postings.go
Outdated
@@ -57,13 +59,15 @@ var ensureOrderBatchPool = sync.Pool{
 type MemPostings struct {
 	mtx sync.RWMutex
 	m   map[string]map[string][]storage.SeriesRef
+	lvs map[string][]string
Suggest commenting `m` and `lvs` here, also the rules on locking around `lvs`.
Done in e61c8c9
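For reference, a paraphrased sketch of what the documented struct might look like after that change; the comments and field set below are an approximation, not copied from e61c8c9, and other fields are omitted:

```go
type MemPostings struct {
	mtx sync.RWMutex

	// m holds the postings lists for each label name and value,
	// indexed first by label name, then by label value.
	m map[string]map[string][]storage.SeriesRef

	// lvs holds the label values for each label name.
	// The slices stored here are append-only and must only be appended to
	// while holding mtx for writing; once a reference to a slice is taken
	// under the lock, its existing elements can be read without the lock.
	lvs map[string][]string
}
```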
Pushed one more commit, reducing the mutex time in
/prombench cancel
Benchmark cancel is in progress.
While investigating lock contention on `MemPostings`, we saw that lots of locking is happening in `LabelValues` and `PostingsForLabelsMatching`, both copying the label values slices while holding the mutex. This adds an extra map that holds an append-only label values slice for each one of the label names. Since the slice is append-only, it can be copied without holding the mutex. Signed-off-by: Oleg Zaytsev <[email protected]>
e9676c0 to c73ca37
/prombench main
Benchmark tests are running for 3 days! If this is intended, ignore this message; otherwise, you can cancel it by commenting:
/prombench cancel
Benchmark cancel is in progress.
This backports the following PRs: - prometheus/prometheus#15426 - grafana/mimir-prometheus#751 - grafana/mimir-prometheus#763 Signed-off-by: Oleg Zaytsev <[email protected]>
Updates mimir-prometheus weekly-r319 to grafana/mimir-prometheus#779 This includes the cherry-pick of prometheus/prometheus#15426 Signed-off-by: Oleg Zaytsev <[email protected]>
…-15426 [weekly-r319] MemPostings: keep a map of label values slices (prometheus#15426)
While investigating lock contention on `MemPostings`, we saw that lots of locking is happening in `LabelValues` and `PostingsForLabelsMatching`, both copying the label values slices while holding the mutex.

This adds an extra map that holds an append-only label values slice for each one of the label names. Since the slice is append-only, it can be copied without holding the mutex.
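As a minimal sketch of the read pattern this enables (the helper below is hypothetical, not the exact `LabelValues` implementation):

```go
// labelValuesSketch shows the pattern: take the slice reference under a short
// read lock, then copy it with the lock released. Because the slice is
// append-only, the elements visible at the captured length never change.
func (p *MemPostings) labelValuesSketch(name string) []string {
	p.mtx.RLock()
	values := p.lvs[name]
	p.mtx.RUnlock()

	out := make([]string, len(values))
	copy(out, values)
	return out
}
```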
We have observed no or negligible changes when deploying this to our dev and normal production-like environments: no memory increase and no change in CPU usage.
However, when deployed to a TSDB that has ~60 series created per second, with spikes of 300, and serves thousands of rule queries per second, we saw a 4x reduction in p99 query time, a 10x reduction in average append time, and a 10-50x reduction in mutex wait time.
Unfortunately, I can't share all the details about this environment; I'll share as much as I can, redacting the sensitive parts.
The following graphs correspond to a Mimir env, with `zone-a` running with this PR, while `zone-b` and `zone-c` run from `main`.

Sum of mutex wait time per zone:
[graph omitted]
Average append latency:
[graph omitted]
Average read latency (includes all read operations):
[graph omitted]
I wish I could provide meaningful numbers from a Prometheus instance, but we don't run a Prometheus instance big enough to see a difference.