
Fix locks in db.reloadBlocks() #12920

Merged: 5 commits merged into prometheus:main on Feb 10, 2025

Conversation

prymitive (Contributor)

This partially reverts ae3d392.

ae3d392 added a call to db.mtx.Lock() that lasts for the entire duration of db.reloadBlocks(); previously db.mtx was locked only during the critical part of db.reloadBlocks(). The motivation was to protect against races: 9e0351e#r555699794. The 'reloads' being mentioned are (I think) reloadBlocks() calls, rather than db.reload() or other methods. TestTombstoneCleanRetentionLimitsRace was added to catch this, but I was never able to get any error out of it, even after disabling all calls to db.mtx in reloadBlocks() and CleanTombstones(). To make things more complicated, CleanTombstones() itself calls reloadBlocks(), so it seems that the real issue is that we might have concurrent calls to reloadBlocks().

The problem with this change is that db.reloadBlocks() can take a very long time, because it might need to load very large blocks from disk, which is slow. While db.mtx is locked, a large part of the db is blocked, including queries, since the db.mtx read lock is needed for the db.Querier() call. One way this manifests is a gap in all metrics and blocked queries just after a large block compaction happens. When compaction merges multiple day-or-more blocks into a week-or-more block, it creates a single very big block. After that block is written it needs to be loaded, which seems to take many seconds (30-45), during which mtx is held and everything is blocked.

It turns out that there is another lock that is more fine-grained and aimed at this specific use case:

    // cmtx ensures that compactions and deletions don't run simultaneously.
    cmtx sync.Mutex

All calls to reloadBlocks() are wrapped inside the cmtx lock. The only exception is db.reload(), which this change fixes. We can't take the cmtx lock inside reloadBlocks() itself because it's called by a number of functions, some of which already hold cmtx.

Looking at the code, I think it is sufficient to hold cmtx and skip the reloadBlocks()-wide mtx lock.
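
To illustrate the intended pattern, here is a minimal, self-contained sketch (not the actual Prometheus code: the DB and Block types are trimmed stand-ins and openBlocks is a hypothetical helper): db.reload() takes db.cmtx before calling reloadBlocks(), and reloadBlocks() only takes db.mtx around the short swap of the in-memory block list, so the slow disk reads no longer block queries.

    package tsdbsketch

    import "sync"

    // Minimal stand-ins for illustration only; the real tsdb.DB has many more
    // fields and its reload path does much more (retention, corruption checks,
    // metrics, closing old blocks, ...).
    type Block struct{ dir string }

    type DB struct {
        mtx    sync.RWMutex // guards db.blocks (and more in the real code)
        cmtx   sync.Mutex   // ensures compactions and deletions don't run simultaneously
        dir    string
        blocks []*Block
    }

    // reload takes db.cmtx so it cannot race with any other reloadBlocks() caller.
    func (db *DB) reload() error {
        db.cmtx.Lock()
        defer db.cmtx.Unlock()
        return db.reloadBlocks()
    }

    // reloadBlocks expects db.cmtx to be held by the caller.
    func (db *DB) reloadBlocks() error {
        // Slow part: opening (potentially huge) blocks from disk runs without
        // db.mtx, so db.Querier() calls are not blocked while this happens.
        loadable, err := openBlocks(db.dir)
        if err != nil {
            return err
        }

        // Fast part: only the in-memory swap of the block list needs db.mtx.
        db.mtx.Lock()
        db.blocks = loadable
        db.mtx.Unlock()
        return nil
    }

    // openBlocks is a hypothetical placeholder for the real block-loading code.
    func openBlocks(dir string) ([]*Block, error) {
        return []*Block{{dir: dir}}, nil
    }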

prymitive (Contributor, Author) commented Oct 2, 2023

TestTombstoneCleanRetentionLimitsRace does fail in CI, which is great because I couldn't get it to fail at all. That's helpful.

But that only confirms that TestTombstoneCleanRetentionLimitsRace does work in CI.
It's not unexpected for it to fail after removing db.mtx.Lock() from within reloadBlocks(), because that function is not protected by any lock.
That is okay as long as every function that calls reloadBlocks() puts the lock in place.

I did check that every function calling reloadBlocks() uses db.cmtx, with the only exception of db.reload(), which this PR fixes. So I don't think that TestTombstoneCleanRetentionLimitsRace is a realistic test: it tests something that shouldn't ever happen, and passing it requires adding extra, expensive locks.

I think we should remove TestTombstoneCleanRetentionLimitsRace, or replace the calls to reloadBlocks() inside it with calls to some other function that could race against CleanTombstones().

What do you think, @jesusvazquez?

prymitive marked this pull request as draft on October 2, 2023 14:49
prymitive marked this pull request as ready for review on October 2, 2023 15:24
prymitive (Contributor, Author) commented Oct 9, 2023

I deployed this patch on the 4th of October to some of our production instances where we had previously observed high latency on query API calls just after compaction runs.
The results can be seen in the screenshot below: we no longer see big spikes in API timings.

[Screenshot from 2023-10-09 10-08-16]

roidelapluie (Member) previously approved these changes Oct 17, 2023 and left a comment:


LGTM, from your explanation and my understanding of the code

prymitive force-pushed the compactLock branch 2 times, most recently from c5c3e70 to 1b8cf6a on October 18, 2023 09:19
prymitive (Contributor, Author):

I think I've updated all places that read db.blocks to have a read lock.
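
For reference, a minimal sketch of that read-lock pattern, reusing the stand-in DB type from the earlier sketch (blockDirs is a hypothetical example, not a function touched by this PR):

    // blockDirs only reads db.blocks, so db.mtx.RLock() is sufficient and
    // concurrent readers do not block each other.
    func (db *DB) blockDirs() []string {
        db.mtx.RLock()
        defer db.mtx.RUnlock()

        dirs := make([]string, 0, len(db.blocks))
        for _, b := range db.blocks {
            dirs = append(dirs, b.dir)
        }
        return dirs
    }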

bboreham (Member) commented Apr 7, 2024

Are we ok just removing the test? I read through the comments and it sounds like you hoped to find a way to test the race.

prymitive (Contributor, Author):

> Are we ok just removing the test? I read through the comments and it sounds like you hoped to find a way to test the race.

I'm not sure if that's a question for me or for a reviewer, but I do believe that this test is not valid and should be removed (as per my previous comment).

bboreham (Member):

Hello from the bug-scrub! It looks like this was approved and ready to go, but now it has a small merge conflict.

I'll also link #14610 which is touching a similar area.

prymitive (Contributor, Author):

> Hello from the bug-scrub! It looks like this was approved and ready to go, but now it has a small merge conflict.
>
> I'll also link #14610 which is touching a similar area.

There are no approvals here at the moment, and the last comment was a bit vague, so I'm not sure if it was for me or someone else to answer.

bboreham (Member):

Approvals get removed when a change is pushed.

prymitive (Contributor, Author):

Rebased.

bboreham (Member) left a comment:


I would prefer to see the functions involved in this PR all document whether they need certain lock(s) on entry, like is already done in a few places with // The compaction mutex should be held before calling this method. (although I would spell that db.cmtx).

Do you know the lock hierarchy between DB.mtx, DB.cmtx and DB.autoCompactMtx? I.e. which ones you must lock first, if you're going to lock more than one of them?

prymitive and others added 4 commits January 9, 2025 17:05
This partially reverts ae3d392.

ae3d392 added a call to db.mtx.Lock() that lasts for the entire duration of db.reloadBlocks();
previously db.mtx was locked only during the critical part of db.reloadBlocks().
The motivation was to protect against races:
prometheus@9e0351e#r555699794
The 'reloads' being mentioned are (I think) reloadBlocks() calls, rather than db.reload() or other methods.
TestTombstoneCleanRetentionLimitsRace was added to catch this, but I was never able to get any error out of it, even after disabling all calls to db.mtx in reloadBlocks() and CleanTombstones().
To make things more complicated, CleanTombstones() itself calls reloadBlocks(), so it seems that the real issue is that we might have concurrent calls to reloadBlocks().

The problem with this change is that db.reloadBlocks() can take a very long time, because it might need to load very large blocks from disk, which is slow.
While db.mtx is locked, a large part of the db is blocked, including queries, since the db.mtx read lock is needed for the db.Querier() call.
One way this manifests is a gap in all metrics and blocked queries just after a large block compaction happens.
When compaction merges multiple day-or-more blocks into a week-or-more block, it creates a single very big block.
After that block is written it needs to be loaded, which seems to take many seconds (30-45), during which mtx is held and everything is blocked.

It turns out that there is another lock that is more fine-grained and aimed at this specific use case:

    // cmtx ensures that compactions and deletions don't run simultaneously.
    cmtx sync.Mutex

All calls to reloadBlocks() are wrapped inside the cmtx lock. The only exception is db.reload(), which this change fixes.
We can't take the cmtx lock inside reloadBlocks() itself because it's called by a number of functions, some of which already hold cmtx.

Looking at the code, I think it is sufficient to hold cmtx and skip the reloadBlocks()-wide mtx lock.

Signed-off-by: Łukasz Mierzwa <[email protected]>
This test ensures that running db.reloadBlocks() and db.CleanTombstones() at the same time doesn't race.
The problem is that CleanTombstones() is a public method while reloadBlocks() is internal.
CleanTombstones() takes the db.cmtx lock, while reloadBlocks() is not protected by any locks at all; it expects the public method through which it was called to do that.
So a race between these two is not unexpected and we shouldn't really be testing for it.
db.cmtx ensures that no other function can be modifying the list of open blocks, so the scenario tested here cannot happen.
If it did happen, it would only be because some other method doesn't acquire the db.cmtx lock, which is something this test cannot detect.

Signed-off-by: Łukasz Mierzwa <[email protected]>
We don't hold the db.mtx lock when reading db.blocks here, so we need a read lock around this loop.

Signed-off-by: Łukasz Mierzwa <[email protected]>
Compact() is an exported (uppercase) function that handles locking on its own, so we shouldn't hold a lock around the call to it.

Signed-off-by: Lukasz Mierzwa <[email protected]>
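
For context, a small sketch of what that means in practice, again using the stand-in DB type from the earlier sketch (the real Compact() signature and body differ): Compact() takes db.cmtx itself, so wrapping a call to it in another db.cmtx.Lock() would deadlock.

    // Compact is an exported entry point that takes db.cmtx on its own;
    // callers must therefore not already hold db.cmtx when calling it.
    func (db *DB) Compact() error {
        db.cmtx.Lock()
        defer db.cmtx.Unlock()
        // ... run compaction, then refresh the in-memory block list:
        return db.reloadBlocks()
    }
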
prymitive (Contributor, Author):

> I would prefer to see the functions involved in this PR all document whether they need certain lock(s) on entry, like is already done in a few places with // The compaction mutex should be held before calling this method. (although I would spell that db.cmtx).

I've added it where missing, plus changed compaction mutex to db.cmtx.
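
For illustration, this is roughly how such a requirement reads as a doc comment (a sketch only; the real function's comment wording and signature may differ):

    // reloadBlocks reloads the on-disk blocks.
    // The db.cmtx mutex should be held before calling this method.
    func (db *DB) reloadBlocks() (err error) {
        // ...
        return nil
    }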

> Do you know the lock hierarchy between DB.mtx, DB.cmtx and DB.autoCompactMtx? I.e. which ones you must lock first, if you're going to lock more than one of them?

I do not know that, sorry. All I know is what I can see in the code and in ae3d392.

prymitive (Contributor, Author):

Test failure seems unrelated:

=== NAME  TestDelayedCompaction
    compact_test.go:2015: 
        	Error Trace:	D:/a/prometheus/prometheus/tsdb/compact_test.go:1949
        	            				D:/a/prometheus/prometheus/tsdb/compact_test.go:2015
        	Error:      	"2.9755205s" is not less than "2s"
        	Test:       	TestDelayedCompaction

bboreham (Member) left a comment:


Sorry for the delay; I think we can go ahead with this.

bboreham merged commit b74cebf into prometheus:main on Feb 10, 2025
26 checks passed
prymitive deleted the compactLock branch on February 10, 2025 17:36