-
Notifications
You must be signed in to change notification settings - Fork 38.6k
use shared mutex to guard against block files being removed before read #26316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1be79d5 to
eccc18e
Compare
eccc18e to
5806421
Compare
5806421 to
ad67092
Compare
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ReviewsSee the guideline for information on the review process.
ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
dergoegge
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK
I am wondering if the current approach actually fixes the race mentioned in #17161. Reading the block file will fail if (a) the file was removed or (b) if CBlockIndex::nStatus does not have the BLOCK_HAVE_DATA bit set at the time of calling GetBlockPos() in ReadBlockFromDisk. In this PR you cover (a) but not (b), is that right? BlockManager::PruneOneBlockFile unsets the bit during pruning before UnlinkPrunedFiles is called, so it seems to like the race could still happen.
| std::atomic_bool fReindex(false); | ||
| bool fPruneMode = false; | ||
| uint64_t nPruneTarget = 0; | ||
| std::shared_mutex cs_UnlinkFiles; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| std::shared_mutex cs_UnlinkFiles; | |
| std::shared_mutex g_unlinkfiles_mutex; |
Would be more inline with our naming conventions.
But generally I think we should be removing globals instead of adding more. In this case, we could make ReadBlockFromDisk, ReadRawBlockFromDisk, UndoReadFromDisk, UnlinkPrunedFiles methods of BlockManager and have the mutex owned by that as well. I have not scoped out how big of a refactor that would be but I think that would be preferable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All places have access to the BlockManager fairly easily so it should be straightforward refactor, with the exception of CZMQPublishRawBlockNotifier::NotifyBlock. I have no idea how I will get a handle to the BlockManager from there :/.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created #26375 to remove needing to read a block from CZMQPublishRawBlockNotifier::NotifyBlock entirely. After that or something like it the refactor can be done much more easily. Marking as draft until that is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The refactor should be after and separate from the fix.
Yes, that sounds right to me. However, after this change #17161 will not really be a problem. Since getting the block position inside The real issue this solves is on Windows where if after
I think we should just avoid that happening altogether since "implementation-defined" sounds scary. This was suggested in #11913 (review) and mentioned in #13903 (comment) as well. Another strategy could be to not unset |
|
This seems like only a partial fix, since we should be checking OTOH, maybe we should just handle missing block files more gracefully than a failed assertion... |
|
Closing in favor of #26533. |
Potentially fixed #17161.
It seems there's a race in
ReadBlockFromDiskandUndoReadFromDiskwhere after the block/undo file position is retrieved the actual block/undo file could be removed during prune. This could just cause the read to fail, or it could prevent the file from being removed if the file is already open depending on the platform (https://en.cppreference.com/w/cpp/io/c/remove). This PR uses a shared mutex to lock the removal of files until any reads are completed. This way it doesn't blockcs_mainwhile reading soReadBlockFromDiskusage can be moved outside ofcs_mainin some cases like #26308.The first commit also updates the function signature of
ReadRawBlockFromDiskto take a block index instead of the file position, and then the file position is retrieved while lockingcs_maininside the function. That way it unifies it withReadBlockFromDiskand can be moved outsidecs_mainscope insidenet_processingin a follow up PR like #26326.Also see #25232.