Skip to content

Conversation

@andrewtoth
Copy link
Contributor

@andrewtoth andrewtoth commented Oct 15, 2022

Potentially fixed #17161.

It seems there's a race in ReadBlockFromDisk and UndoReadFromDisk where after the block/undo file position is retrieved the actual block/undo file could be removed during prune. This could just cause the read to fail, or it could prevent the file from being removed if the file is already open depending on the platform (https://en.cppreference.com/w/cpp/io/c/remove). This PR uses a shared mutex to lock the removal of files until any reads are completed. This way it doesn't block cs_main while reading so ReadBlockFromDisk usage can be moved outside of cs_main in some cases like #26308.

The first commit also updates the function signature of ReadRawBlockFromDisk to take a block index instead of the file position, and then the file position is retrieved while locking cs_main inside the function. That way it unifies it with ReadBlockFromDisk and can be moved outside cs_main scope inside net_processing in a follow up PR like #26326.

Also see #25232.

@andrewtoth andrewtoth force-pushed the block-read-shared-mutex branch from eccc18e to 5806421 Compare October 17, 2022 14:44
@DrahtBot
Copy link
Contributor

DrahtBot commented Oct 17, 2022

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type Reviewers
Concept ACK dergoegge

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #26415 (rpc,rest: faster getblock and rest_block by reading raw block by andrewtoth)
  • #26288 (Enable -Wstring-concatenation and -Wstring-conversion on clang builds by Empact)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

Copy link
Member

@dergoegge dergoegge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK

I am wondering if the current approach actually fixes the race mentioned in #17161. Reading the block file will fail if (a) the file was removed or (b) if CBlockIndex::nStatus does not have the BLOCK_HAVE_DATA bit set at the time of calling GetBlockPos() in ReadBlockFromDisk. In this PR you cover (a) but not (b), is that right? BlockManager::PruneOneBlockFile unsets the bit during pruning before UnlinkPrunedFiles is called, so it seems to like the race could still happen.

std::atomic_bool fReindex(false);
bool fPruneMode = false;
uint64_t nPruneTarget = 0;
std::shared_mutex cs_UnlinkFiles;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::shared_mutex cs_UnlinkFiles;
std::shared_mutex g_unlinkfiles_mutex;

Would be more inline with our naming conventions.

But generally I think we should be removing globals instead of adding more. In this case, we could make ReadBlockFromDisk, ReadRawBlockFromDisk, UndoReadFromDisk, UnlinkPrunedFiles methods of BlockManager and have the mutex owned by that as well. I have not scoped out how big of a refactor that would be but I think that would be preferable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All places have access to the BlockManager fairly easily so it should be straightforward refactor, with the exception of CZMQPublishRawBlockNotifier::NotifyBlock. I have no idea how I will get a handle to the BlockManager from there :/.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created #26375 to remove needing to read a block from CZMQPublishRawBlockNotifier::NotifyBlock entirely. After that or something like it the refactor can be done much more easily. Marking as draft until that is merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refactor should be after and separate from the fix.

@andrewtoth
Copy link
Contributor Author

andrewtoth commented Oct 19, 2022

Reading the block file will fail if (a) the file was removed or (b) if CBlockIndex::nStatus does not have the BLOCK_HAVE_DATA bit set at the time of calling GetBlockPos() in ReadBlockFromDisk. In this PR you cover (a) but not (b), is that right? BlockManager::PruneOneBlockFile unsets the bit during pruning before UnlinkPrunedFiles is called, so it seems to like the race could still happen.

Yes, that sounds right to me. However, after this change #17161 will not really be a problem. Since getting the block position inside ReadBlockFromDisk will return an empty position if nStatus does not have BLOCK_HAVE_DATA, the read will fail. Having the read fail should be fine as long as the error can be handled gracefully (like RPC/REST/ZMQ/net_processing) but in validation code cs_main should remain locked so it can't happen.

The real issue this solves is on Windows where if after pindex->GetBlockPos() is called and the file is opened then the files are unlinked. This will cause the file to not be removed. From https://en.cppreference.com/w/cpp/io/c/remove

If the file is currently open by the current or another process, the behavior of this function is implementation-defined

I think we should just avoid that happening altogether since "implementation-defined" sounds scary. This was suggested in #11913 (review) and mentioned in #13903 (comment) as well.

Another strategy could be to not unset CBlockIndex::nFile and CBlockIndex::nDataPos when calling BlockManager::PruneOneBlockFile and then not check for BLOCK_HAVE_DATA in CBlockIndex::GetBlockPos(). That way a read will still succeed even if the block is set to be pruned but the file has not been removed. Then nFile and nDataPos can be zeroed in UnlinkPrunedFiles. But that seems like a bigger change for not much more benefit.

@luke-jr
Copy link
Member

luke-jr commented Nov 3, 2022

This seems like only a partial fix, since we should be checking BLOCK_HAVE_DATA within the lock. Otherwise, there's still a race after we check it and the block is read, where the block could be pruned.

OTOH, maybe we should just handle missing block files more gracefully than a failed assertion...

@andrewtoth
Copy link
Contributor Author

Closing in favor of #26533.

@andrewtoth andrewtoth closed this Nov 18, 2022
@andrewtoth andrewtoth deleted the block-read-shared-mutex branch August 17, 2023 20:40
@bitcoin bitcoin locked and limited conversation to collaborators Aug 16, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CBlockIndex::nStatus race when pruning

4 participants