Skip to content

Fix race in storage join#21009

Merged
vdimir merged 9 commits intoClickHouse:masterfrom
vdimir:fix-race-storage-join
Feb 26, 2021
Merged

Fix race in storage join#21009
vdimir merged 9 commits intoClickHouse:masterfrom
vdimir:fix-race-storage-join

Conversation

@vdimir
Copy link
Copy Markdown
Member

@vdimir vdimir commented Feb 20, 2021

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Bug Fix

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

  • Block parallel insertions into storage join

Fix race introduced in PR #20079

@vdimir vdimir added the pr-not-for-changelog This PR should not be mentioned in the changelog label Feb 20, 2021
@robot-clickhouse robot-clickhouse added pr-bugfix Pull request with bugfix, not backported by default and removed pr-not-for-changelog This PR should not be mentioned in the changelog labels Feb 20, 2021
@vdimir vdimir changed the title Try to fix race in storage join: block parallel inserts Fix race in storage join Feb 20, 2021
@vdimir vdimir marked this pull request as ready for review February 20, 2021 18:02
@vdimir vdimir added pr-not-for-changelog This PR should not be mentioned in the changelog and removed pr-not-for-changelog This PR should not be mentioned in the changelog labels Feb 24, 2021
@vdimir
Copy link
Copy Markdown
Member Author

vdimir commented Feb 24, 2021

test_storage_kafka/test.py::test_kafka_no_holes_when_write_suffix_failed not related, failed in master during past three days

@KochetovNicolai KochetovNicolai self-assigned this Feb 24, 2021
@KochetovNicolai
Copy link
Copy Markdown
Member

Some considerations:

  • std::shared_lock<std::shared_mutex> storage_join_lock is not needed in HashJoin. HashJoin should not be responsible for external synchronization.
  • JoinSource is an internal class for HashJoin. It need to get HashJoinPtr + std::shared_lock<std::shared_mutex> storage_join_lock in constructor (btw, why this lock was removed?)
  • Function JoinGet does not need to have access to HashJoinPtr. We can add StorageJoin::joinGet, which takes std::shared_lock<std::shared_mutex> storage_join_lock and calls HashJoin::joinGet. Lock will be taken for each block, but it is ok (and we will have less chances to have a deadlock).

Also, IMHO, joinGet function should be a member of HashJoin::RightTableData (but this is not important for this pr).

/// Access the innards.
HashJoinPtr & getJoin() { return join; }
std::shared_ptr<HashJoinHolder> getJoin() { return std::make_shared<HashJoinHolder>(rwlock, join); }
HashJoinPtr getJoin(std::shared_ptr<TableJoin> analyzed_join) const;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method should actually return HashJoinHolder (looks a little bit more logical then having shared_lock inside HashJoin) .

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed that it will requires changes in ExpressionAnalyzer, it will have to handle this lock. So decided to keep lock in HashJoin.

Move joinGet into StorageJoin

Protect JoinSource with lock, add test

Add comments about locking logic
@vdimir
Copy link
Copy Markdown
Member Author

vdimir commented Feb 25, 2021

@KochetovNicolai

std::shared_lock<std::shared_mutex> storage_join_lock is not needed in HashJoin. HashJoin should not be responsible for external synchronization.

We discussed that it will requires changes in ExpressionAnalyzer, it will have to handle this lock. So decided to keep lock in HashJoin and add mote detailed comments to code.

JoinSource is an internal class for HashJoin. It need to get HashJoinPtr + std::shared_lock<std::shared_mutex> storage_join_lock in constructor (btw, why this lock was removed?)

Yes, lock removed accidentally, I fixed it and added into test.

Function JoinGet does not need to have access to HashJoinPtr. We can add StorageJoin::joinGet, which takes std::shared_lock<std::shared_mutex> storage_join_lock and calls HashJoin::joinGet. Lock will be taken for each block, but it is ok (and we will have less chances to have a deadlock).

I moved joinGet into StorageJoin, and take lock for each block, because used_flags are not required in this method.

public:
ExecutableFunctionJoinGet(HashJoinPtr join_, const DB::Block & result_columns_)
: join(std::move(join_)), result_columns(result_columns_) {}
ExecutableFunctionJoinGet(StorageJoinPtr storage_join_, const DB::Block & result_columns_)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also need to hold TableLockHolder here along with storage.


{
std::unique_lock lock(data->rwlock);
assert(storage_join_lock.mutex() == nullptr);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I think exception will be better here, cause it is not a class invariant.

Comment on lines +36 to +37
DataTypePtr joinGetCheckAndGetReturnType(const DataTypes & data_types, const String & column_name, bool or_null) const;
ColumnWithTypeAndName joinGet(const Block & block, const Block & block_with_columns_to_add) const;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need some comments for this methods.

@vdimir vdimir merged commit 78024ae into ClickHouse:master Feb 26, 2021
@vdimir vdimir deleted the fix-race-storage-join branch February 26, 2021 11:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants