Fix transaction mutation race #95009
When a transaction that started a mutation commits before the background mutation job has applied the mutation to all parts, `tryGetTransactionForMutation` returns null because the transaction is no longer in the running list.

Previously, this incorrectly threw a LOGICAL_ERROR exception: "Cannot find transaction ... that has started mutation ... that is going to be applied to part ..."

Now, when the transaction is not found, we check the `csn` field of the mutation entry:
- If `csn == Tx::RolledBackCSN`: Transaction rolled back, skip the mutation
- If `csn == Tx::UnknownCSN`: Throw error (transaction neither running nor committed)
- Otherwise: Transaction committed, proceed with mutation without the transaction pointer

Co-Authored-By: Claude Opus 4.5 <[email protected]>
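For readers skimming the thread, the three-way check described in this commit message can be sketched roughly as below. This is only an illustration, not the code from the PR: `decideMutationAction`, `MutationAction`, the `transaction_is_running` flag, and the numeric values given to `Tx::UnknownCSN` and `Tx::RolledBackCSN` are assumptions made so that the snippet compiles on its own.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical stand-ins for the real ClickHouse definitions.
// The concrete values below are placeholders chosen for this sketch.
using CSN = std::uint64_t;

namespace Tx
{
    constexpr CSN UnknownCSN = 0;
    constexpr CSN RolledBackCSN = std::numeric_limits<CSN>::max();
}

// Hypothetical result type: what the background job should do with the part.
enum class MutationAction
{
    ApplyWithTransaction,     // transaction is still in the running list
    ApplyWithoutTransaction,  // transaction already committed
    Skip                      // transaction rolled back
};

// transaction_is_running models "tryGetTransactionForMutation returned non-null".
MutationAction decideMutationAction(bool transaction_is_running, CSN mutation_csn)
{
    if (transaction_is_running)
        return MutationAction::ApplyWithTransaction;

    // Transaction not found: fall back to the csn stored in the mutation entry.
    if (mutation_csn == Tx::RolledBackCSN)
        return MutationAction::Skip;

    if (mutation_csn == Tx::UnknownCSN)
        throw std::logic_error("Transaction is neither running nor committed");

    // Any other csn means the transaction committed before the job got here.
    return MutationAction::ApplyWithoutTransaction;
}
```

With these placeholder values, a call such as `decideMutationAction(false, 42)` takes the "committed" branch, which is the case that previously ended in the LOGICAL_ERROR.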
…ansaction-mutation-race
Use partitioning (PARTITION BY key % 5) to keep multiple parts instead of SYSTEM STOP MERGES, because stopping merges also stops mutations. Co-Authored-By: Claude Opus 4.5 <[email protected]>
The test uses `BEGIN TRANSACTION` which requires `allow_experimental_transactions` to be enabled. This tag skips the test in CI configurations where transactions are not supported. Co-Authored-By: Claude Opus 4.5 <[email protected]>
While this sounds reasonable, there are some doubts.
Distributed DDL queries (like mutations) inside transactions are not supported with `DatabaseReplicated`, so this test needs to be skipped in that configuration. Co-Authored-By: Claude Opus 4.5 <[email protected]>
This should not be possible. If this happens - it's a bug somewhere else. How can we commit an operation before this operation is finished?
@tavplubix, download the report, and read server logs to reconstruct the exact failing scenario.
Actual answer here: #94424 (comment)
I did it here as you can see: #94424 (comment)
@alexey-milovidov, this answer actually proves that your PR doesn't fix the root cause of the issue, it just hides it |
When a transaction that started a mutation commits before the background mutation job has applied the mutation to all parts, `tryGetTransactionForMutation` returns null because the transaction is no longer in the running list.

Previously, this incorrectly threw a LOGICAL_ERROR exception: "Cannot find transaction ... that has started mutation ... that is going to be applied to part ..."

Now, when the transaction is not found, we check the `csn` field of the mutation entry:
- `csn == Tx::RolledBackCSN`: Transaction rolled back, skip the mutation
- `csn == Tx::UnknownCSN`: Throw error (transaction neither running nor committed)
- Otherwise: Transaction committed, proceed with mutation without the transaction pointer

Changelog category (leave one):
Closes #94424