Skip to content

Fix transaction mutation race#95009

Merged
alexey-milovidov merged 8 commits intomasterfrom
fix-transaction-mutation-race
Jan 26, 2026
Merged

Fix transaction mutation race#95009
alexey-milovidov merged 8 commits intomasterfrom
fix-transaction-mutation-race

Conversation

@alexey-milovidov
Copy link
Copy Markdown
Member

@alexey-milovidov alexey-milovidov commented Jan 23, 2026

When a transaction that started a mutation commits before the background mutation job has applied the mutation to all parts, tryGetTransactionForMutation returns null because the transaction is no longer in the running list.

Previously, this incorrectly threw a LOGICAL_ERROR exception: "Cannot find transaction ... that has started mutation ... that is going to be applied to part ..."

Now, when the transaction is not found, we check the csn field of the mutation entry:

  • If csn == Tx::RolledBackCSN: Transaction rolled back, skip the mutation
  • If csn == Tx::UnknownCSN: Throw error (transaction neither running nor committed)
  • Otherwise: Transaction committed, proceed with mutation without the transaction pointer

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Closes #94424

alexey-milovidov and others added 3 commits January 23, 2026 22:20
When a transaction that started a mutation commits before the background
mutation job has applied the mutation to all parts, `tryGetTransactionForMutation`
returns null because the transaction is no longer in the running list.

Previously, this incorrectly threw a LOGICAL_ERROR exception:
"Cannot find transaction ... that has started mutation ... that is going to be applied to part ..."

Now, when the transaction is not found, we check the `csn` field of the mutation entry:
- If `csn == Tx::RolledBackCSN`: Transaction rolled back, skip the mutation
- If `csn == Tx::UnknownCSN`: Throw error (transaction neither running nor committed)
- Otherwise: Transaction committed, proceed with mutation without the transaction pointer

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Jan 23, 2026

Workflow [PR], commit [83a9091]

Summary:

@clickhouse-gh clickhouse-gh bot added the pr-ci label Jan 23, 2026
alexey-milovidov and others added 3 commits January 23, 2026 23:22
Use partitioning (PARTITION BY key % 5) to keep multiple parts instead of
SYSTEM STOP MERGES, because stopping merges also stops mutations.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The test uses `BEGIN TRANSACTION` which requires `allow_experimental_transactions`
to be enabled. This tag skips the test in CI configurations where transactions
are not supported.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@alexey-milovidov
Copy link
Copy Markdown
Member Author

While this sounds reasonable, there are some doubts.

alexey-milovidov and others added 2 commits January 26, 2026 00:28
Distributed DDL queries (like mutations) inside transactions are not
supported with `DatabaseReplicated`, so this test needs to be skipped
in that configuration.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@alexey-milovidov alexey-milovidov self-assigned this Jan 26, 2026
@alexey-milovidov alexey-milovidov merged commit 52b5165 into master Jan 26, 2026
133 of 134 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-transaction-mutation-race branch January 26, 2026 03:16
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jan 26, 2026
@tavplubix
Copy link
Copy Markdown
Member

When a transaction that started a mutation commits before the background mutation job has applied the mutation to all parts

This should not be possible. If this happens - it's a bug somewhere else. How can we commit an operation before this operation is finished?

@alexey-milovidov
Copy link
Copy Markdown
Member Author

alexey-milovidov commented Feb 9, 2026

@tavplubix, download the report, and read server logs to reconstruct the exact failing scenario.
You can cross-correlate it with logs from dozens of similar reports from CI.

@alexey-milovidov
Copy link
Copy Markdown
Member Author

Actual answer here: #94424 (comment)

@tavplubix
Copy link
Copy Markdown
Member

download the report, and read server logs to reconstruct the exact failing scenario.

I did it here as you can see: #94424 (comment)

Actual answer here: #94424 (comment)

@alexey-milovidov, this answer actually proves that your PR doesn't fix the root cause of the issue, it just hides it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Logical error: Cannot find transaction A that has started mutation B that is going to be applied to part C (STID: 3968-4fab)

3 participants