fix(core): fix table reading timeout error after non-wal table dropped and re-created#6095

Merged
ideoma merged 5 commits into master from
fix-reader-open-timeout-on-scoreboard-lock
Sep 2, 2025

@ideoma
Collaborator

@ideoma ideoma commented Sep 1, 2025

Found by fuzz tests.
When a non-WAL table is dropped and re-created, purge jobs can push max txn using TxnScoreboardPoolV2.isRangeAvailable() and it can lead to a timeout on opening TableReader, where it cannot lock the latest transaction in the scoreboard.

The fix is to stop modifying the max in TxnScoreboardPoolV2.isRangeAvailable(), making it a read-only scoreboard operation.
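
To make the difference concrete, here is a minimal toy model of a range check that does and does not touch the max. It is not QuestDB's TxnScoreboardPoolV2: the class `ToyScoreboard`, the method `isRangeAvailableMutating`, and the rule that a txn below the max cannot be acquired are all assumptions made for illustration, based only on the behaviour described in this PR.

```java
import java.util.TreeMap;

// Toy model only. NOT QuestDB's TxnScoreboardPoolV2; every name here is hypothetical.
final class ToyScoreboard {
    // txn -> number of readers currently holding it
    private final TreeMap<Long, Integer> readerCounts = new TreeMap<>();
    // highest txn the scoreboard has been told about
    private long max = 0;

    // Assumed rule: a reader can only lock a txn that is not below max.
    synchronized boolean acquireTxn(long txn) {
        if (txn < max) {
            return false;
        }
        max = txn;
        readerCounts.merge(txn, 1, Integer::sum);
        return true;
    }

    synchronized void releaseTxn(long txn) {
        Integer count = readerCounts.get(txn);
        if (count == null) {
            return;
        }
        if (count == 1) {
            readerCounts.remove(txn);
        } else {
            readerCounts.put(txn, count - 1);
        }
    }

    // Read-only check: true when no reader holds a txn in [from, to).
    synchronized boolean isRangeAvailable(long from, long to) {
        return readerCounts.subMap(from, to).isEmpty();
    }

    // Models the problematic variant: the check also pushes max up to `to`.
    synchronized boolean isRangeAvailableMutating(long from, long to) {
        boolean free = isRangeAvailable(from, to);
        if (free && to > max) {
            max = to;
        }
        return free;
    }

    synchronized long getMax() {
        return max;
    }

    public static void main(String[] args) {
        ToyScoreboard mutated = new ToyScoreboard();
        mutated.isRangeAvailableMutating(0, 13);
        System.out.println("mutating check: max=" + mutated.getMax()
                + ", acquire(0)=" + mutated.acquireTxn(0));

        ToyScoreboard readOnly = new ToyScoreboard();
        readOnly.isRangeAvailable(0, 13);
        System.out.println("read-only check: max=" + readOnly.getMax()
                + ", acquire(0)=" + readOnly.acquireTxn(0));
    }
}
```

In this model, a purge-style call with a stale range leaves the scoreboard unusable for a reader at txn 0 only when the check mutates the max; the read-only variant answers the same question without side effects.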

@puzpuzpuz
Contributor

> When a non-WAL table is dropped and re-created, purge jobs can push max txn using TxnScoreboardPoolV2.isRangeAvailable() and it can lead to a timeout on opening TableReader, where it cannot lock the latest transaction in the scoreboard.

Why exactly do table readers time out? Aren't they supposed to spin, reading the latest txn and then trying to acquire it? Or do we somehow end up with the max txn being incremented many times in this scenario?

@ideoma
Collaborator Author

ideoma commented Sep 2, 2025

> Why exactly do table readers time out? Aren't they supposed to spin, reading the latest txn and then trying to acquire it? Or do we somehow end up with the max txn being incremented many times in this scenario?

The purge job calls isRangeAvailable(0,13) because it tries to clean up the dropped table with the same directory name (a non-WAL table). This pushes the max txn to 13 in the scoreboard. Readers then time out because they spin trying to acquire txn 0, which they read from the _txn file, and they cannot, because max is 13.
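
The reader side of this scenario can be sketched as a spin loop with a deadline. This is a simplified illustration, not the actual TableReader code: `ReaderOpenSketch`, `tryAcquire`, `openReader`, the `LongSupplier` standing in for the _txn file, and the rule that a txn below the scoreboard max cannot be acquired are all assumptions.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Illustrative sketch only; none of these names exist in QuestDB.
final class ReaderOpenSketch {
    // Hypothetical stand-in for the scoreboard's max txn.
    static final AtomicLong scoreboardMax = new AtomicLong();

    // Assumed rule: acquisition succeeds only when txn is not below the current max.
    static boolean tryAcquire(long txn) {
        while (true) {
            long max = scoreboardMax.get();
            if (txn < max) {
                return false;
            }
            if (scoreboardMax.compareAndSet(max, txn)) {
                return true;
            }
        }
    }

    // Spins re-reading the latest txn and trying to lock it.
    // Returns the acquired txn, or -1 if the deadline passes first.
    static long openReader(LongSupplier txnFile, long timeoutNanos) {
        long deadline = System.nanoTime() + timeoutNanos;
        do {
            long txn = txnFile.getAsLong();
            if (tryAcquire(txn)) {
                return txn;
            }
            Thread.onSpinWait();
        } while (System.nanoTime() - deadline < 0);
        return -1;
    }

    public static void main(String[] args) {
        LongSupplier recreatedTable = () -> 0L; // the re-created table's _txn file says txn 0

        scoreboardMax.set(13); // a purge job pushed max to 13 for the dropped table
        System.out.println("max=13: " + openReader(recreatedTable, 5_000_000L));

        scoreboardMax.set(0); // max left untouched by a read-only range check
        System.out.println("max=0: " + openReader(recreatedTable, 5_000_000L));
    }
}
```

Spinning does not help here because the value the reader re-reads never changes: the _txn file keeps reporting 0 while the scoreboard max stays at 13, so every attempt fails until the deadline.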

@puzpuzpuz
Contributor

> because it tries to clean up dropped table with the same dir name (non-wal table)

Gotcha, that's the culprit!

@puzpuzpuz added the Bug (Incorrect or unexpected behavior) and Core (Related to storage, data type, etc.) labels Sep 2, 2025
puzpuzpuz previously approved these changes Sep 2, 2025
@glasstiger
Contributor

[PR Coverage check]

😍 pass : 20 / 22 (90.91%)

file detail

| path | covered line | new line | coverage |
|---|---|---|---|
| 🔵 io/questdb/tasks/ColumnPurgeTask.java | 5 | 6 | 83.33% |
| 🔵 io/questdb/cairo/ColumnPurgeOperator.java | 10 | 11 | 90.91% |
| 🔵 io/questdb/cairo/VacuumColumnVersions.java | 1 | 1 | 100.00% |
| 🔵 io/questdb/griffin/PurgingOperator.java | 2 | 2 | 100.00% |
| 🔵 io/questdb/cairo/CairoEngine.java | 1 | 1 | 100.00% |
| 🔵 io/questdb/cairo/ColumnPurgeJob.java | 1 | 1 | 100.00% |

@puzpuzpuz self-requested a review September 2, 2025 15:34
@ideoma merged commit e3c705c into master Sep 2, 2025
35 checks passed
@ideoma deleted the fix-reader-open-timeout-on-scoreboard-lock branch September 2, 2025 16:57
Labels

Bug (Incorrect or unexpected behavior), Core (Related to storage, data type, etc.)