fix(core): fix checkpoint timeout caused by slow sync() system call #6785

Merged
bluestreak01 merged 1 commit into master from ia_flaky_test_fix on Feb 16, 2026

@glasstiger glasstiger commented Feb 16, 2026

Summary

  • Move the circuit breaker check in DatabaseCheckpointAgent.checkpointCreate() from after ff.sync() to before it

The POSIX sync() system call flushes all dirty filesystem buffers system-wide, not just QuestDB's files. On busy CI machines (or production hosts with heavy I/O from other processes), sync() can block for well over the 60-second query timeout. The circuit breaker was only checked after sync() returned, so a completed checkpoint was discarded when the timeout had been exceeded during the blocking sync() call.

Moving the check before sync() means:

  1. If the timeout is already exceeded (from the table loop or lock acquisition), we fail fast without entering a potentially long-blocking sync()
  2. If the timeout has not been exceeded, sync() runs to completion and the checkpoint succeeds regardless of how long sync() takes — no more discarding completed work
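
The two orderings can be compared with a small simulation. This is a minimal sketch under stated assumptions, not the code in DatabaseCheckpointAgent: the class `CheckpointOrderingSketch`, the `TimeoutBreaker` and `FakeClock` types, and both method names are hypothetical, and the blocking sync() call is replaced by advancing a fake clock so the example runs instantly.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch: none of these names come from QuestDB.
public class CheckpointOrderingSketch {

    /** Simulated clock; advancing it stands in for time spent in real work. */
    static final class FakeClock implements LongSupplier {
        private long now;

        @Override
        public long getAsLong() {
            return now;
        }

        void advance(long millis) {
            now += millis;
        }
    }

    /** Time-based breaker that trips once the deadline has passed. */
    static final class TimeoutBreaker {
        private final LongSupplier clock;
        private final long deadlineMillis;

        TimeoutBreaker(LongSupplier clock, long timeoutMillis) {
            this.clock = clock;
            this.deadlineMillis = clock.getAsLong() + timeoutMillis;
        }

        void throwIfTripped() {
            if (clock.getAsLong() > deadlineMillis) {
                throw new IllegalStateException("timeout exceeded");
            }
        }
    }

    /** Old ordering: the breaker is consulted only after sync has returned. */
    public static boolean checkAfterSync(long tableWorkMillis, long syncMillis, long timeoutMillis) {
        FakeClock clock = new FakeClock();
        TimeoutBreaker breaker = new TimeoutBreaker(clock, timeoutMillis);
        try {
            clock.advance(tableWorkMillis); // per-table checkpoint work
            clock.advance(syncMillis);      // stands in for the blocking sync()
            breaker.throwIfTripped();       // a slow sync discards finished work
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    /** New ordering: the breaker is consulted before entering sync. */
    public static boolean checkBeforeSync(long tableWorkMillis, long syncMillis, long timeoutMillis) {
        FakeClock clock = new FakeClock();
        TimeoutBreaker breaker = new TimeoutBreaker(clock, timeoutMillis);
        try {
            clock.advance(tableWorkMillis); // per-table checkpoint work
            breaker.throwIfTripped();       // fail fast if already over the timeout
            clock.advance(syncMillis);      // sync runs to completion, however long
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        long timeout = 60_000; // 60-second query timeout from the description
        // 10 ms of table work followed by a 67-second sync
        System.out.println("check after sync:  " + checkAfterSync(10, 67_000, timeout));
        System.out.println("check before sync: " + checkBeforeSync(10, 67_000, timeout));
        // Timeout already exceeded before sync is reached
        System.out.println("late before sync:  " + checkBeforeSync(61_000, 1_000, timeout));
    }
}
```

With 10 ms of table work and a 67-second sync, the old ordering reports failure while the new ordering reports success. When the timeout is already exceeded before sync is reached, the new ordering fails without entering sync at all.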

Test plan

  • Verified with CI failure logs from ReplicationAclTest.testAclModelOnEmptyReplica (3 collected failures) that the root cause was sync() blocking for 67–266 seconds while the query timeout is 60 seconds
  • All table checkpoints completed in <10ms in every failure — no writer contention involved

coderabbitai bot commented Feb 16, 2026

No actionable comments were generated in the recent review. 🎉


Walkthrough

Relocates circuit breaker state validation in checkpoint creation to an earlier point in the flow, executing the check after logging preferences store inclusion and before flush/sync operations. Removes the previous check position, ensuring the breaker is evaluated once at the new location.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Circuit Breaker Timing: core/src/main/java/io/questdb/cairo/DatabaseCheckpointAgent.java | Moved circuit breaker trip check earlier in checkpoint creation flow, executing after preferences store logging but before flush/sync operations. Eliminates previous check position, enabling faster failure if breaker is tripped. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

Bug, storage

Suggested reviewers

  • bluestreak01
  • ideoma
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: moving the circuit breaker check before sync() to prevent checkpoint timeout issues caused by slow sync() calls. |
| Description check | ✅ Passed | The description clearly explains the problem (sync() blocking during checkpoints), the solution (moving the circuit breaker check earlier), and provides specific test evidence from CI failures. |
| Merge Conflict Detection | ✅ Passed | ✅ No merge conflicts detected when merging into master |


@glasstiger glasstiger changed the title from "chore(core): fix flaky ent test" to "fix(core): fix checkpoint timeout caused by slow sync() system call" Feb 16, 2026
@glasstiger (Contributor, Author) commented

@coderabbitai review

coderabbitai bot commented Feb 16, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@glasstiger (Contributor, Author) commented

[PR Coverage check]

😍 pass : 1 / 1 (100.00%)

file detail

| path | covered line | new line | coverage |
|---|---|---|---|
| 🔵 io/questdb/cairo/DatabaseCheckpointAgent.java | 1 | 1 | 100.00% |

@bluestreak01 bluestreak01 merged commit 2e4afb2 into master Feb 16, 2026
46 checks passed
@bluestreak01 bluestreak01 deleted the ia_flaky_test_fix branch February 16, 2026 21:29
maciulis pushed a commit to maciulis/questdb that referenced this pull request Feb 19, 2026