fix(core): fix checkpoint timeout caused by slow sync() system call #6785
bluestreak01 merged 1 commit into master
No actionable comments were generated in the recent review.

Walkthrough

Relocates the circuit breaker state validation in checkpoint creation to an earlier point in the flow: the check now runs after logging preferences store inclusion and before the flush/sync operations. The check at the previous position is removed, so the breaker is evaluated once, at the new location.

Estimated code review effort: 2 (Simple), ~10 minutes
[PR Coverage check] pass: 1 / 1 (100.00%)
Summary
- Move the circuit breaker check in `DatabaseCheckpointAgent.checkpointCreate()` from after `ff.sync()` to before it

The POSIX `sync()` system call flushes all dirty filesystem buffers system-wide, not just QuestDB's files. On busy CI machines (or production hosts with heavy I/O from other processes), `sync()` can block for well over the 60-second query timeout. The circuit breaker was only checked after `sync()` returned, so a completed checkpoint was discarded when the timeout had been exceeded during the blocking `sync()` call.

Moving the check before `sync()` means:
- a timeout or cancellation is still detected before entering `sync()`
- once started, `sync()` runs to completion and the checkpoint succeeds regardless of how long `sync()` takes, so completed work is no longer discarded

Test plan
- Confirmed from `ReplicationAclTest.testAclModelOnEmptyReplica` (3 collected failures) that the root cause was `sync()` blocking for 67–266 seconds while the query timeout is 60 seconds
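
The effect of the reordering described in the summary can be illustrated with a small, self-contained sketch. This is not QuestDB's code: `FakeClock`, `TimeoutBreaker`, and both `checkpoint...` methods are hypothetical stand-ins, and the blocking `sync()` is simulated by advancing a manual clock. The 60-second timeout and the 67-second sync duration are taken from the figures quoted above.

```java
public class CheckpointOrderingSketch {

    // Hypothetical manual clock, so the example is deterministic and never sleeps.
    static final class FakeClock {
        long nowSeconds;
    }

    // Hypothetical stand-in for a circuit breaker that enforces a query timeout.
    static final class TimeoutBreaker {
        private final FakeClock clock;
        private final long deadlineSeconds;

        TimeoutBreaker(FakeClock clock, long timeoutSeconds) {
            this.clock = clock;
            this.deadlineSeconds = clock.nowSeconds + timeoutSeconds;
        }

        // Throws once the deadline has passed.
        void checkNotTripped() {
            if (clock.nowSeconds > deadlineSeconds) {
                throw new IllegalStateException("timeout exceeded");
            }
        }
    }

    // Old ordering: the breaker is checked only after the (simulated) sync returns,
    // so a slow sync causes the finished checkpoint to be discarded.
    static boolean checkpointCheckAfterSync(FakeClock clock, TimeoutBreaker breaker, long syncSeconds) {
        try {
            clock.nowSeconds += syncSeconds; // simulated blocking sync()
            breaker.checkNotTripped();
            return true;
        } catch (IllegalStateException e) {
            return false; // checkpoint discarded
        }
    }

    // New ordering: the breaker is checked first; once sync starts, it runs to completion.
    static boolean checkpointCheckBeforeSync(FakeClock clock, TimeoutBreaker breaker, long syncSeconds) {
        try {
            breaker.checkNotTripped();
            clock.nowSeconds += syncSeconds; // simulated blocking sync()
            return true;
        } catch (IllegalStateException e) {
            return false; // rejected before any sync work started
        }
    }

    public static void main(String[] args) {
        long timeoutSeconds = 60; // query timeout from the summary
        long slowSyncSeconds = 67; // shortest slow sync reported in the test plan

        FakeClock oldClock = new FakeClock();
        boolean oldResult = checkpointCheckAfterSync(
                oldClock, new TimeoutBreaker(oldClock, timeoutSeconds), slowSyncSeconds);

        FakeClock newClock = new FakeClock();
        boolean newResult = checkpointCheckBeforeSync(
                newClock, new TimeoutBreaker(newClock, timeoutSeconds), slowSyncSeconds);

        System.out.println("check after sync succeeded:  " + oldResult);
        System.out.println("check before sync succeeded: " + newResult);
    }
}
```

With these inputs the old ordering reports failure and the new ordering reports success. The sketch also shows the trade-off of the new ordering: once the simulated sync has started, the timeout no longer interrupts it.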