Fix/dynamic json oom tiny part storm backup (FOR CI) #1127
Closed
…d OOM (#1113)

Root cause: hasDynamicSubcolumns() triggered doCommit() for every single JSON record, creating one part per record. With 128 shards and high throughput, this produced thousands of tiny parts that exhausted the commit pool, triggered TOO_MANY_PARTS, and finally caused OOM during loadOutdatedDataParts on restart.

Changes:
1. Batch JSON commits (StreamShardStore.cpp): accumulate records until 8192 rows or 16 MB before committing, matching non-JSON block behavior. The final batch is committed at the end of each consume cycle. Part count drops from O(records) to O(records / 8192).
2. Exponential backoff on pool saturation (StreamShardStore.cpp): when part_commit_pool.trySchedule() fails, sleep with exponential backoff (100ms -> 200ms -> ... -> 5s cap) instead of retrying immediately, giving background merge threads time to compact parts.
3. Configurable thresholds (Settings.h): added dynamic_commit_row_threshold (default: 8192) and dynamic_commit_byte_threshold (default: 16 MB) as CONFIGURABLE_GLOBAL_SETTINGS so operators can tune them per workload.
4. Observability: LOG_DEBUG for threshold-triggered commits with sn_range info, LOG_TRACE for batch accumulation progress, and a backoff warning that includes the current sleep duration.

Fixes: #1113
- Wire dynamic_commit_row_threshold and dynamic_commit_byte_threshold from global settings via getSettingsRef() so the knobs are live. The previous static constexpr values made the settings dead code.
- Cap worker-lambda commit retries at MAX_COMMIT_RETRIES=10 (~20s). The old infinite retry loop pinned a pool thread under TOO_MANY_PARTS failures, saturating the commit pool and causing the liveness issue called out in #1113. On give-up we log the failure and call progressSequences() so the producer is not blocked forever.
- Add an explicit JSON/non-JSON block boundary check before mergeBlocks(). Previously a non-JSON block followed by a JSON block of the same schema version was silently merged into a thresholded JSON batch. Now we detect the transition and flush the current block first, keeping the JSON and non-JSON commit paths strictly separated.
progressSequences() was called unconditionally after the retry loop, which meant that if all MAX_COMMIT_RETRIES attempts failed the committed SN would still advance — causing commitSNLocal() to persist a sequence number for data that was never written to disk. Now we track success with a bool flag and only call progressSequences() on the success path. On the failure path we log the dropped range and release the pool thread without acknowledging the sequence, so NativeLog will replay the missing SN range on the next restart. Addresses review feedback from @yokofly.
When all MAX_COMMIT_RETRIES attempts fail, the previous code only logged
and returned, leaving moved_seq stuck at the front of outstanding_sns.
Because progressSequencesWithLockHeld() only pops the front when seq ==
outstanding_sns.front(), later successful commits would accumulate in
local_committed_sns forever and last_sn would never advance — wedging
the entire shard.
Add failSequencesWithLockHeld() which:
- pops the failed seq from outstanding_sns WITHOUT advancing last_sn
(so the gap is preserved and NativeLog replays it on restart)
- drains any contiguous ranges already sitting in local_committed_sns
so subsequent successful commits can resume advancing last_sn
Wire it into the worker lambda failure path under sns_mutex.
Addresses review feedback from @yokofly.
Co-Authored-By: chatgpt-codex-connector[bot] <199175422+chatgpt-codex-connector[bot]@users.noreply.github.com>
PR checklist:
- proton: starts/ends for new code in existing community code base?
- Please write a user-readable short description of the changes: