Skip to content

Conversation

@stelfrag
Copy link
Collaborator

@stelfrag stelfrag commented Nov 27, 2025

Summary
  • Improve detection for replication stalling

Summary by cubic

Stops replication from stalling by detecting empty-response loops and safely forcing completion when appropriate. This unblocks stuck charts and makes replication finish reliably.

  • Bug Fixes
    • Track consecutive suspicious empty responses per chart (production-safe counter) and reset on real data or when streaming starts.
    • After 3 suspicious empty responses, compare parent vs child retention; if the parent has equal or newer data, mark replication finished, set progress to 100%, and send a final request to break the loop.
    • When splitting a response due to buffer overflow, force start_streaming=false to avoid truncation and data loss.
    • If a requested chart is missing on the sender, immediately send an empty REPLAY_END with start_streaming=true to unblock the parent.

Written for commit 85ec862. Summary will update automatically on new commits.

@stelfrag
Copy link
Collaborator Author

@cubic-dev-ai review this PR

@cubic-dev-ai
Copy link
Contributor

cubic-dev-ai bot commented Nov 27, 2025

@cubic-dev-ai review this PR

@stelfrag I've started the AI code review. It'll take a few minutes to complete.

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 3 files

Prompt for AI agents (all 2 issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/plugins.d/pluginsd_replication.c">

<violation number="1" location="src/plugins.d/pluginsd_replication.c:491">
`is_empty_response` checks the requested range instead of the child’s actual data, so empty-response detection never activates and stalled replications remain unresolved.</violation>
</file>

<file name="src/streaming/stream-replication-sender.c">

<violation number="1" location="src/streaming/stream-replication-sender.c:1250">
The empty-response path frees the sender buffer instead of committing it, so no REPLAY_END message is ever sent and the parent stays blocked.</violation>
</file>

Reply to cubic to teach it or ask questions. Re-run a review with @cubic-dev-ai review this PR

@stelfrag
Copy link
Collaborator Author

@cubic-dev-ai review this PR

@cubic-dev-ai
Copy link
Contributor

cubic-dev-ai bot commented Nov 27, 2025

@cubic-dev-ai review this PR

@stelfrag I've started the AI code review. It'll take a few minutes to complete.

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 4 files

Prompt for AI agents (all 1 issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/streaming/stream-replication-sender.c">

<violation number="1" location="src/streaming/stream-replication-sender.c:451">
Splitting a response because of max buffer size must never leave `enable_streaming` true; otherwise we mark replication finished even though part of the requested interval was dropped, permanently losing data.</violation>
</file>

Reply to cubic to teach it or ask questions. Re-run a review with @cubic-dev-ai review this PR

@stelfrag
Copy link
Collaborator Author

@cubic-dev-ai review this PR

@cubic-dev-ai
Copy link
Contributor

cubic-dev-ai bot commented Nov 27, 2025

@cubic-dev-ai review this PR

@stelfrag I've started the AI code review. It'll take a few minutes to complete.

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 4 files

@stelfrag stelfrag marked this pull request as ready for review November 28, 2025 06:49
@stelfrag stelfrag removed the request for review from Ferroin November 28, 2025 06:52
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 4 files

Copy link
Contributor

@thiagoftsm thiagoftsm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replication is working as expected locally. I stopped parent some time and after its restart, gap on parent was filled as expected. LGTM!

@stelfrag stelfrag merged commit c643895 into netdata:master Nov 30, 2025
119 checks passed
@stelfrag stelfrag deleted the fix_replication_stall_part2 branch November 30, 2025 18:55
stelfrag added a commit to stelfrag/netdata that referenced this pull request Dec 1, 2025
* Implement stuck replication detection and response handling

* Fix checks and commit response for not found chart

* Initialize replication stuck detection counter and refine replication completion conditions

* Refine streaming response handling to prevent data loss during buffer overflow

(cherry picked from commit c643895)
@stelfrag stelfrag mentioned this pull request Dec 1, 2025
Ferroin pushed a commit that referenced this pull request Dec 3, 2025
* Implement stuck replication detection and response handling

* Fix checks and commit response for not found chart

* Initialize replication stuck detection counter and refine replication completion conditions

* Refine streaming response handling to prevent data loss during buffer overflow

(cherry picked from commit c643895)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants