fix(spans): Prevent silent span loss in done_flush_segments#110462
Conversation
Backend Test Failures
Race condition: between flush_segments() reading segment data and
done_flush_segments() cleaning it up, process_spans() can add new spans.
Previously, done_flush_segments blindly deleted the segment set, counters,
and queue entry, permanently destroying those new spans with no outcome
tracked.
There are actually two race windows:
1. Queue score race: process_spans updates the queue score via ZADD after
its Lua script runs. If done_flush_segments removes the queue entry
after new spans arrived, those spans lose their queue entry.
2. Segment data race: even after conditionally removing the queue entry,
new spans can arrive before the data deletion pipeline runs. The spans
get added to the set (by add-buffer.lua) which is then deleted.
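A minimal in-memory sketch of the failure mode described above (pure-Python stand-ins for the Redis keys; the names `segment` and `ic` mirror the description, everything else is hypothetical):

```python
# Toy stand-in for the per-segment Redis state: the span set plus the
# ingested counter ("ic") that add-buffer.lua increments per span.
store = {"segment": {"span-1", "span-2"}, "ic": 2}

# flush_segments() reads the spans that will be produced to Kafka.
flushed_spans = set(store["segment"])

# Race window: process_spans() lands a new span before cleanup runs.
store["segment"].add("span-3")
store["ic"] += 1

# Naive done_flush_segments(): delete everything unconditionally.
del store["segment"]
del store["ic"]

# "span-3" was never flushed and its data is gone -- silent span loss.
lost = {"span-3"} - flushed_spans
```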
Fix (two-phase conditional cleanup):
Phase 1 — Conditional ZREM on queue slot: capture queue scores during
flush_segments (via withscores=True). A Lua script atomically checks
whether the score is unchanged and only removes the queue entry if so.
This catches most races (when ZADD has already run) and serves as an
early-skip optimization.
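A rough Python simulation of the Phase 1 check (in the real code this is a Lua script running atomically on the queue's cluster slot; here the queue is just a dict of member to score, and the key names are made up):

```python
def conditional_zrem(queue: dict, member: str, expected_score: float) -> bool:
    """Remove `member` from the queue only if its score is unchanged.

    Stand-in for the Phase 1 Lua script. Returns True if the entry
    was removed, False if a concurrent ZADD changed the score.
    """
    if queue.get(member) == expected_score:
        del queue[member]
        return True
    return False


queue = {"seg:a": 100.0}
captured = 100.0  # score captured during flush_segments (withscores=True)

# Case 1: no concurrent ZADD -- removal succeeds.
assert conditional_zrem(dict(queue), "seg:a", captured)

# Case 2: process_spans re-ZADDs with a newer score first -- removal is
# skipped, so the new spans keep their queue entry and flush later.
queue["seg:a"] = 200.0
removed = conditional_zrem(queue, "seg:a", captured)
```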
Phase 2 — Conditional data deletion on segment slot: a second Lua script
atomically checks whether the ingested count (ic) is unchanged since
flush time, and only deletes the segment data (set, hrs, ic, ibc) if so.
Since add-buffer.lua and this cleanup script operate on the same
{project_id:trace_id} Redis Cluster slot, they cannot interleave. This
is the actual safety guarantee: if add-buffer.lua ran (adding spans and
incrementing ic), the cleanup script sees the changed ic and skips
deletion. If add-buffer.lua runs after the cleanup script already
deleted, it creates a fresh set and re-adds to the queue — no data loss.
The ic value captured during flush_segments is read non-atomically
relative to the set data, but this only causes false positives
(unnecessary re-flushes), never false negatives (data loss), because ic
is monotonically increasing.
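The Phase 2 check-and-delete can be simulated the same way (a sketch, not the actual Lua script; in Redis all four keys share one `{project_id:trace_id}` slot, so the check and the deletes cannot interleave with add-buffer.lua -- here a dict plays that role and the key names are hypothetical):

```python
def conditional_delete_segment(store: dict, prefix: str, expected_ic: int) -> bool:
    """Delete segment data only if the ingested count ("ic") is unchanged.

    Stand-in for the Phase 2 Lua script. Returns True if the data
    (set, hrs, ic, ibc) was deleted, False if new spans arrived.
    """
    if store.get(f"{prefix}:ic") != expected_ic:
        return False  # ic moved since flush time; keep everything
    for suffix in ("set", "hrs", "ic", "ibc"):
        store.pop(f"{prefix}:{suffix}", None)
    return True


store = {"t1:set": {"s1"}, "t1:hrs": 0, "t1:ic": 1, "t1:ibc": 10}
flushed_ic = 1  # ic captured at flush time

# A new span arrives before cleanup: ic changes, so deletion is skipped
# and the fresh span survives for the next flush cycle.
store["t1:set"].add("s2")
store["t1:ic"] += 1
skipped = not conditional_delete_segment(store, "t1", flushed_ic)
```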
Gated behind the `spans.buffer.done-flush-conditional-zrem` option
(default off) for safe rollout.
Discarded approaches:
- Rename-before-read: atomically RENAME the set key to a temp key before
reading in flush_segments, so new spans create a fresh set. Discarded
because if the flusher crashes after rename but before producing to
Kafka, the spans in the renamed key are lost (violates crash safety).
Also requires invasive changes to add-buffer.lua and _load_segment_data.
- Generation counter: flush_segments writes a "flush generation" marker;
add-buffer.lua checks it and writes to a new set key if present.
Discarded because it adds a check to the hot path (add-buffer.lua runs
for every span batch), increases complexity of an already non-trivial
Lua script, and complicates segment merges across generations.
- Score-only conditional ZREM (Phase 1 alone): captures queue scores and
conditionally removes the queue entry. Discarded as sole fix because
the ZADD happens in Python after add-buffer.lua, so there is a window
where ic changed but the score hasn't — Phase 1 would succeed and
proceed to delete data containing new spans.
Force-pushed fd5811b to 9efa8a7
…nditions # Conflicts: # src/sentry/spans/buffer.py # tests/sentry/spans/test_buffer.py
evanh left a comment:
This is a great change. Nice work!
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: `_apply_per_trace_limit` is defined but never called. I wired `_apply_per_trace_limit` into `flush_segments` before loading segment data and registered `spans.buffer.max-flush-segments-per-trace` in options defaults so the limit is active and configurable.
Or push these changes by commenting:
@cursor push db48e43bb4
Preview (db48e43bb4)
diff --git a/src/sentry/options/defaults.py b/src/sentry/options/defaults.py
--- a/src/sentry/options/defaults.py
+++ b/src/sentry/options/defaults.py
@@ -3214,6 +3214,13 @@
default=500,
flags=FLAG_PRIORITIZE_DISK | FLAG_AUTOMATOR_MODIFIABLE,
)
+# Maximum number of segments per trace to flush in one cycle. 0 disables the limit.
+register(
+ "spans.buffer.max-flush-segments-per-trace",
+ type=Int,
+ default=0,
+ flags=FLAG_PRIORITIZE_DISK | FLAG_AUTOMATOR_MODIFIABLE,
+)
# Maximum memory percentage for the span buffer in Redis before rejecting messages.
register(
"spans.buffer.max-memory-percentage",
diff --git a/src/sentry/spans/buffer.py b/src/sentry/spans/buffer.py
--- a/src/sentry/spans/buffer.py
+++ b/src/sentry/spans/buffer.py
@@ -568,6 +568,7 @@
queue_keys = []
shard_factor = max(1, len(self.assigned_shards))
max_flush_segments = options.get("spans.buffer.max-flush-segments")
+ max_flush_segments_per_trace = options.get("spans.buffer.max-flush-segments-per-trace")
flusher_logger_enabled = options.get("spans.buffer.flusher-cumulative-logger-enabled")
max_segments_per_shard = math.ceil(max_flush_segments / shard_factor)
@@ -589,6 +590,8 @@
for segment_key, score in keys_with_scores:
segment_keys.append((shard, queue_key, segment_key, score))
+ segment_keys = self._apply_per_trace_limit(segment_keys, max_flush_segments_per_trace, now)
+
data_start = time.monotonic()
with metrics.timer("spans.buffer.flush_segments.load_segment_data"):
segments, ingested_counts = self._load_segment_data([k for _, _, k, _ in segment_keys])
The per-trace segment flush limit feature was reverted on master (083621e). The previous merge conflict resolution kept the dead method definition and tests. Remove them to match master.
- Use `ingested_counts[segment_key]` instead of `.get(…, 0)` so a missing ic (which should never happen) raises KeyError instead of silently producing a zero that skips Phase 2 cleanup.
- Pass score as float to the Lua script instead of truncating via `int()`, preserving full precision for the comparison.
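A toy illustration of the truncation concern in the commit message above (the timestamp values are made up; any two distinct scores within the same integer second behave this way):

```python
# Score captured at flush time, and a newer score re-ZADDed by
# process_spans within the same second.
captured = 1700000000.123
current = 1700000000.456

# Truncating via int() makes the two scores compare equal, so a
# conditional check would wrongly conclude "unchanged".
assert int(captured) == int(current)

# Comparing the full floats detects the change.
assert captured != current
```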
…e precision" This reverts commit 0b7a95c.
    spans=output_spans,
    project_id=int(project_id.decode("ascii")),
    score=score,
    ingested_count=ingested_counts.get(segment_key, 0),
Bug: When conditional cleanup removes a segment's queue entry (Phase 1) but the subsequent data deletion fails (Phase 2), the segment data is orphaned in Redis without being re-queued.
Severity: HIGH
Suggested Fix
Modify the cleanup logic to handle cases where Phase 1 succeeds but Phase 2 fails. One approach is to re-add the segment's entry to the processing queue if the data deletion in Phase 2 is skipped for a segment whose queue entry was already removed in Phase 1. This ensures the segment will be retried later, preventing it from being orphaned.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: src/sentry/spans/buffer.py#L640
Potential issue: The two-phase conditional cleanup logic can orphan segment data in
Redis. If Phase 1 successfully removes a segment's entry from the processing queue
(because its score matches), but Phase 2 fails to delete the segment's data (e.g.,
because the `ingested_count` key has expired or changed), the segment is marked for
skipping. However, there is no mechanism to re-queue segments that are skipped after
their queue entry has already been removed. This results in the segment data remaining
in Redis indefinitely without any way to be processed or cleaned up.
If deletion fails, I don't think we should try to flush again; the data has already been flushed downstream. The only consequence is some leftover keys in Redis that will expire via TTL.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
    ingested_byte_count = ingested_results[i * 2 + 1]

    if ingested_count:
        ingested_counts[key] = int(ingested_count)
Unsafe ingested count capture ordering
High Severity
_load_segment_data captures ic after scanning segment sets. If new spans are added between scan completion and this GET, expected_ic includes spans not present in flushed data, allowing Phase 2 cleanup to delete unseen spans when counts still match.
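The ordering hazard flagged here can be sketched with the same in-memory stand-in (hypothetical key names; the two numbered reads mimic the unsafe order in `_load_segment_data`):

```python
store = {"set": {"s1"}, "ic": 1}

# 1) Scan the span set first (this is what gets produced to Kafka).
scanned = set(store["set"])

# ...a new span lands between the scan and the read of ic...
store["set"].add("s2")
store["ic"] += 1

# 2) Read ic afterwards: expected_ic now counts "s2", which was
#    never scanned.
expected_ic = store["ic"]

# Later, Phase 2 compares against the current ic. No further spans
# arrived, so the counts match and cleanup would delete the set --
# including "s2", which was never flushed.
would_delete = store["ic"] == expected_ic
```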



Bug: Between `flush_segments()` reading segment data and `done_flush_segments()` cleaning it up, `process_spans()` can add new spans. Then, `done_flush_segments` blindly deletes the segment set, counters, and queue entry, permanently destroying those new spans with no outcome tracked.
This PR aims to restore the at-least-once guarantees that we usually provide in Kafka consumers. In particular, we want the consumer to not lose data when it crashes and restarts, but it is okay to double-flush or double-produce outcomes.
There are two race windows, each with its own fix:
1. Queue score race: `process_spans` updates the queue score via ZADD after its Lua script runs. If `done_flush_segments` removes the queue entry after new spans arrived, those spans lose their queue entry.
2. Segment data race: even after fixing the previous race by conditionally removing the queue entry, new spans can arrive while `done_flush_segments` runs, as `done_flush_segments` runs two non-transactional pipelines (one for checking-and-updating the queue, one for deleting segment sets). The spans get added to the set (by add-buffer.lua) which is then deleted.
Fix for race window 1:
Conditional ZREM on queue slot: capture queue scores during `flush_segments` (via `withscores=True`). A Lua script atomically checks whether the score is unchanged and only removes the queue entry if so.
Fix for race window 2:
Conditional data deletion on segment slot: a second Lua script atomically checks whether the ingested count (ic) is unchanged since flush time, and only deletes the segment data (set, hrs, ic, ibc) if so. Since add-buffer.lua and this cleanup script operate on the same `{project_id:trace_id}` Redis Cluster slot, they cannot interleave. This is the actual safety guarantee: if add-buffer.lua ran (adding spans and incrementing ic), the cleanup script sees the changed ic and skips deletion. If add-buffer.lua runs after the cleanup script already deleted, it creates a fresh set and re-adds to the queue, so no data loss.
The ic value captured during `flush_segments` is read non-atomically relative to the set data, but this only causes false positives (unnecessary re-flushes), never false negatives (data loss), because ic is monotonically increasing.
Gated behind the `spans.buffer.done-flush-conditional-zrem` option (default off) for safe rollout.
Discarded approaches:
Rename-before-read: atomically RENAME the set key to a temp key before
reading in flush_segments, so new spans create a fresh set. Discarded
because if the flusher crashes after rename but before producing to
Kafka, the spans in the renamed key are lost (violates crash safety).
Also requires invasive changes to add-buffer.lua and _load_segment_data.
Generation counter: flush_segments writes a "flush generation" marker;
add-buffer.lua checks it and writes to a new set key if present.
Discarded because it adds a check to the hot path (add-buffer.lua runs
for every span batch), increases complexity of an already non-trivial
Lua script, and complicates segment merges across generations.
Score-only conditional ZREM (Phase 1 alone): captures queue scores and
conditionally removes the queue entry. Discarded as sole fix because
the ZADD happens in Python after add-buffer.lua, so there is a window
where ic changed but the score hasn't — Phase 1 would succeed and
proceed to delete data containing new spans.
Fix STREAM-773