
fix(spans): Prevent silent span loss in done_flush_segments #110462

Merged
untitaker merged 7 commits into master from span-buffer-race-conditions
Mar 16, 2026
Conversation

@untitaker
Member

@untitaker untitaker commented Mar 11, 2026

Bug: Between flush_segments() reading segment data and done_flush_segments() cleaning it up, process_spans() can add new spans. Then, done_flush_segments blindly deletes the segment set, counters, and queue entry, permanently destroying those new spans with no outcome tracked.

This PR restores the at-least-once guarantees we usually provide in Kafka consumers: the consumer must not lose data when it crashes and restarts, but it is acceptable to double-flush or double-produce outcomes.

This PR closes two race windows, with one fix for each:

  1. Queue score race: process_spans updates the queue score via ZADD after
    its Lua script runs. If done_flush_segments removes the queue entry
    after new spans arrived, those spans lose their queue entry.

  2. Segment data race: even after fixing the previous race by conditionally
    removing the queue entry, new spans can arrive while done_flush_segments
    runs, as done_flush_segments runs two non-transactional pipelines (one for
    checking-and-updating the queue, one for deleting segment sets). The spans
    get added to the set (by add-buffer.lua) which is then deleted.

Fix for race window 1:

Conditional ZREM on queue slot: capture queue scores during
flush_segments (via withscores=True). A Lua script atomically checks
whether the score is unchanged and only removes the queue entry if so.
This catches most races (when the ZADD has already run) and serves as an
early-skip optimization.
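The check-and-remove step can be modeled in plain Python (an illustrative sketch of the Lua script's logic, not the actual script; the key name is hypothetical):

```python
def conditional_zrem(queue: dict, member: bytes, expected_score: float) -> bool:
    """Model the queue as a sorted set (member -> score): only remove the
    entry if the score is still the one flush_segments captured."""
    if queue.get(member) == expected_score:
        del queue[member]
        return True  # removed: no new spans bumped the score in the meantime
    return False  # score changed: new spans arrived, keep the queue entry

# process_spans() bumped the score after flush, so cleanup must skip removal:
queue = {b"span-buf:s:{1:trace}:seg": 20.0}
assert conditional_zrem(queue, b"span-buf:s:{1:trace}:seg", 10.0) is False
assert queue  # entry preserved, so the new spans will be flushed later

# Nothing arrived in between, so removal proceeds:
queue = {b"span-buf:s:{1:trace}:seg": 10.0}
assert conditional_zrem(queue, b"span-buf:s:{1:trace}:seg", 10.0) is True
assert not queue
```

In the real implementation this comparison runs inside a single Lua script, which is what makes the check-then-delete atomic.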

Fix for race window 2:

Conditional data deletion on segment slot: a second Lua script
atomically checks whether the ingested count (ic) is unchanged since
flush time, and only deletes the segment data (set, hrs, ic, ibc) if so.
Since add-buffer.lua and this cleanup script operate on the same
{project_id:trace_id} Redis Cluster slot, they cannot interleave. This
is the actual safety guarantee: if add-buffer.lua ran (adding spans and
incrementing ic), the cleanup script sees the changed ic and skips
deletion. If add-buffer.lua runs after the cleanup script already
deleted, it creates a fresh set and re-adds to the queue, so no data loss.

The ic value captured during flush_segments is read non-atomically
relative to the set data, but this only causes false positives
(unnecessary re-flushes), never false negatives (data loss), because ic
is monotonically increasing.
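The segment-side check can be sketched the same way (again an illustrative Python model of the Lua script, with assumed key suffixes matching the set/hrs/ic/ibc keys named above):

```python
def conditional_delete(store: dict, seg: str, expected_ic: int) -> bool:
    """Delete segment data only if the ingested count (ic) still matches what
    flush_segments captured. Since ic only ever increases, a mismatch can only
    mean add-buffer.lua ran and new spans are sitting in the set."""
    if store.get(f"{seg}:ic", 0) != expected_ic:
        return False  # ic moved: skip deletion, spans get re-flushed later
    for suffix in ("set", "hrs", "ic", "ibc"):
        store.pop(f"{seg}:{suffix}", None)
    return True

# Clean case: nothing arrived since flush, so all segment keys are deleted.
store = {"seg:set": {"span-a", "span-b"}, "seg:ic": 2, "seg:hrs": 1, "seg:ibc": 100}
assert conditional_delete(store, "seg", expected_ic=2) is True and not store

# Race case: a new span bumped ic from 2 to 3, so deletion is skipped.
store = {"seg:set": {"span-a", "span-b", "span-c"}, "seg:ic": 3}
assert conditional_delete(store, "seg", expected_ic=2) is False
assert "seg:set" in store  # new span's data survives
```

Because both add-buffer.lua and the real cleanup script run on the same Redis Cluster slot, these two code paths cannot interleave mid-script, which is what makes the ic comparison trustworthy.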

Gated behind the spans.buffer.done-flush-conditional-zrem option
(default off) for safe rollout.

Discarded approaches:

  • Rename-before-read: atomically RENAME the set key to a temp key before
    reading in flush_segments, so new spans create a fresh set. Discarded
    because if the flusher crashes after rename but before producing to
    Kafka, the spans in the renamed key are lost (violates crash safety).
    Also requires invasive changes to add-buffer.lua and _load_segment_data.

  • Generation counter: flush_segments writes a "flush generation" marker;
    add-buffer.lua checks it and writes to a new set key if present.
    Discarded because it adds a check to the hot path (add-buffer.lua runs
    for every span batch), increases complexity of an already non-trivial
    Lua script, and complicates segment merges across generations.

  • Score-only conditional ZREM (Phase 1 alone): captures queue scores and
    conditionally removes the queue entry. Discarded as sole fix because
    the ZADD happens in Python after add-buffer.lua, so there is a window
    where ic changed but the score hasn't — Phase 1 would succeed and
    proceed to delete data containing new spans.

Fix STREAM-773

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Mar 11, 2026
@linear-code

linear-code bot commented Mar 11, 2026

@github-actions
Contributor

Backend Test Failures

Failures on 4c7d772 in this run:

tests/sentry/spans/test_buffer.py::test_zero_copy (log)
tests/sentry/spans/test_buffer.py:1207: in test_zero_copy
    assert rv == {
E   AssertionError: assert {b'span-buf:s...sted_count=4)} == {b'span-buf:s...sted_count=0)}
E     
E     Differing items:
E     {b'span-buf:s:{1:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}:bbbbbbbbbbbbbbbb': FlushedSegment(queue_key=b'span-buf:q:0', spans=... {'type': 'string', 'value': 'bbbbbbbbbbbbbbbb'}}, 'is_segment': False})], project_id=1, score=10.0, ingested_count=4)} != {b'span-buf:s:{1:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}:bbbbbbbbbbbbbbbb': FlushedSegment(queue_key=<ANY>, spans=[OutputSpa... {'sentry.segment.id': {'type': 'string', 'value': 'bbbbbbbbbbbbbbbb'}}})], project_id=1, score=0.0, ingested_count=0)}
E     
E     Full diff:
E       {
E     -     b'span-buf:s:{1:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}:bbbbbbbbbbbbbbbb': FlushedSegment(queue_key=<ANY>, spans=[OutputSpan(payload={'span_id': 'aaaaaaaaaaaaaaaa', 'is_segment': False, 'attributes': {'sentry.segment.id': {'type': 'string', 'value': 'bbbbbbbbbbbbbbbb'}}}), OutputSpan(payload={'span_id': 'bbbbbbbbbbbbbbbb', 'is_segment': True, 'attributes': {'sentry.segment.id': {'type': 'string', 'value': 'bbbbbbbbbbbbbbbb'}}}), OutputSpan(payload={'span_id': 'cccccccccccccccc', 'is_segment': False, 'attributes': {'sentry.segment.id': {'type': 'string', 'value': 'bbbbbbbbbbbbbbbb'}}}), OutputSpan(payload={'span_id': 'dddddddddddddddd', 'is_segment': False, 'attributes': {'sentry.segment.id': {'type': 'string', 'value': 'bbbbbbbbbbbbbbbb'}}})], project_id=1, score=0.0, ingested_count=0),
E     +     b'span-buf:s:{1:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}:bbbbbbbbbbbbbbbb': FlushedSegment(queue_key=b'span-buf:q:0', spans=[OutputSpan(payload={'span_id': 'aaaaaaaaaaaaaaaa', 'attributes': {'sentry.segment.id': {'type': 'string', 'value': 'bbbbbbbbbbbbbbbb'}}, 'is_segment': False}), OutputSpan(payload={'span_id': 'bbbbbbbbbbbbbbbb', 'attributes': {'sentry.segment.id': {'type': 'string', 'value': 'bbbbbbbbbbbbbbbb'}}, 'is_segment': True}), OutputSpan(payload={'span_id': 'cccccccccccccccc', 'attributes': {'sentry.segment.id': {'type': 'string', 'value': 'bbbbbbbbbbbbbbbb'}}, 'is_segment': False}), OutputSpan(payload={'span_id': 'dddddddddddddddd', 'attributes': {'sentry.segment.id': {'type': 'string', 'value': 'bbbbbbbbbbbbbbbb'}}, 'is_segment': False})], project_id=1, score=10.0, ingested_count=4),
E       }

…nditions

# Conflicts:
#	src/sentry/spans/buffer.py
#	tests/sentry/spans/test_buffer.py
@untitaker untitaker marked this pull request as ready for review March 16, 2026 14:28
@untitaker untitaker requested review from a team as code owners March 16, 2026 14:28
@untitaker untitaker requested a review from vgrozdanic March 16, 2026 14:29
Member

@evanh evanh left a comment


This is a great change. Nice work!

…nditions

# Conflicts:
#	src/sentry/spans/buffer.py
#	tests/sentry/spans/test_buffer.py
@untitaker untitaker disabled auto-merge March 16, 2026 16:10
Contributor

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: _apply_per_trace_limit is defined but never called
    • I wired _apply_per_trace_limit into flush_segments before loading segment data and registered spans.buffer.max-flush-segments-per-trace in options defaults so the limit is active and configurable.

Or push these changes by commenting:

@cursor push db48e43bb4
Preview (db48e43bb4)
diff --git a/src/sentry/options/defaults.py b/src/sentry/options/defaults.py
--- a/src/sentry/options/defaults.py
+++ b/src/sentry/options/defaults.py
@@ -3214,6 +3214,13 @@
     default=500,
     flags=FLAG_PRIORITIZE_DISK | FLAG_AUTOMATOR_MODIFIABLE,
 )
+# Maximum number of segments per trace to flush in one cycle. 0 disables the limit.
+register(
+    "spans.buffer.max-flush-segments-per-trace",
+    type=Int,
+    default=0,
+    flags=FLAG_PRIORITIZE_DISK | FLAG_AUTOMATOR_MODIFIABLE,
+)
 # Maximum memory percentage for the span buffer in Redis before rejecting messages.
 register(
     "spans.buffer.max-memory-percentage",

diff --git a/src/sentry/spans/buffer.py b/src/sentry/spans/buffer.py
--- a/src/sentry/spans/buffer.py
+++ b/src/sentry/spans/buffer.py
@@ -568,6 +568,7 @@
         queue_keys = []
         shard_factor = max(1, len(self.assigned_shards))
         max_flush_segments = options.get("spans.buffer.max-flush-segments")
+        max_flush_segments_per_trace = options.get("spans.buffer.max-flush-segments-per-trace")
         flusher_logger_enabled = options.get("spans.buffer.flusher-cumulative-logger-enabled")
         max_segments_per_shard = math.ceil(max_flush_segments / shard_factor)
 
@@ -589,6 +590,8 @@
             for segment_key, score in keys_with_scores:
                 segment_keys.append((shard, queue_key, segment_key, score))
 
+        segment_keys = self._apply_per_trace_limit(segment_keys, max_flush_segments_per_trace, now)
+
         data_start = time.monotonic()
         with metrics.timer("spans.buffer.flush_segments.load_segment_data"):
             segments, ingested_counts = self._load_segment_data([k for _, _, k, _ in segment_keys])


@github-actions
Contributor

Backend Test Failures

Failures on c65ee58 in this run:

tests/sentry/spans/test_buffer.py::test_per_trace_flush_limit_throttles[cluster-chunk1] (log)
tests/sentry/spans/test_buffer.py:1344: in test_per_trace_flush_limit_throttles
    assert len(rv) == 2
E   AssertionError: assert 4 == 2
E    +  where 4 = len({b'span-buf:s:{1:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}:a000000000000000': FlushedSegment(queue_key=b'span-buf:q:0', spans=... 'bbbbbbbbbbbbbbbb'}}, 'is_segment': True})], project_id=2, score=10.0, ingested_count=1, distributed_payload_keys=[])})
tests/sentry/spans/test_buffer.py::test_per_trace_flush_limit_throttles[single-nochunk] (log)
tests/sentry/spans/test_buffer.py:1344: in test_per_trace_flush_limit_throttles
    assert len(rv) == 2
E   AssertionError: assert 4 == 2
E    +  where 4 = len({b'span-buf:s:{1:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}:a000000000000000': FlushedSegment(queue_key=b'span-buf:q:0', spans=... 'bbbbbbbbbbbbbbbb'}}, 'is_segment': True})], project_id=2, score=10.0, ingested_count=1, distributed_payload_keys=[])})

@github-actions
Contributor

Backend Test Failures

Failures on c65ee58 in this run:

tests/sentry/spans/test_buffer.py::test_per_trace_flush_limit_throttles[cluster-nochunk] (log)
tests/sentry/spans/test_buffer.py:1344: in test_per_trace_flush_limit_throttles
    assert len(rv) == 2
E   AssertionError: assert 4 == 2
E    +  where 4 = len({b'span-buf:s:{1:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}:a000000000000000': FlushedSegment(queue_key=b'span-buf:q:0', spans=... 'bbbbbbbbbbbbbbbb'}}, 'is_segment': True})], project_id=2, score=10.0, ingested_count=1, distributed_payload_keys=[])})
tests/sentry/spans/test_buffer.py::test_per_trace_flush_limit_throttles[single-chunk1] (log)
tests/sentry/spans/test_buffer.py:1344: in test_per_trace_flush_limit_throttles
    assert len(rv) == 2
E   AssertionError: assert 4 == 2
E    +  where 4 = len({b'span-buf:s:{1:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}:a000000000000000': FlushedSegment(queue_key=b'span-buf:q:0', spans=... 'bbbbbbbbbbbbbbbb'}}, 'is_segment': True})], project_id=2, score=10.0, ingested_count=1, distributed_payload_keys=[])})

The per-trace segment flush limit feature was reverted on master
(083621e). The previous merge conflict resolution kept the dead
method definition and tests. Remove them to match master.
- Use ingested_counts[segment_key] instead of .get(…, 0) so a missing
  ic (which should never happen) raises KeyError instead of silently
  producing a zero that skips Phase 2 cleanup.
- Pass score as float to the Lua script instead of truncating via int(),
  preserving full precision for the comparison.
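The second point matters because queue scores can carry fractional parts; a quick illustration of why int() truncation would break the Lua-side equality check (the score value here is hypothetical):

```python
stored_score = 1710604800.75  # hypothetical fractional score as stored in the ZSET
passed = int(stored_score)    # truncated before being sent to the Lua script

# The equality comparison against the stored score would spuriously fail,
# so the conditional ZREM would skip removal even though nothing changed:
assert passed != stored_score
# Passing the score through as a float preserves the comparison:
assert float(stored_score) == stored_score
```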
@untitaker untitaker enabled auto-merge (squash) March 16, 2026 17:21
spans=output_spans,
project_id=int(project_id.decode("ascii")),
score=score,
ingested_count=ingested_counts.get(segment_key, 0),
Contributor


Bug: When conditional cleanup removes a segment's queue entry (Phase 1) but the subsequent data deletion fails (Phase 2), the segment data is orphaned in Redis without being re-queued.
Severity: HIGH

Suggested Fix

Modify the cleanup logic to handle cases where Phase 1 succeeds but Phase 2 fails. One approach is to re-add the segment's entry to the processing queue if the data deletion in Phase 2 is skipped for a segment whose queue entry was already removed in Phase 1. This ensures the segment will be retried later, preventing it from being orphaned.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: src/sentry/spans/buffer.py#L640

Potential issue: The two-phase conditional cleanup logic can orphan segment data in
Redis. If Phase 1 successfully removes a segment's entry from the processing queue
(because its score matches), but Phase 2 fails to delete the segment's data (e.g.,
because the `ingested_count` key has expired or changed), the segment is marked for
skipping. However, there is no mechanism to re-queue segments that are skipped after
their queue entry has already been removed. This results in the segment data remaining
in Redis indefinitely without any way to be processed or cleaned up.

Member Author


If deletion fails, I don't think we should try to flush again; the data has already been flushed downstream. The only consequence is some leftover keys in Redis that will expire via TTL.

@untitaker untitaker merged commit 8d536a9 into master Mar 16, 2026
61 checks passed
@untitaker untitaker deleted the span-buffer-race-conditions branch March 16, 2026 17:46
Contributor

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


ingested_byte_count = ingested_results[i * 2 + 1]

if ingested_count:
ingested_counts[key] = int(ingested_count)
Contributor


Unsafe ingested count capture ordering

High Severity

_load_segment_data captures ic after scanning segment sets. If new spans are added between scan completion and this GET, expected_ic includes spans not present in flushed data, allowing Phase 2 cleanup to delete unseen spans when counts still match.

