
alternator/streams: keep disabled streams usable and purge on re-enable#29413

Open
ScyllaPiotr wants to merge 4 commits into scylladb:master from ScyllaPiotr:hollow/alternator-streams-7239

Conversation

Contributor

@ScyllaPiotr ScyllaPiotr commented Apr 9, 2026

When an Alternator stream is disabled, its data should remain accessible so that consumers can finish reading. When the stream is later re-enabled, a new StreamArn is produced, and only then is the old data purged.

On disable, the existing CDC options (including preimage and postimage) are preserved so that DescribeStream can still report StreamViewType. All stream APIs continue to work on the disabled stream, with all shards reported as closed (EndingSequenceNumber set). No new CDC records are written; existing data expires via TTL after 24 hours.

On re-enable, the old CDC log table is dropped as a separate Raft group0 schema change and a fresh one is created with a new UUID, giving a new StreamArn. This is Alternator-specific — CQL CDC keeps reusing the log table. Re-enabling is the only way to immediately purge old stream data.

Old stream data is removed immediately upon re-enable (a discrepancy with DynamoDB, which keeps it readable for 24 hours through the old StreamArn).

Tests updated to cover the new disable and re-enable behavior.

Fixes #7239
Fixes SCYLLADB-523
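The lifecycle described above can be illustrated with a toy in-memory model (this is not the Scylla implementation; ToyStream and its fields are hypothetical stand-ins for the CDC log table and its ARN):

```python
import uuid

class ToyStream:
    """Toy model of the disable/re-enable semantics described above."""
    def __init__(self):
        self.arn = str(uuid.uuid4())
        self.enabled = True
        self.records = []

    def write(self, rec):
        if self.enabled:              # no new CDC records while disabled
            self.records.append(rec)

    def disable(self):
        self.enabled = False          # existing data stays readable (expires via TTL)

    def reenable(self):
        self.records = []             # old data purged immediately
        self.arn = str(uuid.uuid4())  # fresh log table, hence a new StreamArn
        self.enabled = True

s = ToyStream()
s.write('r1')
old_arn = s.arn
s.disable()
assert s.records == ['r1']   # still readable while disabled
s.write('r2')                # dropped: stream is disabled
s.reenable()
assert s.arn != old_arn      # re-enable mints a new StreamArn
assert s.records == []       # old stream data is gone
```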

@ScyllaPiotr ScyllaPiotr force-pushed the hollow/alternator-streams-7239 branch from c898ba6 to 719517c on April 9, 2026 15:53
@ScyllaPiotr ScyllaPiotr added area/alternator Alternator related Issues area/alternator-streams backport/none Backport is not required labels Apr 9, 2026
@ScyllaPiotr ScyllaPiotr marked this pull request as ready for review April 9, 2026 15:55
@scylladbbot

Docs Preview 📖

Docs Preview for this pull request is available here


Note: This preview will be available for 30 days and will be automatically deleted after that period. You can manually trigger a new build by committing changes.

Comment thread test/alternator/test_streams.py Outdated
# through the old ARN. In Alternator, the old stream data is purged
# immediately upon re-enabling (a documented discrepancy).
# Reproduces issue #7239.
def test_streams_reenable(dynamodb, dynamodbstreams, scylla_only):
Contributor


I think this test should be xfail, not "scylla_only". This is a feature that DynamoDB has and Scylla doesn't: it's not a Scylla extension, it's a Scylla deficiency.

Contributor Author

@ScyllaPiotr ScyllaPiotr Apr 10, 2026


You're right.
I wanted the final assert to use pytest.xfail for the non-AWS case, but that's not possible, so I will split it into two tests instead.
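For context on why the split is needed: the imperative pytest.xfail() call raises immediately and aborts the rest of the test body, so it cannot mark only a final assert as expected-to-fail; the declarative marker applies to a whole test function. A hypothetical sketch (ON_AWS and the test name are invented):

```python
import pytest

ON_AWS = False  # hypothetical flag: True when the target is real DynamoDB

# pytest.xfail("...") raises immediately when called, so nothing after
# it runs; it cannot wrap just one assertion. The declarative marker
# below applies to the whole test, so the DynamoDB-incompatible
# expectation gets its own test function.
@pytest.mark.xfail(condition=not ON_AWS,
                   reason="Scylla purges old stream data immediately on re-enable")
def test_old_arn_readable_after_reenable():
    # Hypothetical body: read records through the old StreamArn here.
    pass
```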

@scylladbbot

🟢 CI State: SUCCESS

Mode Stage Status Queue Time Run Time Node
Checkout 2m spider9.cloudius-systems.com
Framework test 0m spider9.cloudius-systems.com
Build 3h 21m 38m spider7.cloudius-systems.com
dev test.py 15m spider7.cloudius-systems.com
Unit Tests Custom
🔹 test_streams_reenable
0m 2m spider7.cloudius-systems.com
release test.py 29m spider7.cloudius-systems.com
Unit Tests Custom
🔹 test_streams_reenable
0m 4m spider7.cloudius-systems.com
debug test.py 2h 41m spider7.cloudius-systems.com
Unit Tests Custom
🔹 test_streams_reenable
0m 8m spider7.cloudius-systems.com
dtest dtest with tablets 1h 52m
dtest with consistent topology changes 2h

Build Details:

  • Total Queue: 3h 21m
  • Total Run: 7h 40m

@scylladbbot

🟢 CI State: SUCCESS

Mode Stage Status Queue Time Run Time Node
Checkout 2m spider9.cloudius-systems.com
Framework test 0m spider9.cloudius-systems.com
Build 0m 33m spider4.cloudius-systems.com
dev test.py 16m spider4.cloudius-systems.com
Unit Tests Custom
🔹 test_get_records
🔹 test_get_shard_iterator_for_nonexistent_stream
🔹 test_latest_stream_arn
🔹 test_latest_stream_label
🔹 test_list_streams_alter
🔹 test_list_streams_create
🔹 test_list_streams_paged
🔹 test_list_streams_paged_with_new_table
🔹 test_list_streams_too_high_limit
🔹 test_list_streams_zero_limit
🔹 test_stream_arn_unchanging
🔹 test_stream_table_name_length_192_update
🔹 test_stream_table_name_length_222_update
🔹 test_streams_closed_read
🔹 test_streams_disable_data_survives
🔹 test_streams_disabled_stream
🔹 test_streams_multiple_items_one_partition
🔹 test_streams_reenable
🔹 test_streams_starting_sequence_number
🔹 test_streams_trim_horizon
0m 6m spider4.cloudius-systems.com
release test.py 31m spider4.cloudius-systems.com
Unit Tests Custom
🔹 test_get_records
🔹 test_get_shard_iterator_for_nonexistent_stream
🔹 test_latest_stream_arn
🔹 test_latest_stream_label
🔹 test_list_streams_alter
🔹 test_list_streams_create
🔹 test_list_streams_paged
🔹 test_list_streams_paged_with_new_table
🔹 test_list_streams_too_high_limit
🔹 test_list_streams_zero_limit
🔹 test_stream_arn_unchanging
🔹 test_stream_table_name_length_192_update
🔹 test_stream_table_name_length_222_update
🔹 test_streams_closed_read
🔹 test_streams_disable_data_survives
🔹 test_streams_disabled_stream
🔹 test_streams_multiple_items_one_partition
🔹 test_streams_reenable
🔹 test_streams_starting_sequence_number
🔹 test_streams_trim_horizon
0m 9m spider4.cloudius-systems.com
debug test.py 2h 42m spider4.cloudius-systems.com
Unit Tests Custom
🔹 test_get_records
🔹 test_get_shard_iterator_for_nonexistent_stream
🔹 test_latest_stream_arn
🔹 test_latest_stream_label
🔹 test_list_streams_alter
🔹 test_list_streams_create
🔹 test_list_streams_paged
🔹 test_list_streams_paged_with_new_table
🔹 test_list_streams_too_high_limit
🔹 test_list_streams_zero_limit
🔹 test_stream_arn_unchanging
🔹 test_stream_table_name_length_192_update
🔹 test_stream_table_name_length_222_update
🔹 test_streams_closed_read
🔹 test_streams_disable_data_survives
🔹 test_streams_disabled_stream
🔹 test_streams_multiple_items_one_partition
🔹 test_streams_reenable
🔹 test_streams_starting_sequence_number
🔹 test_streams_trim_horizon
0m 31m spider4.cloudius-systems.com
dtest dtest with tablets 41m
dtest with consistent topology changes 47m

Build Details:

  • Total Queue: 0m
  • Total Run: 4h 50m

@nyh
Contributor

nyh commented Apr 14, 2026

@scylladb/releng - FYI there is a "iwyu / Analyze #includes in source files (pull_request)" failure below which I don't understand, and probably has nothing to do with this PR.

@yaronkaikov
Contributor

@scylladb/releng - FYI there is a "iwyu / Analyze #includes in source files (pull_request)" failure below which I don't understand, and probably has nothing to do with this PR.

@nyh We don't maintain this workflow, I am not sure what it is about

@nyh
Contributor

nyh commented Apr 14, 2026

@scylladb/releng - FYI there is a "iwyu / Analyze #includes in source files (pull_request)" failure below which I don't understand, and probably has nothing to do with this PR.

@nyh We don't maintain this workflow, I am not sure what it is about

@avikivity @mykaul we need to either have one person in charge of all these github workflows - or at least maintain a document on who's responsible for which. When one of these bots breaks, it kills everyone's work, and I have no idea whom to ask.

Two more options we can have:

  1. Have fewer of these workflows. There is no law of nature that we need a "iwyu" checker in github.
  2. Move all these workflows from places and languages nobody knows into our main ninja build system, so for example we'll have a target "ninja ci-check" and it will do all the CI checks - and everybody will know how to find what this target does and propose patches.

Collaborator

@annastuchlik annastuchlik left a comment


The docs look good.

@mykaul
Contributor

mykaul commented Apr 15, 2026

@scylladb/releng - FYI there is a "iwyu / Analyze #includes in source files (pull_request)" failure below which I don't understand, and probably has nothing to do with this PR.

@nyh We don't maintain this workflow, I am not sure what it is about

@avikivity @mykaul we need to either have one person in charge of all these github workflows - or at least maintain a document on who's responsible for which. When one of these bots breaks, it kills everyone's work, and I have no idea whom to ask.

How is it different from code modules? There's no one person in charge of any code module / file / whatnot.

Two more options we can have:

  1. Have fewer of these workflows. There is no law of nature that we need a "iwyu" checker in github.
  2. Move all these workflows from places and languages nobody knows into our main ninja build system, so for example we'll have a target "ninja ci-check" and it will do all the CI checks - and everybody will know how to find what this target does and propose patches.

I was as surprised as you were to see the new license. I don't know how it happened.

@scylladbbot

🔴 CI State: FAILURE

Mode Stage Status Queue Time Run Time Node
Checkout 2m spider9.cloudius-systems.com
Framework test 0m spider9.cloudius-systems.com
Build 2h 36m 24m spider2.cloudius-systems.com
dev test.py 17m spider2.cloudius-systems.com
Unit Tests Custom
🔹 test_get_records
🔹 test_get_shard_iterator_for_nonexistent_stream
🔹 test_latest_stream_arn
🔹 test_latest_stream_label
🔹 test_list_streams_alter
🔹 test_list_streams_create
🔹 test_list_streams_paged
🔹 test_list_streams_paged_with_new_table
🔹 test_list_streams_too_high_limit
🔹 test_list_streams_zero_limit
🔹 test_stream_arn_unchanging
🔹 test_stream_table_name_length_192_update
🔹 test_stream_table_name_length_222_update
🔹 test_streams_closed_read
🔹 test_streams_disable_data_survives
🔹 test_streams_disabled_stream
🔹 test_streams_multiple_items_one_partition
🔹 test_streams_reenable
🔹 test_streams_starting_sequence_number
🔹 test_streams_trim_horizon
0m 7m spider2.cloudius-systems.com
release test.py 32m spider2.cloudius-systems.com
Unit Tests Custom
🔹 test_get_records
🔹 test_get_shard_iterator_for_nonexistent_stream
🔹 test_latest_stream_arn
🔹 test_latest_stream_label
🔹 test_list_streams_alter
🔹 test_list_streams_create
🔹 test_list_streams_paged
🔹 test_list_streams_paged_with_new_table
🔹 test_list_streams_too_high_limit
🔹 test_list_streams_zero_limit
🔹 test_stream_arn_unchanging
🔹 test_stream_table_name_length_192_update
🔹 test_stream_table_name_length_222_update
🔹 test_streams_closed_read
🔹 test_streams_disable_data_survives
🔹 test_streams_disabled_stream
🔹 test_streams_multiple_items_one_partition
🔹 test_streams_reenable
🔹 test_streams_starting_sequence_number
🔹 test_streams_trim_horizon
0m 9m spider2.cloudius-systems.com
debug test.py 3h 22m spider2.cloudius-systems.com
dtest dtest with tablets 41m
dtest with consistent topology changes 49m

Failed Tests (3/67720):

📄 Elasticsearch analysis for test_random_failures[stop_during_gossip_shadow_round-restart_coordinator_node]
Period CI Failures / Total CI Pass Rate (%) Next Failures / Total Next Pass Rate (%)
Week (7d) 3/3 0.0 0/0 0
Month (30d) 3/12 75.0 0/2 100.0
Quarter (90d) 3/22 86.36 0/11 100.0

Recent 10 Failures

Test Name Arch Mode Jenkins Node Jenkins Job Ref Timestamp Days Ago
test_random_failures[stop_during_gossip_shadow_round-restart_coordinator_node].debug.1 x86_64 debug spider2.cloudius-systems.com scylla-master/scylla-ci PR#29413 2026-04-15 16:37 today
test_random_failures[stop_during_gossip_shadow_round-restart_coordinator_node].debug.2 x86_64 debug spider2.cloudius-systems.com scylla-master/scylla-ci PR#29413 2026-04-15 16:37 today
test_random_failures[stop_during_gossip_shadow_round-restart_coordinator_node].debug.3 x86_64 debug spider2.cloudius-systems.com scylla-master/scylla-ci PR#29413 2026-04-15 16:37 today

Build Details:

  • Total Queue: 2h 36m
  • Total Run: 7h 30m

Note: To re-trigger CI for this PR, comment: @scylladbbot trigger-ci

@scylladbbot

🟢 CI State: SUCCESS

Mode Stage Status Queue Time Run Time Node
Checkout 2m spider9.cloudius-systems.com
Framework test 0m spider9.cloudius-systems.com
Build 2h 15m 52m spider8.cloudius-systems.com
dev test.py 17m spider8.cloudius-systems.com
Unit Tests Custom
🔹 test_get_records
🔹 test_get_shard_iterator_for_nonexistent_stream
🔹 test_latest_stream_arn
🔹 test_latest_stream_label
🔹 test_list_streams_alter
🔹 test_list_streams_create
🔹 test_list_streams_paged
🔹 test_list_streams_paged_with_new_table
🔹 test_list_streams_too_high_limit
🔹 test_list_streams_zero_limit
🔹 test_stream_arn_unchanging
🔹 test_stream_table_name_length_192_update
🔹 test_stream_table_name_length_222_update
🔹 test_streams_closed_read
🔹 test_streams_disable_data_survives
🔹 test_streams_disabled_stream
🔹 test_streams_multiple_items_one_partition
🔹 test_streams_reenable
🔹 test_streams_starting_sequence_number
🔹 test_streams_trim_horizon
0m 6m spider8.cloudius-systems.com
release test.py 32m spider8.cloudius-systems.com
Unit Tests Custom
🔹 test_get_records
🔹 test_get_shard_iterator_for_nonexistent_stream
🔹 test_latest_stream_arn
🔹 test_latest_stream_label
🔹 test_list_streams_alter
🔹 test_list_streams_create
🔹 test_list_streams_paged
🔹 test_list_streams_paged_with_new_table
🔹 test_list_streams_too_high_limit
🔹 test_list_streams_zero_limit
🔹 test_stream_arn_unchanging
🔹 test_stream_table_name_length_192_update
🔹 test_stream_table_name_length_222_update
🔹 test_streams_closed_read
🔹 test_streams_disable_data_survives
🔹 test_streams_disabled_stream
🔹 test_streams_multiple_items_one_partition
🔹 test_streams_reenable
🔹 test_streams_starting_sequence_number
🔹 test_streams_trim_horizon
0m 8m spider8.cloudius-systems.com
debug test.py 2h 50m spider8.cloudius-systems.com
Unit Tests Custom
🔹 test_get_records
🔹 test_get_shard_iterator_for_nonexistent_stream
🔹 test_latest_stream_arn
🔹 test_latest_stream_label
🔹 test_list_streams_alter
🔹 test_list_streams_create
🔹 test_list_streams_paged
🔹 test_list_streams_paged_with_new_table
🔹 test_list_streams_too_high_limit
🔹 test_list_streams_zero_limit
🔹 test_stream_arn_unchanging
🔹 test_stream_table_name_length_192_update
🔹 test_stream_table_name_length_222_update
🔹 test_streams_closed_read
🔹 test_streams_disable_data_survives
🔹 test_streams_disabled_stream
🔹 test_streams_multiple_items_one_partition
🔹 test_streams_reenable
🔹 test_streams_starting_sequence_number
🔹 test_streams_trim_horizon
0m 32m spider8.cloudius-systems.com
dtest dtest with tablets 42m
dtest with consistent topology changes 48m

Build Details:

  • Total Queue: 2h 15m
  • Total Run: 7h 37m

@ScyllaPiotr ScyllaPiotr requested a review from nyh April 17, 2026 11:22
@ScyllaPiotr ScyllaPiotr force-pushed the hollow/alternator-streams-7239 branch from 3d29d91 to 72743db on April 21, 2026 08:07
Previously, disabling Alternator Streams would create a blank
cdc::options with only enabled=false, which meant also losing access
to the stored stream data (including preimage and postimage).

Now, when a stream is disabled:
- The existing CDC options are preserved (only 'enabled' is flipped to
  false), so StreamViewType remains available.
- DescribeStream enumerates all shards with EndingSequenceNumber set,
  indicating they are closed.
- GetRecords omits NextShardIterator for disabled streams.
- DescribeTable (supplement_table_stream_info) reports the stream ARN
  and StreamEnabled: false when the CDC log table still exists.
- ListStreams uses get_base_table instead of is_log_for_some_table so
  that disabled streams whose log table still exists are listed.

When a stream is re-enabled on an Alternator table that has an existing
(disabled) CDC log table, the old log table is dropped and a fresh one
is created with a new UUID, producing a new StreamArn. This is
Alternator-specific behavior; CQL CDC tables continue to reuse the
existing log table.

The old stream data is lost immediately upon re-enable. DynamoDB keeps
it readable for 24 hours.

Tests:
- test_streams_closed_read, test_streams_disabled_stream: remove xfail
  now that disabled streams are usable.
- test_streams_disable_data_survives: new test verifying that stream
  data remains readable after disabling.
- test_streams_reenable: new test verifying that re-enabling produces
  a new ARN and the old data is still readable via the old ARN (xfail
  on Scylla, which currently purges old data on re-enable).

Fixes scylladb#7239
Both disable_stream and wait_for_active_stream used time.process_time()
for their timeouts, but process_time measures CPU time, not wall-clock
time. Since these loops spend most of their time sleeping and waiting on
API calls, the timeouts could last far longer than intended. Use
time.time() instead to enforce actual wall-clock deadlines.
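A quick way to see the difference: time.process_time() accumulates CPU time only, so a loop that mostly sleeps barely moves it, and a deadline computed from it can stretch far past the intended wall-clock duration:

```python
import time

start_cpu = time.process_time()
start_wall = time.time()
time.sleep(0.5)  # the polling loops spend most of their time like this

cpu_elapsed = time.process_time() - start_cpu
wall_elapsed = time.time() - start_wall

# The sleep is invisible to process_time but fully counted by time.time.
assert cpu_elapsed < 0.25
assert wall_elapsed >= 0.45
```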
@scylladbbot

🔴 CI State: FAILURE

Mode Stage Status Queue Time Run Time Node
Checkout 2m spider9.cloudius-systems.com
Framework test 0m spider9.cloudius-systems.com
Build 16m 25m spider1.cloudius-systems.com
dev test.py 17m spider1.cloudius-systems.com

Failed Tests (1/8942):

📄 Elasticsearch analysis for test_deferred_stream_enablement_on_tablets
Period CI Failures / Total CI Pass Rate (%) Next Failures / Total Next Pass Rate (%)
Week (7d) 1/1129 99.91 0/198 100.0
Month (30d) 2/1952 99.9 0/198 100.0
Quarter (90d) 2/1952 99.9 0/198 100.0

Recent 10 Failures

Test Name Arch Mode Jenkins Node Jenkins Job Ref Timestamp Days Ago
test_deferred_stream_enablement_on_tablets.dev.1 x86_64 dev spider1.cloudius-systems.com scylla-master/scylla-ci PR#29413 2026-04-21 08:56 today
test_deferred_stream_enablement_on_tablets.dev.1 x86_64 dev spider6.cloudius-systems.com scylla-master/scylla-ci PR#29224 2026-04-10 13:42 10

Build Details:

  • Total Queue: 16m
  • Total Run: 1h 1m

Note: To re-trigger CI for this PR, comment: @scylladbbot trigger-ci

@nyh
Contributor

nyh commented Apr 21, 2026

@ScyllaPiotr if I understand correctly, the flaky test test_deferred_stream_enablement_on_tablets is actually a new test that you wrote a few days ago (unrelated to this patch), in a5d35d2. Can you please take a look? I'm completely hopeless in finding anything in Jira these days.

@github-actions github-actions Bot added the P2 High Priority label Apr 21, 2026
…n disable

When disabling Alternator Streams, the disable path copied existing CDC
options and only flipped enabled to false, preserving enable_requested
and tablet_merge_blocked. This meant that:

1. Disabling during the ENABLING state left enable_requested=true, so
   the topology coordinator would finalize the enablement anyway.
2. tablet_merge_blocked=true persisted after disable, blocking tablet
   merges forever.

Clear both flags when disabling streams.

Fixes the Phase 4 failure in test_deferred_stream_enablement_on_tablets.
@ScyllaPiotr
Contributor Author

The bug this test caught was related to this patch. Fixed, pushed.

@ScyllaPiotr
Contributor Author

@nyh I know you can't merge, but please review if possible.

@nyh
Contributor

nyh commented Apr 26, 2026

On re-enable, the old CDC log table is dropped as a separate Raft group0 schema change and a fresh one is created with a new UUID, giving a new StreamArn. This is Alternator-specific — CQL CDC keeps reusing the log table.

Do you know if this is intentional in CQL - that if you disable CDC and then re-enable it, the old data remains behind, and you get a mix of old and new data with a time gap in the middle? Does anybody actually want this behavior? CC @piodul.

Re-enabling is the only way to immediately purge old stream data.

Old stream data is removed immediately upon re-enable (a discrepancy with DynamoDB, which keeps it readable for 24 hours through the old StreamArn).

I see. I hope we have a test showing this discrepancy (we don't need to fix it, we may never want to fix it, but we must know it exists).

Tests updated to cover the new disable and re-enable behavior.

Fixes #7239 Fixes SCYLLADB-523


Copilot AI left a comment


Pull request overview

This PR updates Alternator Streams behavior so that disabling a stream keeps existing stream data readable (while stopping new records), and re-enabling forces creation of a fresh stream (new StreamArn) while purging old data. It also updates tests and documentation to reflect these semantics and the remaining DynamoDB-compatibility differences.

Changes:

  • Keep disabled streams listed/usable for reads by preserving CDC options and treating shards as closed.
  • On re-enable, drop the old CDC log table first so the subsequent enable creates a new log table/UUID (new StreamArn).
  • Update Alternator Streams tests and compatibility documentation for the new disable/re-enable behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
test/alternator/test_streams.py Fixes stream-disable wait timing; adds new tests for post-disable readability and re-enable behavior.
docs/alternator/compatibility.md Documents how disabled streams remain readable and how re-enable purges old data (differs from DynamoDB).
alternator/streams.cc Lists disabled streams whose log tables still exist; describes disabled streams as closed; adjusts GetRecords behavior and CDC option handling on disable.
alternator/executor.hh Extends add_stream_options() API to accept existing CDC options when disabling.
alternator/executor.cc Passes existing CDC options on disable; on re-enable drops the old CDC log table before creating a new one.

Comment thread alternator/streams.cc
Comment on lines +1525 to 1531
if (!base->cdc_options().enabled()) {
    // Stream is disabled -- all shards are closed (#7239).
    // Don't return NextShardIterator.
} else if (shard.time < ts && ts < high_ts) {
    // The DynamoDB documentation states that when a shard is
    // closed, reading it until the end has NextShardIterator
    // "set to null". Our test test_streams_closed_read

Copilot AI Apr 26, 2026


In this end-of-shard decision logic, the subsequent else-branch (when the shard is still open) constructs a next_iter but then returns the old iter as NextShardIterator. That leaves next_iter unused (will fail under -Werror) and prevents advancing the iterator threshold, risking callers repeatedly querying the same empty range. Ensure the returned NextShardIterator actually uses the advanced iterator value.

Contributor


I didn't understand what the AI meant here, but please investigate this.

print('disabled stream on {}'.format(table.name))
return
time.sleep(0.5)
time.sleep(0.1)

Copilot AI Apr 26, 2026


In disable_stream(), the polling loop now sleeps only 0.1s between ListStreams/DescribeStream calls. This can exceed DynamoDB’s documented DescribeStream rate limits (and will scale with the number of historic streams), making the tests more likely to be throttled/flaky. Consider restoring a slower poll interval (e.g., 0.5s) or adding backoff/jitter while still using time.time() for the timeout.

Suggested change
time.sleep(0.1)
time.sleep(0.5)

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine.
I didn't realize it at first, but Copilot is right - DescribeStream indeed has a 10-per-second limit (https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html). But with a 0.1s sleep plus the unavoidable network lag, I think 0.1 is about right. Not to mention that if this request does get throttled, it just means the SDK will retry, so no harm done. On Alternator, shorter sleeps between retries can make tests faster.

Comment on lines +2424 to +2431
desc = dynamodbstreams.describe_stream(StreamArn=arn)['StreamDescription']
nrecords = 0
for shard in desc['Shards']:
    iter = dynamodbstreams.get_shard_iterator(StreamArn=arn,
        ShardId=shard['ShardId'], ShardIteratorType='TRIM_HORIZON')['ShardIterator']
    response = dynamodbstreams.get_records(ShardIterator=iter)
    if 'Records' in response:
        nrecords += len(response['Records'])

Copilot AI Apr 26, 2026


_count_stream_records() claims to count “all shards” and “total number of records”, but it only reads the first DescribeStream page and only a single GetRecords page per shard (ignoring NextShardIterator). This can undercount and make the new assertions unreliable. Consider iterating DescribeStream pages (LastEvaluatedShardId) and paging GetRecords until NextShardIterator is absent to actually consume the full stream (or rename/comment it to reflect what it really does).

Suggested change
desc = dynamodbstreams.describe_stream(StreamArn=arn)['StreamDescription']
nrecords = 0
for shard in desc['Shards']:
    iter = dynamodbstreams.get_shard_iterator(StreamArn=arn,
        ShardId=shard['ShardId'], ShardIteratorType='TRIM_HORIZON')['ShardIterator']
    response = dynamodbstreams.get_records(ShardIterator=iter)
    if 'Records' in response:
        nrecords += len(response['Records'])

shards = []
last_evaluated_shard_id = None
while True:
    kwargs = {'StreamArn': arn}
    if last_evaluated_shard_id:
        kwargs['ExclusiveStartShardId'] = last_evaluated_shard_id
    desc = dynamodbstreams.describe_stream(**kwargs)['StreamDescription']
    shards.extend(desc.get('Shards', []))
    last_evaluated_shard_id = desc.get('LastEvaluatedShardId')
    if not last_evaluated_shard_id:
        break
nrecords = 0
for shard in shards:
    shard_iter = dynamodbstreams.get_shard_iterator(StreamArn=arn,
        ShardId=shard['ShardId'], ShardIteratorType='TRIM_HORIZON')['ShardIterator']
    while shard_iter:
        response = dynamodbstreams.get_records(ShardIterator=shard_iter)
        nrecords += len(response.get('Records', []))
        shard_iter = response.get('NextShardIterator')

Contributor


Copilot is right.
One option is to fix this function as Copilot suggested. The other option, if you only care whether the count is zero or not, is to rename this function stream_is_empty() or something and have it return true or false. But even with stream_is_empty(), you should look at all shards, not just the first page.
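A stream_is_empty() along those lines might look like this sketch (assuming a boto3-style dynamodbstreams client; not tested against real DynamoDB). Note the inner loop only terminates once NextShardIterator disappears, which holds for a disabled stream whose shards are all closed:

```python
def stream_is_empty(dynamodbstreams, arn):
    """Return True if no records are readable anywhere in the stream."""
    # Page through DescribeStream: shards can span multiple pages,
    # linked by LastEvaluatedShardId.
    shards = []
    last = None
    while True:
        kwargs = {'StreamArn': arn}
        if last:
            kwargs['ExclusiveStartShardId'] = last
        desc = dynamodbstreams.describe_stream(**kwargs)['StreamDescription']
        shards.extend(desc.get('Shards', []))
        last = desc.get('LastEvaluatedShardId')
        if not last:
            break
    # Page through GetRecords per shard until NextShardIterator is
    # absent (a fully-read closed shard stops returning one).
    for shard in shards:
        it = dynamodbstreams.get_shard_iterator(
            StreamArn=arn, ShardId=shard['ShardId'],
            ShardIteratorType='TRIM_HORIZON')['ShardIterator']
        while it:
            resp = dynamodbstreams.get_records(ShardIterator=it)
            if resp.get('Records'):
                return False
            it = resp.get('NextShardIterator')
    return True
```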

table.update_item(Key={'p': p, 'c': random_string()},
    UpdateExpression='SET x = :val1', ExpressionAttributeValues={':val1': 5})

assert _count_stream_records(dynamodbstreams, arn1) > 0

Copilot AI Apr 26, 2026


The new tests assert that stream records are immediately visible right after update_item(). Alternator streams are asynchronous (and this file already uses retry loops/timeouts elsewhere), so these immediate “> 0” assertions are likely to be flaky on both AWS and Scylla. Consider adding a bounded retry loop (like fetch_and_compare_events()/other tests above) to wait until at least one record becomes visible before asserting on counts.

Suggested change
assert _count_stream_records(dynamodbstreams, arn1) > 0
deadline = time.time() + 10
count = 0
while time.time() < deadline:
    count = _count_stream_records(dynamodbstreams, arn1)
    if count > 0:
        break
    time.sleep(0.1)
assert count > 0

Contributor


Good catch.

Comment thread alternator/executor.cc
Comment on lines +1893 to +1897
    group0_guard = co_await mm.start_group0_operation();
    tab = get_table(p.local(), request);
    builder = schema_builder(tab);
    add_stream_options(*stream_specification, builder, p.local(), tab->cdc_options());
}

Copilot AI Apr 26, 2026


After dropping the old CDC log table, the code reloads tab/builder and re-applies add_stream_options(), but it does not re-run the follow-up steps that were applied to the original builder when enabling streams (e.g., validate_cdc_log_name_length() and, on tablet tables, defer_enabling_streams_block_tablet_merges()). Because builder was reconstructed, those adjustments are lost and re-enable behavior can diverge from first-time enablement on tablet tables. Consider re-applying the same post-processing after the second add_stream_options() call (or refactor to avoid duplicating the enable flow).

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my proposal to just do "continue" will do the right thing for free.

Contributor

@nyh nyh left a comment


Thanks, looks mostly good, but I have a couple of comments and requests (and Copilot had some too); some of them are actually potential bugs that should be fixed, others are less serious nitpicks.

Comment thread alternator/executor.cc
        format("alternator-executor: drop old CDC log for {}", tab->cf_name()));
    co_await mm.wait_for_schema_agreement(
        p.local().local_db(), db::timeout_clock::now() + 10s, nullptr);
    group0_guard = co_await mm.start_group0_operation();
Contributor


I think this code is not concurrency-safe:

Between dropping the old guard (the std::move above) and starting a new one here, things might have changed. Some parallel operation might have re-enabled CDC or deleted the table entirely or I don't know what.

I think there's an easy solution: you just need to call "continue" here to retry from the beginning, and not try to assume that re-capturing the guard will work. In the "good" case, after the "continue" you will reach this code again, and it will notice that the log table no longer exists, so it will have nothing to do.

Comment thread alternator/executor.cc
Comment on lines +1893 to +1897
    group0_guard = co_await mm.start_group0_operation();
    tab = get_table(p.local(), request);
    builder = schema_builder(tab);
    add_stream_options(*stream_specification, builder, p.local(), tab->cdc_options());
}
Contributor


I think my proposal to just do "continue" will do the right thing for free.

Comment thread alternator/streams.cc
Comment on lines +1525 to 1531
if (!base->cdc_options().enabled()) {
// Stream is disabled -- all shards are closed (#7239).
// Don't return NextShardIterator.
} else if (shard.time < ts && ts < high_ts) {
// The DynamoDB documentation states that when a shard is
// closed, reading it until the end has NextShardIterator
// "set to null". Our test test_streams_closed_read

I didn't understand what the AI meant here, but please investigate this.

Comment thread alternator/streams.cc
// When disabling, preserve the existing CDC options (preimage,
// postimage, ttl, etc.) so that DescribeStream can still report the
// correct StreamViewType on a disabled stream. Only flip enabled to
// false. See issue #7239.

Nitpick: I don't think we need to reference issue #7239 in every comment. The explanation that we need to keep the old options so DescribeStream can show them is good enough - nobody needs to go to issue 7239 for more information, and you won't find additional details there.

Comment thread alternator/streams.cc
stream_arn arn(cf.schema(), cdc::get_base_table(db.real_database(), *cf.schema()));
// Show stream info when CDC is enabled, or when it was disabled but the
// log table still exists (the stream is usable for reads). See #7239.
auto db = sp.data_dictionary();

The comment says "when CDC is enabled or ...", but I don't see the "when CDC is enabled" (opts.enabled()) in the new code. I am worried that because of this, UpdateTable may not add the right stream info in the response because the log table doesn't exist yet (?). I don't remember if we have a test for this - what UpdateTable adding a stream actually returns.

# Test that after disabling and re-enabling a stream on a table, the old
# stream data remains readable through the old ARN. In DynamoDB, it
# remains readable for 24 hours. In Scylla, it is currently purged upon
# re-enabling (issue #7239).

You will be closing #7239, so please either create a new issue about this last remaining problem and mention that issue (in the comment and xfail), or don't mention any issue. I think it's confusing to list a closed issue as an xfail reason.

table.update_item(Key={'p': p, 'c': random_string()},
UpdateExpression='SET x = :val1', ExpressionAttributeValues={':val1': 5})

assert _count_stream_records(dynamodbstreams, arn1) > 0

Good catch.
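For context, a helper like `_count_stream_records` can be sketched roughly as follows (the name and structure are assumptions about the test helper, not the actual test code): it walks every shard of the stream from TRIM_HORIZON and sums the records, relying on a closed shard eventually returning a response without `NextShardIterator`.

```python
# Illustrative sketch: count all records currently readable via `arn`,
# using a boto3 dynamodbstreams-style client.
def count_stream_records(streams_client, arn):
    desc = streams_client.describe_stream(StreamArn=arn)['StreamDescription']
    total = 0
    for shard in desc['Shards']:
        it = streams_client.get_shard_iterator(
            StreamArn=arn, ShardId=shard['ShardId'],
            ShardIteratorType='TRIM_HORIZON')['ShardIterator']
        while it:
            resp = streams_client.get_records(ShardIterator=it)
            total += len(resp['Records'])
            # A closed shard ends with no NextShardIterator in the
            # response; an open, caught-up shard returns no records.
            it = resp.get('NextShardIterator')
            if not resp['Records']:
                break
    return total
```

With the disabled-stream behavior in this PR, every shard is closed, so the loop terminates on the missing `NextShardIterator` rather than on an empty batch.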

stream purges the old data immediately and produces a new StreamArn.
In contrast, DynamoDB keeps the old stream and its data readable for
24 hours through the old StreamArn even after re-enabling.
<https://github.com/scylladb/scylla/issues/7239>

You'll be closing this issue so please don't leave a link to it as a closed issue.
You should create a new issue for the last remaining point.

Very sadly, this new link will be a JIRA issue that users who can see compatibility.md will not actually be able to see. What a colossal mistake it was to outlaw github issues :-(

CC @dani-tweig - this was one of the reasons I didn't want to drop the public github issues.

# them (the current one) may remain DISABLING for some time.
exp = time.process_time() + 60
while time.process_time() < exp:
exp = time.time() + 60

yikes. good catch. How did I ever write this code like this? :-( Thanks for fixing it.

In any case, when the loop works correctly the timeout is never reached, which is why we never noticed this problem.
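The difference is easy to demonstrate: `time.process_time()` counts CPU time consumed by the process, so it barely advances while the loop is sleeping between polls, and a 60-second deadline computed from it can take far longer than 60 wall-clock seconds to expire.

```python
import time

t_wall, t_cpu = time.time(), time.process_time()
time.sleep(0.2)                     # process is idle: consumes no CPU time
wall = time.time() - t_wall         # wall clock advanced by ~0.2s
cpu = time.process_time() - t_cpu   # CPU clock advanced by ~0s

assert wall >= 0.2   # real time passed
assert cpu < 0.1     # almost no CPU time was spent while sleeping
```

This is why `time.time()` is the right clock for a wall-clock deadline in a polling loop.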

Comment thread alternator/streams.cc
cdc::options opts = existing_cdc_opts;
opts.enabled(false);
opts.enable_requested(false);
opts.tablet_merge_blocked(false);

Please merge this fix with the first patch, that introduced this bug.


Labels

area/alternator Alternator related Issues area/alternator-streams backport/none Backport is not required P2 High Priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Alternator Stream should still be usable (for 24 hours) after disabling

7 participants