alternator/streams: keep disabled streams usable and purge on re-enable #29413
ScyllaPiotr wants to merge 4 commits into scylladb:master
Conversation
Force-pushed from c898ba6 to 719517c
Docs Preview 📖: the docs preview for this pull request is available here. Note: this preview will be available for 30 days and will be deleted automatically after that period. You can manually trigger a new build by committing changes.
```python
# through the old ARN. In Alternator, the old stream data is purged
# immediately upon re-enabling (a documented discrepancy).
# Reproduces issue #7239.
def test_streams_reenable(dynamodb, dynamodbstreams, scylla_only):
```
I think this test should be xfail, not "scylla_only". This is a feature that DynamoDB has and Scylla doesn't - it's not a Scylla extension, it's a Scylla deficiency.
You're right.
I wanted the final assert to be pytest.xfail for the non-AWS case, but that's not possible, so I will split it into two tests instead.
🟢 CI State: SUCCESS
Build Details:
Force-pushed from d1b01c7 to d5b7208
🟢 CI State: SUCCESS
Build Details:
@scylladb/releng - FYI there is an "iwyu / Analyze #includes in source files (pull_request)" failure below which I don't understand, and which probably has nothing to do with this PR.
@nyh We don't maintain this workflow, I am not sure what it is about.
@avikivity @mykaul we need to either have one person in charge of all these github workflows - or at least maintain a document on who's responsible for which. When one of these bots breaks, it kills everyone's work, and I have no idea whom to ask. Two more options we can have:
Force-pushed from d5b7208 to 3d29d91
annastuchlik left a comment:
The docs look good.
How is it different from code modules? There's no one person in charge of any code module / file / whatnot.
I was as surprised as you were to see the new license. I don't know how it happened.
🔴 CI State: FAILURE
Failed Tests (3/67720):
📄 Elasticsearch analysis for test_random_failures[stop_during_gossip_shadow_round-restart_coordinator_node]
Recent 10 Failures
Build Details:
Note: To re-trigger CI for this PR, comment:
🟢 CI State: SUCCESS
Build Details:
Force-pushed from 3d29d91 to 72743db
Previously, disabling Alternator Streams would create a blank cdc::options with only enabled=false, which meant also losing access to the stored Streams data (including preimage and postimage).

Now, when a stream is disabled:
- The existing CDC options are preserved (only 'enabled' is flipped to false), so StreamViewType remains available.
- DescribeStream enumerates all shards with EndingSequenceNumber set, indicating they are closed.
- GetRecords omits NextShardIterator for disabled streams.
- DescribeTable (supplement_table_stream_info) reports the stream ARN and StreamEnabled: false when the CDC log table still exists.
- ListStreams uses get_base_table instead of is_log_for_some_table so that disabled streams whose log table still exists are listed.

When a stream is re-enabled on an Alternator table that has an existing (disabled) CDC log table, the old log table is dropped and a fresh one is created with a new UUID, producing a new StreamArn. This is Alternator-specific behavior; CQL CDC tables continue to reuse the existing log table. The old stream data is lost immediately upon re-enable; DynamoDB keeps it readable for 24 hours.

Tests:
- test_streams_closed_read, test_streams_disabled_stream: remove xfail now that disabled streams are usable.
- test_streams_disable_data_survives: new test verifying that stream data remains readable after disabling.
- test_streams_reenable: new test verifying that re-enabling produces a new ARN and the old data is still readable via the old ARN (xfail on Scylla, which currently purges old data on re-enable).

Fixes scylladb#7239
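The disabled-stream read semantics in this commit message can be modeled with a short illustrative sketch (plain Python over dicts shaped like DescribeStream output; this is not Scylla code, and the helper names are made up):

```python
# Illustrative model of the disabled-stream semantics, not Scylla code.
# Shard dicts are shaped like DynamoDB Streams DescribeStream output.

def shard_is_closed(shard):
    # A closed shard has EndingSequenceNumber set in its sequence range.
    return 'EndingSequenceNumber' in shard.get('SequenceNumberRange', {})

def next_shard_iterator(stream_enabled, shard, iterator):
    # GetRecords omits NextShardIterator when the stream is disabled
    # (all shards closed), signaling the consumer to stop polling.
    if not stream_enabled or shard_is_closed(shard):
        return None
    return iterator

closed = {'ShardId': 'shard-0',
          'SequenceNumberRange': {'StartingSequenceNumber': '1',
                                  'EndingSequenceNumber': '9'}}
open_ = {'ShardId': 'shard-1',
         'SequenceNumberRange': {'StartingSequenceNumber': '1'}}
```

A consumer that stops when `next_shard_iterator()` returns None can thus drain a disabled stream cleanly instead of polling forever.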
Both disable_stream and wait_for_active_stream used time.process_time() for their timeouts, but process_time measures CPU time, not wall-clock time. Since these loops spend most of their time sleeping and waiting on API calls, the timeouts could last far longer than intended. Use time.time() instead to enforce actual wall-clock deadlines.
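A minimal demonstration of why this matters: time.process_time() counts CPU time, which barely advances while a loop sleeps, so a process_time deadline may effectively never expire:

```python
import time

start_cpu = time.process_time()
start_wall = time.time()
time.sleep(0.2)  # stands in for a polling loop's sleep/API wait
cpu_elapsed = time.process_time() - start_cpu
wall_elapsed = time.time() - start_wall
# cpu_elapsed stays near zero: a "60 second" process_time timeout
# would tolerate far more than 60 wall-clock seconds of sleeping.
```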
Force-pushed from 72743db to 71ee33e
🔴 CI State: FAILURE
Failed Tests (1/8942):
📄 Elasticsearch analysis for test_deferred_stream_enablement_on_tablets
Recent 10 Failures
Build Details:
Note: To re-trigger CI for this PR, comment:
@ScyllaPiotr if I understand correctly, the flaky test …
…n disable

When disabling Alternator Streams, the disable path copied existing CDC options and only flipped enabled to false, preserving enable_requested and tablet_merge_blocked. This meant that:
1. Disabling during the ENABLING state left enable_requested=true, so the topology coordinator would finalize the enablement anyway.
2. tablet_merge_blocked=true persisted after disable, blocking tablet merges forever.

Clear both flags when disabling streams.

Fixes the Phase 4 failure in test_deferred_stream_enablement_on_tablets.
The bug this test caught was related to this patch. Fixed, pushed.
@nyh I know you can't merge, but please review if possible. |
Do you know if this is intentional in CQL - that if you disable CDC and then re-enable it, the old data remains behind, and you get a mix of old data and new data with a time gap in the middle? Does anybody actually want this behavior? CC @piodul.
I see. I hope we have a test showing this discrepancy (we don't need to fix it, we may never want to fix it, but we must know it exists).
Pull request overview
This PR updates Alternator Streams behavior so that disabling a stream keeps existing stream data readable (while stopping new records), and re-enabling forces creation of a fresh stream (new StreamArn) while purging old data. It also updates tests and documentation to reflect these semantics and the remaining DynamoDB-compatibility differences.
Changes:
- Keep disabled streams listed/usable for reads by preserving CDC options and treating shards as closed.
- On re-enable, drop the old CDC log table first so the subsequent enable creates a new log table/UUID (new StreamArn).
- Update Alternator Streams tests and compatibility documentation for the new disable/re-enable behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| test/alternator/test_streams.py | Fixes stream-disable wait timing; adds new tests for post-disable readability and re-enable behavior. |
| docs/alternator/compatibility.md | Documents how disabled streams remain readable and how re-enable purges old data (differs from DynamoDB). |
| alternator/streams.cc | Lists disabled streams whose log tables still exist; describes disabled streams as closed; adjusts GetRecords behavior and CDC option handling on disable. |
| alternator/executor.hh | Extends add_stream_options() API to accept existing CDC options when disabling. |
| alternator/executor.cc | Passes existing CDC options on disable; on re-enable drops the old CDC log table before creating a new one. |
```cpp
if (!base->cdc_options().enabled()) {
    // Stream is disabled -- all shards are closed (#7239).
    // Don't return NextShardIterator.
} else if (shard.time < ts && ts < high_ts) {
    // The DynamoDB documentation states that when a shard is
    // closed, reading it until the end has NextShardIterator
    // "set to null". Our test test_streams_closed_read
```
In this end-of-shard decision logic, the subsequent else-branch (when the shard is still open) constructs a next_iter but then returns the old iter as NextShardIterator. That leaves next_iter unused (will fail under -Werror) and prevents advancing the iterator threshold, risking callers repeatedly querying the same empty range. Ensure the returned NextShardIterator actually uses the advanced iterator value.
I didn't understand what the AI meant here, but please investigate this.
```diff
         print('disabled stream on {}'.format(table.name))
         return
-    time.sleep(0.5)
+    time.sleep(0.1)
```
In disable_stream(), the polling loop now sleeps only 0.1s between ListStreams/DescribeStream calls. This can exceed DynamoDB’s documented DescribeStream rate limits (and will scale with the number of historic streams), making the tests more likely to be throttled/flaky. Consider restoring a slower poll interval (e.g., 0.5s) or adding backoff/jitter while still using time.time() for the timeout.
Suggested change:
```diff
-    time.sleep(0.1)
+    time.sleep(0.5)
```
I think this is fine.
I didn't realize that Copilot is right: DescribeStream indeed has a 10-per-second limit (https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html). But with a 0.1s sleep plus the unavoidable network lag, I think 0.1 is about right. Not to mention that if this request does get throttled, the SDK will just retry it, so no harm done. On Alternator, lower sleeps between retries can make tests faster.
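For reference, a bounded wall-clock poll with capped exponential backoff and jitter (the alternative Copilot hinted at) could be sketched like this; `poll_until` and its defaults are hypothetical, not names from this test file:

```python
import random
import time

def poll_until(check, timeout=60, base_sleep=0.1, max_sleep=0.5):
    # Wall-clock deadline (time.time(), not time.process_time()).
    deadline = time.time() + timeout
    sleep = base_sleep
    while time.time() < deadline:
        if check():
            return True
        # Jittered sleep avoids synchronized retries; the capped
        # exponential backoff bounds the request rate.
        time.sleep(sleep + random.uniform(0, sleep / 2))
        sleep = min(sleep * 2, max_sleep)
    return False
```

Usage would be something like `poll_until(lambda: stream_state(arn) == 'DISABLED', timeout=60)`, where `stream_state` is a hypothetical helper wrapping DescribeStream.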
```python
desc = dynamodbstreams.describe_stream(StreamArn=arn)['StreamDescription']
nrecords = 0
for shard in desc['Shards']:
    iter = dynamodbstreams.get_shard_iterator(StreamArn=arn,
        ShardId=shard['ShardId'], ShardIteratorType='TRIM_HORIZON')['ShardIterator']
    response = dynamodbstreams.get_records(ShardIterator=iter)
    if 'Records' in response:
        nrecords += len(response['Records'])
```
_count_stream_records() claims to count “all shards” and “total number of records”, but it only reads the first DescribeStream page and only a single GetRecords page per shard (ignoring NextShardIterator). This can undercount and make the new assertions unreliable. Consider iterating DescribeStream pages (LastEvaluatedShardId) and paging GetRecords until NextShardIterator is absent to actually consume the full stream (or rename/comment it to reflect what it really does).
Suggested change:
```python
shards = []
last_evaluated_shard_id = None
while True:
    kwargs = {'StreamArn': arn}
    if last_evaluated_shard_id:
        kwargs['ExclusiveStartShardId'] = last_evaluated_shard_id
    desc = dynamodbstreams.describe_stream(**kwargs)['StreamDescription']
    shards.extend(desc.get('Shards', []))
    last_evaluated_shard_id = desc.get('LastEvaluatedShardId')
    if not last_evaluated_shard_id:
        break
nrecords = 0
for shard in shards:
    shard_iter = dynamodbstreams.get_shard_iterator(StreamArn=arn,
        ShardId=shard['ShardId'], ShardIteratorType='TRIM_HORIZON')['ShardIterator']
    while shard_iter:
        response = dynamodbstreams.get_records(ShardIterator=shard_iter)
        nrecords += len(response.get('Records', []))
        shard_iter = response.get('NextShardIterator')
```
Copilot is right.
One option is to fix this function as copilot suggested. The other option, if you only care whether the count is zero or not, is to rename this function stream_is_empty() or something and return true or false. But even with stream_is_empty(), you should look at all shards, not just the first page.
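A sketch of that stream_is_empty() variant, assuming a client object with the three boto3 DynamoDB Streams methods: it pages through every DescribeStream page (LastEvaluatedShardId) and every GetRecords page (NextShardIterator), returning False as soon as any record is found. The function name and early-exit design are illustrative:

```python
def stream_is_empty(streams, arn):
    # `streams` is anything with describe_stream / get_shard_iterator /
    # get_records, e.g. a boto3 DynamoDB Streams client.
    last_shard = None
    while True:
        kwargs = {'StreamArn': arn}
        if last_shard:
            kwargs['ExclusiveStartShardId'] = last_shard
        desc = streams.describe_stream(**kwargs)['StreamDescription']
        for shard in desc.get('Shards', []):
            it = streams.get_shard_iterator(
                StreamArn=arn, ShardId=shard['ShardId'],
                ShardIteratorType='TRIM_HORIZON')['ShardIterator']
            # Page through the shard; a closed shard eventually stops
            # returning NextShardIterator.
            while it:
                resp = streams.get_records(ShardIterator=it)
                if resp.get('Records'):
                    return False  # found a record: not empty
                it = resp.get('NextShardIterator')
        last_shard = desc.get('LastEvaluatedShardId')
        if not last_shard:
            return True
```

The early exit makes the emptiness check cheap on non-empty streams while still scanning all shard pages when the stream really is empty.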
```python
table.update_item(Key={'p': p, 'c': random_string()},
    UpdateExpression='SET x = :val1', ExpressionAttributeValues={':val1': 5})

assert _count_stream_records(dynamodbstreams, arn1) > 0
```
The new tests assert that stream records are immediately visible right after update_item(). Alternator streams are asynchronous (and this file already uses retry loops/timeouts elsewhere), so these immediate “> 0” assertions are likely to be flaky on both AWS and Scylla. Consider adding a bounded retry loop (like fetch_and_compare_events()/other tests above) to wait until at least one record becomes visible before asserting on counts.
Suggested change:
```diff
-assert _count_stream_records(dynamodbstreams, arn1) > 0
+deadline = time.time() + 10
+count = 0
+while time.time() < deadline:
+    count = _count_stream_records(dynamodbstreams, arn1)
+    if count > 0:
+        break
+    time.sleep(0.1)
+assert count > 0
```
```cpp
    group0_guard = co_await mm.start_group0_operation();
    tab = get_table(p.local(), request);
    builder = schema_builder(tab);
    add_stream_options(*stream_specification, builder, p.local(), tab->cdc_options());
}
```
After dropping the old CDC log table, the code reloads tab/builder and re-applies add_stream_options(), but it does not re-run the follow-up steps that were applied to the original builder when enabling streams (e.g., validate_cdc_log_name_length() and, on tablet tables, defer_enabling_streams_block_tablet_merges()). Because builder was reconstructed, those adjustments are lost and re-enable behavior can diverge from first-time enablement on tablet tables. Consider re-applying the same post-processing after the second add_stream_options() call (or refactor to avoid duplicating the enable flow).
I think my proposal to just do "continue" will do the right thing for free.
nyh left a comment:
Thanks, looks mostly good, but I have a couple of comments and requests (and Copilot did too); some of them are actually potential bugs that should be fixed, others are less serious nitpicks.
```cpp
        format("alternator-executor: drop old CDC log for {}", tab->cf_name()));
    co_await mm.wait_for_schema_agreement(
        p.local().local_db(), db::timeout_clock::now() + 10s, nullptr);
    group0_guard = co_await mm.start_group0_operation();
```
I think this code is not concurrency-safe:
Between dropping the old guard (the std::move above) and starting a new one here, things might have changed. Some parallel operation might have re-enabled CDC or deleted the table entirely or I don't know what.
I think there's an easy solution: you just need to call "continue" here to retry from the beginning, and not try to assume that re-capturing the guard will work. In the "good" case, after the "continue" you will reach this code again, and it will notice that the log table no longer exists, so it will have nothing to do.
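The suggested retry-from-the-top pattern, sketched generically in Python (the real code is C++ coroutine logic in alternator/executor.cc; every name here is illustrative):

```python
def run_with_retries(read_state, needs_cleanup, cleanup, apply, max_attempts=10):
    # Optimistic-concurrency loop: re-read state under a fresh guard on
    # every attempt instead of assuming nothing changed while the guard
    # was dropped for an intermediate schema change.
    for _ in range(max_attempts):
        state = read_state()        # take guard, read current schema
        if needs_cleanup(state):
            cleanup(state)          # e.g. drop the old CDC log table
            continue                # guard was released: retry from scratch
        return apply(state)         # guard still held: apply the change
    raise RuntimeError('too many concurrent schema changes')
```

On the retry, the cleanup predicate is false (the old log table is gone), so the loop applies the change under a guard that was never released.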
```cpp
// When disabling, preserve the existing CDC options (preimage,
// postimage, ttl, etc.) so that DescribeStream can still report the
// correct StreamViewType on a disabled stream. Only flip enabled to
// false. See issue #7239.
```
Nitpick: I don't think we need to reference issue #7239 in every comment. The explanation that we need to keep the old options so DescribeStream can show them is good enough - nobody should go to issue #7239 for more information; you won't find additional options there.
```cpp
stream_arn arn(cf.schema(), cdc::get_base_table(db.real_database(), *cf.schema()));
// Show stream info when CDC is enabled, or when it was disabled but the
// log table still exists (the stream is usable for reads). See #7239.
auto db = sp.data_dictionary();
```
The comment says "when CDC is enabled or ...", but I don't see the "when CDC is enabled" (opts.enabled()) in the new code. I am worried that because of this, UpdateTable may not add the right stream info in the response because the log table doesn't exist yet (?). I don't remember if we have a test for this - what UpdateTable adding a stream actually returns.
```python
# Test that after disabling and re-enabling a stream on a table, the old
# stream data remains readable through the old ARN. In DynamoDB, it
# remains readable for 24 hours. In Scylla, it is currently purged upon
# re-enabling (issue #7239).
```
You will be closing #7239, so please either create a new issue about this last remaining problem and mention that issue (in the comment and xfail), or don't mention any issue. I think it's confusing to list a closed issue as an xfail reason.
```markdown
stream purges the old data immediately and produces a new StreamArn.
In contrast, DynamoDB keeps the old stream and its data readable for
24 hours through the old StreamArn even after re-enabling.
<https://github.com/scylladb/scylla/issues/7239>
```
You'll be closing this issue so please don't leave a link to it as a closed issue.
You should create a new issue for the last remaining point.
Very sadly, this new link will be to a JIRA issue that users who can see compatibility.md will not actually be able to see. What a colossal mistake it was to outlaw GitHub issues :-(
CC @dani-tweig - this was one of the reasons I didn't want to drop the public github issues.
```diff
     # them (the current one) may remain DISABLING for some time.
-    exp = time.process_time() + 60
-    while time.process_time() < exp:
+    exp = time.time() + 60
```
yikes. good catch. How did I ever write this code like this? :-( Thanks for fixing it.
In any case, in the working case the timeout will never be reached which is why we never noticed this problem.
```cpp
cdc::options opts = existing_cdc_opts;
opts.enabled(false);
opts.enable_requested(false);
opts.tablet_merge_blocked(false);
```
Please merge this fix with the first patch, that introduced this bug.
When an Alternator stream is disabled, the data should continue to be accessible so that consumers can finish reading. When the stream is later re-enabled, a new StreamArn is produced and only then the old data is purged.
On disable, the existing CDC options (including preimage and postimage) are preserved so that DescribeStream can still report StreamViewType. All stream APIs continue to work on the disabled stream, with all shards reported as closed (EndingSequenceNumber set). No new CDC records are written; existing data expires via TTL after 24 hours.
On re-enable, the old CDC log table is dropped as a separate Raft group0 schema change and a fresh one is created with a new UUID, giving a new StreamArn. This is Alternator-specific; CQL CDC keeps reusing the log table. Re-enabling is the only way to immediately purge old stream data.
Old stream data is removed immediately upon re-enable (a discrepancy with DynamoDB, which keeps it readable for 24 hours through the old StreamArn).
Tests updated to cover the new disable and re-enable behavior.
Fixes #7239
Fixes SCYLLADB-523