test/alternator: clean up write isolation default and add more tests for the different modes #24493
Conversation
🔴 CI State: FAILURE
✅ Framework test
Failed Tests (1/97479)
Build Details:
ScyllaPiotr left a comment
Wow, very nice suite of tests!
I only raised a couple of concerns; I wonder if you will share them.
cmd += [
    '--alternator-address', ip,
    '--alternator-enforce-authorization', '1',
    '--alternator-write-isolation', 'always_use_lwt',
We ultimately never finished the discussion in the issue thread. With this change, you decide to move the default for all the tests from fully consistent to eventually consistent (right?), and therefore make them susceptible to data inconsistencies.
It's not exactly "fully consistent vs eventually consistent". I tried to explain the differences between these modes in the documentation, but to make a long story short: "always_use_lwt" uses LWT for every write and guarantees that any concurrent writes are isolated - even if you mix RMW and pure writes. For example, consider i=3 and doing "i=7" and "i++" in parallel: with correct isolation you end up with either i=7 or i=8; with incorrect isolation you can end up with i=4. The mode "only_rmw_uses_lwt" only isolates RMW, and assumes you know you don't mix pure writes and RMW to the same item - an assumption that DynamoDB doesn't make, but some users know is true.
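To make that example concrete, here is a minimal sketch (Python with boto3) of the two writers; the endpoint, credentials, table and attribute names are all illustrative assumptions, not taken from the test suite:

```python
import boto3

# Illustrative local Alternator endpoint and dummy credentials (assumptions).
dynamodb = boto3.resource('dynamodb', endpoint_url='http://localhost:8000',
                          region_name='us-east-1',
                          aws_access_key_id='alternator',
                          aws_secret_access_key='secret_pass')
table = dynamodb.Table('tab')

# Initial state: i = 3
table.put_item(Item={'p': 'x', 'i': 3})

# Writer A: a pure (blind) write, "i = 7". It does not read the old value.
table.put_item(Item={'p': 'x', 'i': 7})

# Writer B: "i++", a read-modify-write (RMW) - the new value depends on the old one.
table.update_item(Key={'p': 'x'},
                  UpdateExpression='SET #i = #i + :one',
                  ExpressionAttributeNames={'#i': 'i'},
                  ExpressionAttributeValues={':one': 1})

# Run sequentially as above, the result is i = 8. If A and B run concurrently,
# correct isolation (always_use_lwt) allows only i = 7 or i = 8; a mode that does
# not isolate the pure write against the RMW can lose A's write and leave i = 4.
```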
Anyway, for the tests it doesn't really matter what mode to use. As I noted as an example, test_condition_expression.py has almost 2,000 lines of code and 53 test functions, to test a large number of syntax features and corner cases of condition expressions. None of these tests care how we do write isolation. We just need one test (which I cleaned up in this patch) to check how condition expressions - any condition expressions - work with the different write isolation modes, and all the 53 other tests can run on any mode we choose. We don't need to choose the mode after very deliberate thinking.
> It's not exactly "fully consistent vs eventually consistent".
That's clear now, thanks for the good example.
> Anyway, for the tests it doesn't really matter what mode to use.
Well, the issue cover letter asks which isolation mode we should choose, so it must be relevant, at least between the test suites (test.py vs test/alternator/run).
#
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
...
# Tests for the operations allowed under the four different write isolation
...
# are marked scylla_only, so all tests that use one of them are skipped
# when not running against Scylla.
@pytest.fixture(scope='module')
def test_table_s_forbid(dynamodb, scylla_only):
I am generally a fan of elaborate names. I propose to use the following suffixes for the fixtures:
- forbid_rmw
- always_lwt
- only_rmw_lwt
- unsafe_rmw
I find the current names ambiguous.
Perhaps. I originally decided on these shorter names because:
1. These names are used dozens of times throughout the test code, e.g., test_table_s_forbid.put_item(Item={'p': p, 'a': 2}), and the longer they are, the harder it is to read those lines - I think.
2. Alternator already supports any subset of these names, so instead of "forbid_rmw" you can use "forbid" or even "f", and this is even documented.
Ad 1. Well, new devs obviously have less context to base upon, and for them longer (more verbose) names are easier to read (understand).
Ad 2. Well, I'm not sure it was a good idea to allow these abbreviations, but it's pointless to exchange arguments about it now. However, the way one uses them now can be argued: a lone "forbid" is ambiguous, while "f" is for me blatantly misleading - it associates with "force".
But it's obviously only an opinion.
Ok, I'll change in this test to the long names (I won't change the API). I want this PR to go in :-)
While at it, I'll remove the word "test" from the table fixtures' names. While it made semantic sense when I called the original table in these tests "test_table" (it's a test table, after all!), it's really confusing for seasoned pytest developers, who are used to test_something being the name of a test, not a fixture. I don't want to overhaul the entire test suite to rename test_table, but here we create four new tables, so I can give them a better name like table_s_forbid_rmw.
Or even shorter - just table_forbid_rmw. It's not like this file has tables with a different schema.
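For illustration, a module-scoped fixture along those lines might look like the sketch below. It assumes the suite's new_test_table() helper and the 'system:write_isolation' table tag for selecting a per-table mode; treat the helper, the tag key, and the schema as assumptions rather than the exact code in this PR:

```python
import pytest
from util import new_test_table  # helper from the Alternator test suite (assumed import path)

@pytest.fixture(scope='module')
def table_forbid_rmw(dynamodb, scylla_only):
    # Create a table whose write isolation mode is forced to forbid_rmw via a
    # table tag, and drop it when this module's tests are done.
    with new_test_table(dynamodb,
            KeySchema=[{'AttributeName': 'p', 'KeyType': 'HASH'}],
            AttributeDefinitions=[{'AttributeName': 'p', 'AttributeType': 'S'}],
            Tags=[{'Key': 'system:write_isolation', 'Value': 'forbid_rmw'}]) as table:
        yield table
```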
# the forbid mode.

# PutItem & ConditionExpression:
def test_isolation_putitem_conditionexpression(test_table_s_forbid, test_table_s_permit):
After this test, the data stays in the tables? Even more incentive to merge this test with the deleteitem one, that I write about further down.
Yes, throughout the Alternator test suite, tests share the same tables and add data under a unique key, and often this data just stays in the table, which doesn't matter at all - and is probably even faster than bothering to delete the items with additional commands. Instead, the entire table is deleted in one fell swoop when the whole test run ends (for tables like test_table_s shared by all tests) or, for the tables used only in this test file, in a "module"-scoped fixture, so the table is deleted at the end of the file.
So there is no penalty for leaving a live item.
You're right that there's a small penalty when two tests need to create the same item (and each test creates the item under a different key), but this penalty is absolutely tiny - less than a millisecond - and not worth optimizing for.
I had thought about the influence on the correctness of the tests, assuming the tests are run (at least statistically) in a specific order.
That's why the Alternator tests are all written to assume they do not work on a brand-new table, so they should work on a unique key (currently "random"; if you're super-super-super unlucky it might not be unique, see #9988) and not a fixed key like "x".
A minority of the tests that really do need a unique or brand-new table can create a new table in the test, but most don't. Depending on your disk setup (/tmp or current directory), this testing philosophy can save 0.01 to 0.1 seconds on each test, which adds up when you have a thousand tests.
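For readers unfamiliar with that convention, here is a short sketch of what it looks like in practice; random_string() is the suite's usual unique-key helper, and its exact import path is assumed here:

```python
from util import random_string  # unique-key helper used across the suite (assumed import path)

def test_example(test_table_s):
    # Work under a fresh, statistically-unique partition key, so the test does not
    # care what data earlier tests left behind in the shared table, and there is
    # no need to create a brand-new table or clean up afterwards.
    p = random_string()
    test_table_s.put_item(Item={'p': p, 'a': 1})
    item = test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']
    assert item == {'p': p, 'a': 1}
```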
    ExpressionAttributeValues={':oldval': 1})

# DeleteItem & ConditionExpression:
def test_isolation_deleteitem_conditionexpression(test_table_s_forbid, test_table_s_permit):
Perhaps it would be beneficial (execution time-wise) to merge this test with the putitem one. There are many similar steps in the two tests.
As I noted above perhaps we can save a couple of milliseconds this way, but I don't think it's worth it.
assert ret['Attributes'] == {'p': p, 'a': 1}
assert 'Item' not in table.get_item(Key={'p': p}, ConsistentRead=True)
with pytest.raises(ClientError, match=rmw_forbidden):
    test_table_s_forbid.delete_item(Key={'p': p}, ReturnValues='ALL_OLD')
You test with different data in the tables inside test_table_s_permit vs the test_table_s_forbid. It's not a problem, but an actual difference in the test conditions.
    AttributeUpdates={'a': {'Value': 1, 'Action': 'ADD'}})
assert table.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {'p': p, 'a': 2}
with pytest.raises(ClientError, match=rmw_forbidden):
    test_table_s_forbid.update_item(Key={'p': p},
You test with different data in the tables inside test_table_s_permit vs the test_table_s_forbid. It's not a problem, but an actual difference in the test conditions.
    Expected={'a': {'ComparisonOperator': 'EQ', 'AttributeValueList': [1]}})
assert 'Item' not in table.get_item(Key={'p': p}, ConsistentRead=True)
with pytest.raises(ClientError, match=rmw_forbidden):
    test_table_s_forbid.delete_item(Key={'p': p},
You test delete_item with different data in the tables inside test_table_s_permit vs the test_table_s_forbid. It's not a problem, but an actual difference in the test conditions.
Yes, I answered this above as well. If you feel this is a problem I can change both^H^H^H^Hall four places.
    Expected={'a': {'ComparisonOperator': 'EQ', 'AttributeValueList': [1]}})
assert table.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {'p': p, 'a': 2}
with pytest.raises(ClientError, match=rmw_forbidden):
    test_table_s_forbid.update_item(Key={'p': p},
You test update_item with different data in the tables inside test_table_s_permit vs the test_table_s_forbid. It's not a problem, but an actual difference in the test conditions.
    Expected={'a': {'ComparisonOperator': 'EQ', 'AttributeValueList': [1]}})
assert table.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {'p': p, 'a': 2}
with pytest.raises(ClientError, match=rmw_forbidden):
    test_table_s_forbid.put_item(Item={'p': p, 'a': 2},
You test put_item with different data in the tables inside test_table_s_permit vs the test_table_s_forbid. It's not a problem, but an actual difference in the test conditions.
🔴 CI State: FAILURE
✅ Framework test
Failed Tests (1/97657)
Build Details:
Known flakiness - this is #24513 - a recent regression that Avi had to revert. I'll rebase to get the newer master with the revert.
🟢 CI State: SUCCESS
✅ Framework test
Build Details:
Before this patch, we had in test_condition_expression.py and test_update_expression.py some rudimentary tests that the different write isolation modes behave as expected. Basically, we wanted to test that read-modify-write (RMW) operations are recognized and forbidden in forbid_rmw mode, but work correctly in the three other modes. We only check non-concurrent writes, so the actual write isolation is NOT checked, just the correctness of non-concurrent writes.

However, since these tests were split across several files, and many of the tests just ran other existing tests in different write isolation modes, it was hard to see what exactly was being tested, and what was missed. And indeed we missed checking some RMW operations, such as requests with ReturnValues, requests with the older Expected or AttributeUpdates (only the newer ConditionExpression and UpdateExpression were tested), and ADD and DELETE operations in UpdateExpression.

So this patch replaces the existing partial tests with a new test file test_write_isolation.py dedicated to testing all kinds of RMW operations in one place, and how they don't work in forbid_rmw and do work in the other modes. Writing all these tests in one place made it easier to create a really exhaustive test of all the different operations and optional parameters, and conversely - to make sure that we don't test *unnecessary* things such as different ConditionExpression expressions (we already have 1800 lines of tests for ConditionExpression, and the actual content of the condition is unrelated to write isolation modes).

Signed-off-by: Nadav Har'El <[email protected]>
Originally (since commit c3da9f2), Alternator's functional test suite (test/alternator) ran with the "always_use_lwt" write isolation mode. The original thinking was that we need to exercise this more difficult mode and that it's the most important mode. This mode was originally chosen in test/alternator/run.

However, starting with commit 76a766c (a year ago), test.py no longer runs test/alternator/run. Instead, it runs Scylla itself, and the options for running Scylla appear in test/alternator/suite.yaml, where the write isolation mode only_rmw_uses_lwt was accidentally chosen.

The purpose of this patch is to reconcile this difference and use the same mode in test.py (which CI is using) and test/alternator/run (which is only used by some developers, during development). I decided to have this patch change test/alternator/run to use only_rmw_uses_lwt. As noted above, this is anyway how all Alternator tests have been running in CI in the past year (through test.py). Also, the mode only_rmw_uses_lwt makes running the Alternator test suite slightly faster (52 seconds instead of 58 seconds, on my laptop), which is always nice for developers.

This patch changes nothing for testing in CI - only manual runs through test/alternator/run are affected.

Signed-off-by: Nadav Har'El <[email protected]>
Pushed a new version implementing most of @ScyllaPiotr's suggestions about renaming the test tables etc., but didn't act on a bunch of places where you told me the tests for "forbid" and "permit" modes are not identical (I know, and I think it's fine; the intention was to test all four modes - it's not necessary that the tests be 100% identical). @ScyllaPiotr you left "request changes", so please re-review.
🟢 CI State: SUCCESS
Build Details:
@scylladb/scylla-maint please consider merging: there is one reviewer approval and the CI passed.
ping
In #24442 it was noticed that accidentally, for a year now, test.py and CI were running the Alternator functional tests (test/alternator) using one write isolation mode (only_rmw_uses_lwt) while the manual test/alternator/run used a different write isolation mode (always_use_lwt). There is no good reason for this discrepancy, so in the second patch of this 2-patch series we change test/alternator/run to use the write isolation mode that we've had in CI for the last year.

But then, a discussion on #24442 started: instead of picking one mode or the other, don't we need to test both modes? In fact, all four modes?
The honest answer is that running all tests with all combinations of options is not practical - we'll find ourselves with an exponentially growing number of tests. What we really need to do is to run most tests that have nothing to do with write isolation modes on just one arbitrary write isolation mode like we're doing today. For example, numerous tests for the finer details of the ConditionExpression syntax will run on one mode. But then, have a separate test that verifies that one representative example of ConditionExpression (for example) works correctly on all four write isolation modes - rejected in forbid_rmw mode, allowed and behaves as expected on the other three. We had some tests like that in our test suite already, but the first patch in this series adds many more, making the test much more exhaustive and making it easier to review that we're really testing all four write isolation modes in every scenario that matters.
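As a rough sketch of that idea, one representative RMW operation can be checked against tables configured with each of the four modes. The fixture names below follow the renaming discussed in the review, and the rmw_forbidden error pattern and random_string import are assumptions, not the exact code in this PR:

```python
import pytest
from botocore.exceptions import ClientError
from util import random_string  # assumed helper for unique partition keys

rmw_forbidden = 'ValidationException'  # assumed error pattern for a forbidden RMW

def test_conditional_put_all_modes(table_forbid_rmw, table_always_lwt,
                                   table_only_rmw_lwt, table_unsafe_rmw):
    # One representative RMW (a conditional PutItem) works in the three modes
    # that allow read-modify-write...
    for table in [table_always_lwt, table_only_rmw_lwt, table_unsafe_rmw]:
        p = random_string()
        table.put_item(Item={'p': p, 'a': 1})
        table.put_item(Item={'p': p, 'a': 2},
                       ConditionExpression='a = :old',
                       ExpressionAttributeValues={':old': 1})
        assert table.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {'p': p, 'a': 2}
    # ...and is rejected outright by a table in forbid_rmw mode.
    p = random_string()
    with pytest.raises(ClientError, match=rmw_forbidden):
        table_forbid_rmw.put_item(Item={'p': p, 'a': 2},
                                  ConditionExpression='a = :old',
                                  ExpressionAttributeValues={':old': 1})
```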
Fixes #24442
No need to backport this patch - it's just adding more tests and changing developer-only test behavior.