Skip to content

test/alternator: clean up write isolation default and add more tests for the different modes#24493

Merged
scylladb-promoter merged 2 commits intoscylladb:masterfrom
nyh:fix-24442
Jul 15, 2025
Merged

test/alternator: clean up write isolation default and add more tests for the different modes#24493
scylladb-promoter merged 2 commits intoscylladb:masterfrom
nyh:fix-24442

Conversation

@nyh
Copy link
Copy Markdown
Contributor

@nyh nyh commented Jun 12, 2025

In #24442 it was noticed that accidentally, for a year now, test.py and CI were running the Alternator functional tests (test/alternator) using one write isolation mode (only_rmw_uses_lwt) while the manual test/alternator/run used a different write isolation mode (always_use_lwt). There is no good reason for this discrepancy, so in the second patch of this 2-patch series we change test/alternator/run to use the write isolation mode that we've had in CI for the last year.

But then, discussion on #24442 started: Instead of picking one mode or the other, don't we need test both modes? In fact, all four modes?

The honest answer is that running all tests with all combinations of options is not practical - we'll find ourselves with an exponentially growing number of tests. What we really need to do is to run most tests that have nothing to do with write isolation modes on just one arbitrary write isolation mode like we're doing today. For example, numerous tests for the finer details of the ConditionExpression syntax will run on one mode. But then, have a separate test that verifies that one representative example of ConditionExpression (for example) works correctly on all four write isolation modes - rejected in forbid_rmw mode, allowed and behaves as expected on the other three. We had some tests like that in our test suite already, but the first patch in this series adds many more, making the test much more exhaustive and making it easier to review that we're really testing all four write isolation modes in every scenario that matters.

Fixes #24442

No need to backport this patch - it's just adding more tests and changing developer-only test behavior.

@nyh nyh requested review from ScyllaPiotr and xtrey June 12, 2025 10:37
@nyh nyh added the backport/none Backport is not required label Jun 12, 2025
@github-actions github-actions Bot added area/test Issues related to the testing system code and environment area/alternator Alternator related Issues labels Jun 12, 2025
@scylladb-promoter
Copy link
Copy Markdown
Contributor

🔴 CI State: FAILURE

✅ - Framework test
✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 alternator
🔹 alternator/test_condition_expression
🔹 alternator/test_update_expression
🔹 alternator/test_write_isolation
❌ - Unit Tests

Failed Tests (1/97479):

Build Details:

  • Duration: 9 hr 16 min
  • Builder: i-04dc601b0e025fd1f (m5d.16xlarge)

Copy link
Copy Markdown
Contributor

@ScyllaPiotr ScyllaPiotr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, very nice suite of tests!
I only raised a couple of concerns, I wonder if you will share them.

Comment thread test/alternator/run
cmd += [
'--alternator-address', ip,
'--alternator-enforce-authorization', '1',
'--alternator-write-isolation', 'always_use_lwt',
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We ultimately never finished the discussion in the issue thread. With this change, you decide to move all the tests to default from fully consistent to eventually consistent (right?) and therefore susceptible to data inconsistencies.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not exactly "fully consistent vs eventually consistent". I tried to explain the differences between these modes in the documentation, but to make a long story short "always_use_lwt" uses LWT for every write and guarantees that any concurrent writes are isolated - even if you mix RMW and pure writes (e.g., consider i=3 and doing "i=7" and "i++" in parallel. With correct isolation you end up with either i=7 or i=8. With incorrect isolation you can end up with i=4.). The mode "only_rmw_use_lwt" only isolates RMW, and assumes you know you don't mix pure writes and RMW to the same item - which is an assumption that DynamoDB doesn't make, but some users know is true.

Anyway, for the tests it doesn't really matter what mode to use. As I noted as an example, test_condition_expression.py has almost 2,000 lines of code, 53 test functions, to test a large number of syntax features and corner cases of condition expression. None of these tests care how we do write isolation. We just need one test (that I cleaned up in this patch) to check how condition expressions - any condition expressions - work with the different write isolation modes, and all the 53 other tests - they can run on any mode we choose. We don't need to choose the mode after very deliberate thinking.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not exactly "fully consistent vs eventually consistent".

That's clear now, thanks for the good example.

Anyway, for the tests it doesn't really matter what mode to use.

Well, the issue cover letter asks which isolation mode should we choose, so it must be relevant, at least - between the test suites (test.py vs test/alternator/run).

Comment thread test/alternator/test_write_isolation.py Outdated
#
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

# Tests for the the operations allowed under the four different write isolation
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repetition: the the

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed.

Comment thread test/alternator/test_write_isolation.py Outdated
# are marked scylla_only, so all tests that use one of them are skipped
# when not running against Scylla.
@pytest.fixture(scope='module')
def test_table_s_forbid(dynamodb, scylla_only):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am generally a fan of elaborate names. I propose to use the following suffixes for the fuxtures:

  • forbid_rmw
  • always_lwt
  • only_rmw_lwt
  • unsafe_rmw

I find the current names ambiguous.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps. I originally decided on these shorter names because:

  1. These names are used dozens of times throughout the test code, e.g., test_table_s_forbid.put_item(Item={'p': p, 'a': 2},, and the longer they are the harder it is to read those lines - I think.
  2. Alternator already supports any subset of these names, so instead of "forbid_rmw" you can use "forbid" or even "f" and this is even documented.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ad. 1 Well, new devs obviously have less context to base upon and for them longer (more verbose) names are easier to read (understand).

Ad 2. Well I'm not sure it was a good idea to allow these abbreviations, but it's pointless to exchange arguments about it now. However, the way one uses it now can be argued and a lone forbid is ambiguous, while f for me is blatantly misleading - for me it associates with force.

But it's obviously only an opinion.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll change in this test to the long names (I won't change the API). I want this PR to go in :-)

Copy link
Copy Markdown
Contributor Author

@nyh nyh Jun 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While at it, I'll remove the word "test" from the table fixture. While it made semantic sense when I called the original table in these tests "test_table" (it's a test table, after all!), it's really confusing for seasoned pytest developers who are used to test_something being the name of a test, not a fixture. I don't want to overhaul the entire test suite to rename test_table, but here we create four new tables, so I can give them a better name like table_s_forbid_rmw.

Or even shorter - just table_forbid_rmw. It's not like I have in this file tables with a different schema.

Comment thread test/alternator/test_write_isolation.py Outdated
# the forbid mode.

# PutItem & ConditionExpression:
def test_isolation_putitem_conditionexpression(test_table_s_forbid, test_table_s_permit):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name I like 👍😄

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this test, the data stays in the tables? Even more incentive to merge this test with the deleteitem one, that I write about further down.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, all throughout the Alternator test suite, tests share the same tables, and add data to a unique key, and often this data just stays in the table which doesn't matter at all - and is probably even faster than bothering to delete the items with additional comments. Instead, the entire is delete in one fell swoop when the test ends (for the tables like test_table_s shared by all tests) or in the case of the tables used only in this test file, in a "module"-scoped fixture, the table is deleted at the end of the file.

So there is no penalty for leaving a live item.
You're right that there's a small penalty when two tests need to create the same item (and each test creates the item in a different key) but this penality is absolutely tiny - less than a millisecond, and not worth optimizing from.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought about an influence on correctness of the tests, assuming the tests are run (at least statistically) in a specific order.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why Alternator tests are all written to assume they do not work on a brand new table, so should work on a unique key (currently "random", if you're super-super-super unlucky it might not be unique, see #9988) and not a fixed key like "x".
A minority of the tests that do need very unique or brand-new tables can create a new table in the test, but most don't. Depending on your disk setup (/tmp or current directory), this testing philosophy can save 0.01 to 0.1 seconds on each test, which adds up when you have a thousand tests.

Comment thread test/alternator/test_write_isolation.py Outdated
ExpressionAttributeValues={':oldval': 1})

# DeleteItem & ConditionExpression:
def test_isolation_deleteitem_conditionexpression(test_table_s_forbid, test_table_s_permit):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it would be beneficial (execution time-wise) to merge this test with the putitem one. There are many similar steps in the two tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I noted above perhaps we can save a couple of milliseconds this way, but I don't think it's worth it.

Comment thread test/alternator/test_write_isolation.py Outdated
assert ret['Attributes'] == {'p': p, 'a': 1}
assert 'Item' not in table.get_item(Key={'p': p}, ConsistentRead=True)
with pytest.raises(ClientError, match=rmw_forbidden):
test_table_s_forbid.delete_item(Key={'p': p}, ReturnValues='ALL_OLD')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You test with different data in the tables inside test_table_s_permit vs the test_table_s_forbid. It's not a problem, but an actual difference in the test conditions.

Comment thread test/alternator/test_write_isolation.py Outdated
AttributeUpdates={'a': {'Value': 1, 'Action': 'ADD'}})
assert table.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {'p': p, 'a': 2}
with pytest.raises(ClientError, match=rmw_forbidden):
test_table_s_forbid.update_item(Key={'p': p},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You test with different data in the tables inside test_table_s_permit vs the test_table_s_forbid. It's not a problem, but an actual difference in the test conditions.

Comment thread test/alternator/test_write_isolation.py Outdated
Expected={'a': {'ComparisonOperator': 'EQ', 'AttributeValueList': [1]}})
assert 'Item' not in table.get_item(Key={'p': p}, ConsistentRead=True)
with pytest.raises(ClientError, match=rmw_forbidden):
test_table_s_forbid.delete_item(Key={'p': p},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You test delete_item with different data in the tables inside test_table_s_permit vs the test_table_s_forbid. It's not a problem, but an actual difference in the test conditions.

Copy link
Copy Markdown
Contributor Author

@nyh nyh Jun 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I answered this above as well. If you feel this is a problem I can change both^H^H^H^Hall four places.

Comment thread test/alternator/test_write_isolation.py Outdated
Expected={'a': {'ComparisonOperator': 'EQ', 'AttributeValueList': [1]}})
assert table.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {'p': p, 'a': 2}
with pytest.raises(ClientError, match=rmw_forbidden):
test_table_s_forbid.update_item(Key={'p': p},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You test update_item with different data in the tables inside test_table_s_permit vs the test_table_s_forbid. It's not a problem, but an actual difference in the test conditions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Comment thread test/alternator/test_write_isolation.py Outdated
Expected={'a': {'ComparisonOperator': 'EQ', 'AttributeValueList': [1]}})
assert table.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == {'p': p, 'a': 2}
with pytest.raises(ClientError, match=rmw_forbidden):
test_table_s_forbid.put_item(Item={'p': p, 'a': 2},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You test put_item with different data in the tables inside test_table_s_permit vs the test_table_s_forbid. It's not a problem, but an actual difference in the test conditions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

@scylladb-promoter
Copy link
Copy Markdown
Contributor

🔴 CI State: FAILURE

✅ - Framework test
✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 alternator
🔹 alternator/test_condition_expression
🔹 alternator/test_update_expression
🔹 alternator/test_write_isolation
❌ - Unit Tests

Failed Tests (1/97657):

Build Details:

  • Duration: 9 hr 6 min
  • Builder: spider7.cloudius-systems.com

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Jun 17, 2025

* [test_random_failures[stop_after_starting_group0_service-add_new_udt].debug.2](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/17716/testReport/junit/cluster.random_failures/test_random_failures/Tests___Unit_Tests___test_random_failures_stop_after_starting_group0_service_add_new_udt__debug_2) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+test_random_failures%5Bstop_after_starting_group0_service-add_new_udt%5D.debug.2)

Known flakiness - this is #24513 - a recent regression that Avi had to revert. I'll rebase to get the newer master with the revert.

@scylladb-promoter
Copy link
Copy Markdown
Contributor

🟢 CI State: SUCCESS

✅ - Framework test
✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 alternator
🔹 alternator/test_condition_expression
🔹 alternator/test_update_expression
🔹 alternator/test_write_isolation
✅ - Unit Tests

Build Details:

  • Duration: 8 hr 46 min
  • Builder: spider1.cloudius-systems.com

nyh added 2 commits June 29, 2025 13:58
Before this patch, we had in test_condition_expression.py and
test_update_expression.py some rudimentary tests that the different
write isolation modes behave as expected. Basically, we wanted to test
that read-modify-write (RMW) operations are recognized and forbidden
in forbid_rmw mode, but work correctly in the three other modes.
We only check non-concurrent writes, so the actual write isolation is
NOT checked, just the correctness of non-concurrent writes.

However, since these tests were split across several files, and many
of the tests just ran other existing tests in different write isolation
modes, it was hard to see what exactly was being tested, and what was
missed. And indeed we missed checking some RMW operations, such as
requests with ReturnValues, requests with the older Expected or
AttributeUpdates (only the newer ConditionExpression and UpdateExpression
were tested), and ADD and DELETE operations in UpdateExpression.

So this patch replaces the existing partial tests with a new test file
test_write_isolation.py dedicated to testing all kinds of RMW operations
in one place, and how they don't work in forbid_rmw and do work in
the other modes. Writing all these tests in one place made it easier
to create a really exhaustive test of all the different operations and
optional parameters, and conversely - make sure that we don't test
*unnecessary* things such as different ConditionExpression expressions
(we already have 1800 lines of tests for ConditionExpression, and the
actual content of the condition is unrelated to write isolation modes).

Signed-off-by: Nadav Har'El <[email protected]>
Originally (since commit c3da9f2), Alternator's functional test suite
(test/alternator) ran "always_use_lwt" write isolation mode. The original
thinking was that we need to exercise this more difficult mode and it's
the most important mode. This mode was originally chosen in
test/alternator/run.

However, starting with commit 76a766c (a year ago), test.py no longer
runs test/alternator/run. Instead, it runs Scylla itself, and the options
for running Scylla appear in test/alternator/suite.yaml, and accidentally
the write isolation mode only_rmw_uses_lwt was chosen there.

The purpose of this patch is to reconcile this difference and use the
same mode in test.py (which CI is using) and test/alternator/run (which
is only used by some developers, during development).

I decided to have this patch change test/alternator/run to use
only_rmw_uses_lwt. As noted above, this is anyway how all Alternator
tests have been running in CI in the past year (through test.py).
Also, the mode only_rmw_uses_lwt makes running the Alternator test
suite slightly faster (52 seconds instead of 58 seconds, on my laptop)
which is always nice for developers.

This patch changes nothing for testing in CI - only manual runs through
test/alternator/run are affected.

Signed-off-by: Nadav Har'El <[email protected]>
@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Jun 29, 2025

Pushed a new version implementing most of @ScyllaPiotr 's suggestions about renaming the test tables etc, but didn't act on a bunch of places where you told me the test for "forbid" and "permit" modes are not identical (I know, and I think it's fine. the intention was to test all four modes - it's not necessary that the tests will be 100% identical).

@ScyllaPiotr you left "request changes" so please re-review.

@scylladb-promoter
Copy link
Copy Markdown
Contributor

🟢 CI State: SUCCESS

Mode Stage Status Node
Framework test godzilla
dev Build spider8.cloudius-systems.com
test.py spider8.cloudius-systems.com
Unit Tests Custom
🔹 alternator
🔹 alternator/test_condition_expression
🔹 alternator/test_update_expression
🔹 alternator/test_write_isolation
spider8.cloudius-systems.com
release Build spider5.cloudius-systems.com
test.py spider5.cloudius-systems.com
Unit Tests Custom
🔹 alternator
🔹 alternator/test_condition_expression
🔹 alternator/test_update_expression
🔹 alternator/test_write_isolation
spider5.cloudius-systems.com
debug Build spider3.cloudius-systems.com
test.py spider3.cloudius-systems.com
Unit Tests Custom
🔹 alternator
🔹 alternator/test_condition_expression
🔹 alternator/test_update_expression
🔹 alternator/test_write_isolation
spider3.cloudius-systems.com

Build Details:

  • Duration: 2 hr 14 min

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Jul 6, 2025

@scylladb/scylla-maint please consider merging: There is one reviewer approval and the CI passed.

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Jul 8, 2025

@scylladb/scylla-maint please consider merging: There is one reviewer approval and the CI passed.

ping

@scylladb-promoter scylladb-promoter merged commit 641a907 into scylladb:master Jul 15, 2025
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/alternator Alternator related Issues area/test Issues related to the testing system code and environment backport/none Backport is not required promoted-to-master

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test/alternator: improve write isolation default and testing

4 participants