Some cleanups in tests for tablets + MV by nyh · Pull Request #16440 · scylladb/scylladb

nyh · 2023-12-18T11:47:31Z

This small series improves two things in the multi-node tests for tablet supports in materialized views:

The test for Alternator LSI, which "sometimes" could reproduce the bug by creating 10-node cluster with a random tablet distribution, is replaced by a reliable 2-node cluster which controls the tablet distribution. The new test also confirms that tablets are actually enabled in Alternator (reviewers of the original test noted it would be easy to pass the test if tablets were accidentally not enabled... :-)).
Simplify the tablet lookup code in the test to not go through a "table id", and lookup the table's (or view's) name directly (requires a full-table of the tablets table, but that's entirely reasonable in a test).

The third patch in this series also fixes a comment typo discovered in a previous review.

nyh · 2023-12-18T11:48:48Z

I marked this PR a draft because it depends on two yet-unmerged PRs. Only the last two commits are new commits - the rest are stuff from those earlier PRs. I'll rebase and mark this PR ready to review when those earlier PRs go in.

eliransin

I mainly have questions about the proper place to deal with tablet information and
tablet manipulation functions. Going forward it seams that we will have increasing number of tests relying on such functionality.

eliransin · 2023-12-19T11:12:23Z

+# ensure that if you send tablet-migration commands to one server, you also
+# read the replicas information from the same server (it takes time for this
+# information to propagate to all servers).
+async def get_tablet_replicas(manager: ManagerClient, server: ServerInfo, keyspace_name: str, table_or_view_name: str, token: int):


Wouldn't this function will better fit in the manager itself?

This is exactly what I'm trying to avoid - random test developers coming up with what they think might be some useful general-purpose functions and putting those functions in some "global" place, and then when the next test developer comes along, they discover they need slight (or large) variations on these functions. You can see the result in dtest - a large number of functions that nobody understands and have dozens of options each.

Please note that this function also appears in test_tablets.py (written by @tgrabiec), but in a different form - it didn't support materialized views. So 1. Tomek also didn't put this in manager, 2. It demonstrates my point that a different test couldn't use the same function exactly, and had to modify it anyway.

Eventually, it will be nice to take some common useful functions and move them into a central place, but I don't want to do it too early. Also, this has absolutely no business being in the "manager". The "manager" is about managing a cluster of nodes - we don't need to stuff collections of utility functions into the manager, unless we want the result to be a huge monolithic mess.

Please note that this function also appears in test_tablets.py (written by @tgrabiec), but in a different form - it didn't support materialized views. So 1. Tomek also didn't put this in manager, 2. It demonstrates my point that a different test couldn't use the same function exactly, and had to modify it anyway.

I think this function is actually a counter example. get_tablet_replicas() is something which many tablet test cases will use, we should just make it work for all kinds of tables. The first usage which didn't take views into account won't mind.

Agree that the manager is not the best place.

In my most recent version (I'll push it in a few minutes), I strayed even further from your original implementation - I no longer need the extra get_table_id() and instead do a full-table scan to look for the right table without knowing its id. The result is much simpler, but I don't know if you'd like to use it or not. It's also six lines of code. It's not a library, it's not a 300-line implementation of some traditional data structure you learn in CS courses - it's an ad-hoc 6-line function. I don't want to decide right now if it should be shared with other tests or not. I don't have to decide it right now. If later these 6-line function is determined to be very useful, we can move it to a common place.

eliransin · 2023-12-19T11:16:08Z

+# verify that a test that attempted to enable tablets - and set up only
+# one tablet for the entire table - actually succeeded in doing that.
+async def assert_one_tablet(cql, keyspace_name, table_or_view_name):
+    rows = await cql.run_async(f"SELECT last_token, replicas FROM system.tablets where keyspace_name = '{keyspace_name}' and table_name = '{table_or_view_name}' ALLOW FILTERING")


I am tempted to also ask about this part if it doesn't belong in the manager, maybe as some api that returns tablet information optionally for a keyspace or keyspace,table combination
it sounds like a useful API for future tests as well.

Again, maybe it will be useful, maybe it won't.

Most importantly, as you said, if you wanted to move something into a library, it shouldn't be the specific assert_one_tablet() but rather something more general to enquire about tablets - basically just the one run_async line. If I were to move assert_one_tablet() into a library the result would be - again - dtest - a big mess where if the test fails you don't understand what it does because all it does is to call into utility functions you are not familiar with and are probably buggy themselves.

nyh · 2023-12-19T11:56:56Z

I mainly have questions about the proper place to deal with tablet information and tablet manipulation functions. Going forward it seams that we will have increasing number of tests relying on such functionality.

Beyond everything I wrote in my replies to your smaller comments, one thing that is important to remember is you can always share functions between different tests without moving these functions into the global "manager". You can, in increasing granularity:

Share convenience functions between different tests in the same file. This is what I did in this set of patches - e.g., pin_the_only_tablet() is used by both Alternator LSI and CQL LSI test.
Share convenience functions between different tests in different files, using import. I saw some topology tests do that.
Share convenience functions between different tests by importing a common file. E.g., we have "test.pylib.rest_client".
Share convenience functions between different tests by making them available in the "manager" that all MUST use.

I think that jumping directly to step 4 is a big mistake.

nyh · 2023-12-24T08:10:21Z

Rebased on top of the latest master which already has my previous tablets+MV PRs, so now all the commits in this PR are relevant, and it is ready for review.
@kbr-scylla this replaces the 10-node test that bothered you (because it was slow on arm+debug) by a nicer 2-node test.

The test test_tablet_alternator_lsi_consistency, checking that Alternator LSI allow strongly-consistent reads even with tablets, used a large cluster (10 nodes), to improve the chance of reaching an "unlucky" tablet placement - and even then only failed in about half the runs without the code fixed. In this patch, we rewrite the test using a much more reliable approach: We start only two nodes, and force the base's tablet onto one node, and the view's table onto the second node. This ensures with 100% certainty that the view update is remote, and the new test fails every single time before the code fix (I reverted the fix to verify) - and passes after it. The new test is not only more reliable, it's also significantly faster because it doesn't need to start a 10-node cluster. We can also remove the tag that excluded this test from debug build mode tests because the 10-node boot was too slow. Signed-off-by: Nadav Har'El <[email protected]>

The tests looked up a table's tablets in an elaborate two-stage search - first find the table's "id", and then look up this id in the list of tablets. It is much simpler to just look up the table's name directly in the list of tablets - although this name is not a key, an ALLOW FILTERING search is good enough for a test. As a bonus, with the new technique we don't care if the given name is the name of a table or a view, further simplifying the test. This is just a test code cleanup - there is no functional change in the test. Signed-off-by: Nadav Har'El <[email protected]>

Signed-off-by: Nadav Har'El <[email protected]>

scylladb-promoter · 2023-12-24T13:18:04Z

🔴 CI State: ABORTED

✅ - Build
✅ - Unit Tests
❌ - dtest

Build Details:

Duration: 5 hr 7 min
Builder: spider1.cloudius-systems.com

nyh · 2023-12-24T13:47:00Z

🔴 CI State: ABORTED

✅ - Build ✅ - Unit Tests ❌ - dtest

Build Details:
* Duration: 5 hr 7 min

* Builder: spider1.cloudius-systems.com

I can't figure out what failed. In any case, dtest cannot fail on this PR, because the only code change is a comment typo... So I restarted CI.

avikivity · 2023-12-24T14:24:59Z

🔴 CI State: ABORTED

✅ - Build ✅ - Unit Tests ❌ - dtest

Build Details:
* Duration: 5 hr 7 min

* Builder: spider1.cloudius-systems.com
I can't figure out what failed. In any case, dtest cannot fail on this PR, because the only code change is a comment typo... So I restarted CI.

A flaky test can fail even on no changes.

It's important to figure out what failed, or we'll continue to be plagues by flaky tests.

nyh · 2023-12-24T14:34:30Z

It's important to figure out what failed, or we'll continue to be plagues by flaky tests.

I couldn't find any specific test failing. I didn't understand the log and no specific test seems to be mentioned. I am guessing this was some infrastructure flakyness, spot instance killing, or I don't know what, but don't know how to prove it.

scylladb-promoter · 2023-12-24T17:50:46Z

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests
✅ - dtest

Build Details:

Duration: 4 hr 4 min
Builder: spider2.cloudius-systems.com

tchaikov · 2023-12-25T07:49:54Z

It's important to figure out what failed, or we'll continue to be plagues by flaky tests.

I couldn't find any specific test failing. I didn't understand the log and no specific test seems to be mentioned. I am guessing this was some infrastructure flakyness, spot instance killing, or I don't know what, but don't know how to prove it.

17:19:47  [conditionalRetry] Tries left 51
...
17:20:04  Waiting for next available executor on ‘[ec2-strong-dtest-asg-spot-releng](https://jenkins.scylladb.com/label/ec2-strong-dtest-asg-spot-releng/)’
17:20:04  Still waiting to schedule task
17:20:04  Waiting for next available executor on ‘[ec2-strong-dtest-asg-spot-releng](https://jenkins.scylladb.com/label/ec2-strong-dtest-asg-spot-releng/)’
17:20:04  Still waiting to schedule task
17:20:04
...
Waiting for next available executor on ‘[ec2-strong-dtest-asg-spot-releng](https://jenkins.scylladb.com/label/ec2-strong-dtest-asg-spot-releng/)’
21:17:09  Cancelling nested steps due to timeout

the job waited 4 hours for an executor with the given label. guess all of the nodes (https://jenkins.scylladb.com/label/ec2-strong-dtest-asg-spot-releng/) were occupied in that 4 hours?

mykaul · 2023-12-25T09:22:31Z

@benipeled ^^^ ?

benipeled · 2023-12-25T09:29:36Z

AWS quota-limit reported and handled on https://github.com/scylladb/scylla-pkg/issues/3758

nyh · 2023-12-26T12:23:56Z

Review ping.
Note that despite the discussion above, the CI did pass.

nyh · 2023-12-27T08:55:40Z

Another ping. @scylladb/scylla-maint

avikivity · 2023-12-27T18:13:44Z

It's important to figure out what failed, or we'll continue to be plagues by flaky tests.

I couldn't find any specific test failing. I didn't understand the log and no specific test seems to be mentioned. I am guessing this was some infrastructure flakyness, spot instance killing, or I don't know what, but don't know how to prove it.

The trick is to ask for help rather than ignore the problem.

avikivity · 2023-12-27T18:17:12Z

+
+       We use a cluster of just two nodes and RF=1, and control the tablets
+       so all base tablets will be in node 0 and all view tablets will be
+       in node 1, to ensure that the view update is remote and therefore


This won't work once we fix local indexes, as local indexes will be co-located with the table even when using tablets.

The test is is currently written will force the view's tablets to be moved to a different node, so if "co-location" will forbid this, the test will fail clearly. If it won't be forbidden to explicitly move the tablets, the test will continue to work, but arguably won't really be an interesting regression test because it tests a non-realistic scenario. I'm not sure exactly what to do. I guess that when we do this "co-location" thing we'll need to add more tests.

It will be forbidden.

nyh requested review from eliransin, nuivall, tgrabiec and wmitros December 18, 2023 11:47

eliransin reviewed Dec 19, 2023

View reviewed changes

This was referenced Dec 19, 2023

test_mv_tablets.test_tablet_mv_simple_6node failed during concurrent bootstrap of 6 nodes #16423

Closed

Add support for tablets in Alternator #16353

Merged

nyh force-pushed the test-improve-lsi branch from a24c163 to efd9995 Compare December 24, 2023 08:07

nyh marked this pull request as ready for review December 24, 2023 08:07

nyh added 3 commits December 24, 2023 10:11

materialized views: fix typo in comment

6640278

Signed-off-by: Nadav Har'El <[email protected]>

nyh force-pushed the test-improve-lsi branch from efd9995 to 6640278 Compare December 24, 2023 08:13

avikivity reviewed Dec 27, 2023

View reviewed changes

scylladb-promoter merged commit 6394854 into scylladb:master Dec 27, 2023

Conversation

nyh commented Dec 18, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nyh commented Dec 18, 2023

Uh oh!

eliransin left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nyh commented Dec 19, 2023

Uh oh!

nyh commented Dec 24, 2023

Uh oh!

scylladb-promoter commented Dec 24, 2023

🔴 CI State: ABORTED

Build Details:

Uh oh!

nyh commented Dec 24, 2023

🔴 CI State: ABORTED

Build Details:

Uh oh!

avikivity commented Dec 24, 2023

🔴 CI State: ABORTED

Build Details:

Uh oh!

nyh commented Dec 24, 2023

Uh oh!

scylladb-promoter commented Dec 24, 2023

🟢 CI State: SUCCESS

Build Details:

Uh oh!

tchaikov commented Dec 25, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mykaul commented Dec 25, 2023

Uh oh!

benipeled commented Dec 25, 2023

Uh oh!

nyh commented Dec 26, 2023

Uh oh!

nyh commented Dec 27, 2023

Uh oh!

avikivity commented Dec 27, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

nyh commented Dec 18, 2023 •

edited

Loading

tchaikov commented Dec 25, 2023 •

edited

Loading