Skip to content

Some cleanups in tests for tablets + MV #16440

Merged
scylladb-promoter merged 3 commits intoscylladb:masterfrom
nyh:test-improve-lsi
Dec 27, 2023
Merged

Some cleanups in tests for tablets + MV #16440
scylladb-promoter merged 3 commits intoscylladb:masterfrom
nyh:test-improve-lsi

Conversation

@nyh
Copy link
Copy Markdown
Contributor

@nyh nyh commented Dec 18, 2023

This small series improves two things in the multi-node tests for tablet supports in materialized views:

  1. The test for Alternator LSI, which "sometimes" could reproduce the bug by creating 10-node cluster with a random tablet distribution, is replaced by a reliable 2-node cluster which controls the tablet distribution. The new test also confirms that tablets are actually enabled in Alternator (reviewers of the original test noted it would be easy to pass the test if tablets were accidentally not enabled... :-)).
  2. Simplify the tablet lookup code in the test to not go through a "table id", and lookup the table's (or view's) name directly (requires a full-table of the tablets table, but that's entirely reasonable in a test).

The third patch in this series also fixes a comment typo discovered in a previous review.

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 18, 2023

I marked this PR a draft because it depends on two yet-unmerged PRs. Only the last two commits are new commits - the rest are stuff from those earlier PRs. I'll rebase and mark this PR ready to review when those earlier PRs go in.

Copy link
Copy Markdown

@eliransin eliransin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mainly have questions about the proper place to deal with tablet information and
tablet manipulation functions. Going forward it seams that we will have increasing number of tests relying on such functionality.

# ensure that if you send tablet-migration commands to one server, you also
# read the replicas information from the same server (it takes time for this
# information to propagate to all servers).
async def get_tablet_replicas(manager: ManagerClient, server: ServerInfo, keyspace_name: str, table_or_view_name: str, token: int):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this function will better fit in the manager itself?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exactly what I'm trying to avoid - random test developers coming up with what they think might be some useful general-purpose functions and putting those functions in some "global" place, and then when the next test developer comes along, they discover they need slight (or large) variations on these functions. You can see the result in dtest - a large number of functions that nobody understands and have dozens of options each.

Please note that this function also appears in test_tablets.py (written by @tgrabiec), but in a different form - it didn't support materialized views. So 1. Tomek also didn't put this in manager, 2. It demonstrates my point that a different test couldn't use the same function exactly, and had to modify it anyway.

Eventually, it will be nice to take some common useful functions and move them into a central place, but I don't want to do it too early. Also, this has absolutely no business being in the "manager". The "manager" is about managing a cluster of nodes - we don't need to stuff collections of utility functions into the manager, unless we want the result to be a huge monolithic mess.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that this function also appears in test_tablets.py (written by @tgrabiec), but in a different form - it didn't support materialized views. So 1. Tomek also didn't put this in manager, 2. It demonstrates my point that a different test couldn't use the same function exactly, and had to modify it anyway.

I think this function is actually a counter example. get_tablet_replicas() is something which many tablet test cases will use, we should just make it work for all kinds of tables. The first usage which didn't take views into account won't mind.

Agree that the manager is not the best place.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my most recent version (I'll push it in a few minutes), I strayed even further from your original implementation - I no longer need the extra get_table_id() and instead do a full-table scan to look for the right table without knowing its id. The result is much simpler, but I don't know if you'd like to use it or not. It's also six lines of code. It's not a library, it's not a 300-line implementation of some traditional data structure you learn in CS courses - it's an ad-hoc 6-line function. I don't want to decide right now if it should be shared with other tests or not. I don't have to decide it right now. If later these 6-line function is determined to be very useful, we can move it to a common place.

# verify that a test that attempted to enable tablets - and set up only
# one tablet for the entire table - actually succeeded in doing that.
async def assert_one_tablet(cql, keyspace_name, table_or_view_name):
rows = await cql.run_async(f"SELECT last_token, replicas FROM system.tablets where keyspace_name = '{keyspace_name}' and table_name = '{table_or_view_name}' ALLOW FILTERING")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am tempted to also ask about this part if it doesn't belong in the manager, maybe as some api that returns tablet information optionally for a keyspace or keyspace,table combination
it sounds like a useful API for future tests as well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, maybe it will be useful, maybe it won't.

Most importantly, as you said, if you wanted to move something into a library, it shouldn't be the specific assert_one_tablet() but rather something more general to enquire about tablets - basically just the one run_async line. If I were to move assert_one_tablet() into a library the result would be - again - dtest - a big mess where if the test fails you don't understand what it does because all it does is to call into utility functions you are not familiar with and are probably buggy themselves.

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 19, 2023

I mainly have questions about the proper place to deal with tablet information and tablet manipulation functions. Going forward it seams that we will have increasing number of tests relying on such functionality.

Beyond everything I wrote in my replies to your smaller comments, one thing that is important to remember is you can always share functions between different tests without moving these functions into the global "manager". You can, in increasing granularity:

  1. Share convenience functions between different tests in the same file. This is what I did in this set of patches - e.g., pin_the_only_tablet() is used by both Alternator LSI and CQL LSI test.
  2. Share convenience functions between different tests in different files, using import. I saw some topology tests do that.
  3. Share convenience functions between different tests by importing a common file. E.g., we have "test.pylib.rest_client".
  4. Share convenience functions between different tests by making them available in the "manager" that all MUST use.

I think that jumping directly to step 4 is a big mistake.

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 24, 2023

Rebased on top of the latest master which already has my previous tablets+MV PRs, so now all the commits in this PR are relevant, and it is ready for review.
@kbr-scylla this replaces the 10-node test that bothered you (because it was slow on arm+debug) by a nicer 2-node test.

nyh added 3 commits December 24, 2023 10:11
The test test_tablet_alternator_lsi_consistency, checking that Alternator
LSI allow strongly-consistent reads even with tablets, used a large
cluster (10 nodes), to improve the chance of reaching an "unlucky" tablet
placement - and even then only failed in about half the runs without
the code fixed.

In this patch, we rewrite the test using a much more reliable approach:
We start only two nodes, and force the base's tablet onto one node, and
the view's table onto the second node. This ensures with 100% certainty
that the view update is remote, and the new test fails every single time
before the code fix (I reverted the fix to verify) - and passes after it.

The new test is not only more reliable, it's also significantly faster
because it doesn't need to start a 10-node cluster.

We can also remove the tag that excluded this test from debug build
mode tests because the 10-node boot was too slow.

Signed-off-by: Nadav Har'El <[email protected]>
The tests looked up a table's tablets in an elaborate two-stage search -
first find the table's "id", and then look up this id in the list of
tablets. It is much simpler to just look up the table's name directly
in the list of tablets - although this name is not a key, an ALLOW
FILTERING search is good enough for a test.

As a bonus, with the new technique we don't care if the given name
is the name of a table or a view, further simplifying the test.

This is just a test code cleanup - there is no functional change in
the test.

Signed-off-by: Nadav Har'El <[email protected]>
@nyh nyh force-pushed the test-improve-lsi branch from efd9995 to 6640278 Compare December 24, 2023 08:13
@scylladb-promoter
Copy link
Copy Markdown
Contributor

🔴 CI State: ABORTED

✅ - Build
✅ - Unit Tests
❌ - dtest

Build Details:

  • Duration: 5 hr 7 min
  • Builder: spider1.cloudius-systems.com

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 24, 2023

🔴 CI State: ABORTED

✅ - Build ✅ - Unit Tests ❌ - dtest

Build Details:

* Duration: 5 hr 7 min

* Builder: spider1.cloudius-systems.com

I can't figure out what failed. In any case, dtest cannot fail on this PR, because the only code change is a comment typo... So I restarted CI.

@avikivity
Copy link
Copy Markdown
Member

🔴 CI State: ABORTED

✅ - Build ✅ - Unit Tests ❌ - dtest

Build Details:

* Duration: 5 hr 7 min

* Builder: spider1.cloudius-systems.com

I can't figure out what failed. In any case, dtest cannot fail on this PR, because the only code change is a comment typo... So I restarted CI.

A flaky test can fail even on no changes.

It's important to figure out what failed, or we'll continue to be plagues by flaky tests.

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 24, 2023

It's important to figure out what failed, or we'll continue to be plagues by flaky tests.

I couldn't find any specific test failing. I didn't understand the log and no specific test seems to be mentioned. I am guessing this was some infrastructure flakyness, spot instance killing, or I don't know what, but don't know how to prove it.

@scylladb-promoter
Copy link
Copy Markdown
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests
✅ - dtest

Build Details:

  • Duration: 4 hr 4 min
  • Builder: spider2.cloudius-systems.com

@tchaikov
Copy link
Copy Markdown
Contributor

tchaikov commented Dec 25, 2023

It's important to figure out what failed, or we'll continue to be plagues by flaky tests.

I couldn't find any specific test failing. I didn't understand the log and no specific test seems to be mentioned. I am guessing this was some infrastructure flakyness, spot instance killing, or I don't know what, but don't know how to prove it.

17:19:47  [conditionalRetry] Tries left 51
...
17:20:04  Waiting for next available executor on ‘[ec2-strong-dtest-asg-spot-releng](https://jenkins.scylladb.com/label/ec2-strong-dtest-asg-spot-releng/)’
17:20:04  Still waiting to schedule task
17:20:04  Waiting for next available executor on ‘[ec2-strong-dtest-asg-spot-releng](https://jenkins.scylladb.com/label/ec2-strong-dtest-asg-spot-releng/)’
17:20:04  Still waiting to schedule task
17:20:04
...
Waiting for next available executor on ‘[ec2-strong-dtest-asg-spot-releng](https://jenkins.scylladb.com/label/ec2-strong-dtest-asg-spot-releng/)’
21:17:09  Cancelling nested steps due to timeout

the job waited 4 hours for an executor with the given label. guess all of the nodes (https://jenkins.scylladb.com/label/ec2-strong-dtest-asg-spot-releng/) were occupied in that 4 hours?

@mykaul
Copy link
Copy Markdown
Contributor

mykaul commented Dec 25, 2023

@benipeled ^^^ ?

@benipeled
Copy link
Copy Markdown
Contributor

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 26, 2023

Review ping.
Note that despite the discussion above, the CI did pass.

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 27, 2023

Another ping. @scylladb/scylla-maint

@avikivity
Copy link
Copy Markdown
Member

It's important to figure out what failed, or we'll continue to be plagues by flaky tests.

I couldn't find any specific test failing. I didn't understand the log and no specific test seems to be mentioned. I am guessing this was some infrastructure flakyness, spot instance killing, or I don't know what, but don't know how to prove it.

The trick is to ask for help rather than ignore the problem.


We use a cluster of just two nodes and RF=1, and control the tablets
so all base tablets will be in node 0 and all view tablets will be
in node 1, to ensure that the view update is remote and therefore
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work once we fix local indexes, as local indexes will be co-located with the table even when using tablets.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is is currently written will force the view's tablets to be moved to a different node, so if "co-location" will forbid this, the test will fail clearly. If it won't be forbidden to explicitly move the tablets, the test will continue to work, but arguably won't really be an interesting regression test because it tests a non-realistic scenario. I'm not sure exactly what to do. I guess that when we do this "co-location" thing we'll need to add more tests.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be forbidden.

@scylladb-promoter scylladb-promoter merged commit 6394854 into scylladb:master Dec 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants