Some cleanups in tests for tablets + MV #16440
Some cleanups in tests for tablets + MV #16440scylladb-promoter merged 3 commits intoscylladb:masterfrom
Conversation
|
I marked this PR a draft because it depends on two yet-unmerged PRs. Only the last two commits are new commits - the rest are stuff from those earlier PRs. I'll rebase and mark this PR ready to review when those earlier PRs go in. |
eliransin
left a comment
There was a problem hiding this comment.
I mainly have questions about the proper place to deal with tablet information and
tablet manipulation functions. Going forward it seams that we will have increasing number of tests relying on such functionality.
| # ensure that if you send tablet-migration commands to one server, you also | ||
| # read the replicas information from the same server (it takes time for this | ||
| # information to propagate to all servers). | ||
| async def get_tablet_replicas(manager: ManagerClient, server: ServerInfo, keyspace_name: str, table_or_view_name: str, token: int): |
There was a problem hiding this comment.
Wouldn't this function will better fit in the manager itself?
There was a problem hiding this comment.
This is exactly what I'm trying to avoid - random test developers coming up with what they think might be some useful general-purpose functions and putting those functions in some "global" place, and then when the next test developer comes along, they discover they need slight (or large) variations on these functions. You can see the result in dtest - a large number of functions that nobody understands and have dozens of options each.
Please note that this function also appears in test_tablets.py (written by @tgrabiec), but in a different form - it didn't support materialized views. So 1. Tomek also didn't put this in manager, 2. It demonstrates my point that a different test couldn't use the same function exactly, and had to modify it anyway.
Eventually, it will be nice to take some common useful functions and move them into a central place, but I don't want to do it too early. Also, this has absolutely no business being in the "manager". The "manager" is about managing a cluster of nodes - we don't need to stuff collections of utility functions into the manager, unless we want the result to be a huge monolithic mess.
There was a problem hiding this comment.
Please note that this function also appears in test_tablets.py (written by @tgrabiec), but in a different form - it didn't support materialized views. So 1. Tomek also didn't put this in manager, 2. It demonstrates my point that a different test couldn't use the same function exactly, and had to modify it anyway.
I think this function is actually a counter example. get_tablet_replicas() is something which many tablet test cases will use, we should just make it work for all kinds of tables. The first usage which didn't take views into account won't mind.
Agree that the manager is not the best place.
There was a problem hiding this comment.
In my most recent version (I'll push it in a few minutes), I strayed even further from your original implementation - I no longer need the extra get_table_id() and instead do a full-table scan to look for the right table without knowing its id. The result is much simpler, but I don't know if you'd like to use it or not. It's also six lines of code. It's not a library, it's not a 300-line implementation of some traditional data structure you learn in CS courses - it's an ad-hoc 6-line function. I don't want to decide right now if it should be shared with other tests or not. I don't have to decide it right now. If later these 6-line function is determined to be very useful, we can move it to a common place.
| # verify that a test that attempted to enable tablets - and set up only | ||
| # one tablet for the entire table - actually succeeded in doing that. | ||
| async def assert_one_tablet(cql, keyspace_name, table_or_view_name): | ||
| rows = await cql.run_async(f"SELECT last_token, replicas FROM system.tablets where keyspace_name = '{keyspace_name}' and table_name = '{table_or_view_name}' ALLOW FILTERING") |
There was a problem hiding this comment.
I am tempted to also ask about this part if it doesn't belong in the manager, maybe as some api that returns tablet information optionally for a keyspace or keyspace,table combination
it sounds like a useful API for future tests as well.
There was a problem hiding this comment.
Again, maybe it will be useful, maybe it won't.
Most importantly, as you said, if you wanted to move something into a library, it shouldn't be the specific assert_one_tablet() but rather something more general to enquire about tablets - basically just the one run_async line. If I were to move assert_one_tablet() into a library the result would be - again - dtest - a big mess where if the test fails you don't understand what it does because all it does is to call into utility functions you are not familiar with and are probably buggy themselves.
Beyond everything I wrote in my replies to your smaller comments, one thing that is important to remember is you can always share functions between different tests without moving these functions into the global "manager". You can, in increasing granularity:
I think that jumping directly to step 4 is a big mistake. |
a24c163 to
efd9995
Compare
|
Rebased on top of the latest master which already has my previous tablets+MV PRs, so now all the commits in this PR are relevant, and it is ready for review. |
The test test_tablet_alternator_lsi_consistency, checking that Alternator LSI allow strongly-consistent reads even with tablets, used a large cluster (10 nodes), to improve the chance of reaching an "unlucky" tablet placement - and even then only failed in about half the runs without the code fixed. In this patch, we rewrite the test using a much more reliable approach: We start only two nodes, and force the base's tablet onto one node, and the view's table onto the second node. This ensures with 100% certainty that the view update is remote, and the new test fails every single time before the code fix (I reverted the fix to verify) - and passes after it. The new test is not only more reliable, it's also significantly faster because it doesn't need to start a 10-node cluster. We can also remove the tag that excluded this test from debug build mode tests because the 10-node boot was too slow. Signed-off-by: Nadav Har'El <[email protected]>
The tests looked up a table's tablets in an elaborate two-stage search - first find the table's "id", and then look up this id in the list of tablets. It is much simpler to just look up the table's name directly in the list of tablets - although this name is not a key, an ALLOW FILTERING search is good enough for a test. As a bonus, with the new technique we don't care if the given name is the name of a table or a view, further simplifying the test. This is just a test code cleanup - there is no functional change in the test. Signed-off-by: Nadav Har'El <[email protected]>
Signed-off-by: Nadav Har'El <[email protected]>
efd9995 to
6640278
Compare
🔴 CI State: ABORTED✅ - Build Build Details:
|
I can't figure out what failed. In any case, dtest cannot fail on this PR, because the only code change is a comment typo... So I restarted CI. |
A flaky test can fail even on no changes. It's important to figure out what failed, or we'll continue to be plagues by flaky tests. |
I couldn't find any specific test failing. I didn't understand the log and no specific test seems to be mentioned. I am guessing this was some infrastructure flakyness, spot instance killing, or I don't know what, but don't know how to prove it. |
🟢 CI State: SUCCESS✅ - Build Build Details:
|
the job waited 4 hours for an executor with the given label. guess all of the nodes (https://jenkins.scylladb.com/label/ec2-strong-dtest-asg-spot-releng/) were occupied in that 4 hours? |
|
@benipeled ^^^ ? |
|
AWS quota-limit reported and handled on https://github.com/scylladb/scylla-pkg/issues/3758 |
|
Review ping. |
|
Another ping. @scylladb/scylla-maint |
The trick is to ask for help rather than ignore the problem. |
|
|
||
| We use a cluster of just two nodes and RF=1, and control the tablets | ||
| so all base tablets will be in node 0 and all view tablets will be | ||
| in node 1, to ensure that the view update is remote and therefore |
There was a problem hiding this comment.
This won't work once we fix local indexes, as local indexes will be co-located with the table even when using tablets.
There was a problem hiding this comment.
The test is is currently written will force the view's tablets to be moved to a different node, so if "co-location" will forbid this, the test will fail clearly. If it won't be forbidden to explicitly move the tablets, the test will continue to work, but arguably won't really be an interesting regression test because it tests a non-realistic scenario. I'm not sure exactly what to do. I guess that when we do this "co-location" thing we'll need to add more tests.
This small series improves two things in the multi-node tests for tablet supports in materialized views:
The third patch in this series also fixes a comment typo discovered in a previous review.