Add support for tablets in Alternator#16353
Add support for tablets in Alternator#16353scylladb-promoter merged 6 commits intoscylladb:masterfrom
Conversation
🟢 CI State: SUCCESS✅ - Build Build Details:
|
| auto ksm = create_keyspace_metadata(keyspace_name, sp, gossiper, ts, tags_map); | ||
| try { | ||
| schema_mutations = service::prepare_new_keyspace_announcement(sp.local_db(), ksm, ts); | ||
| schema_mutations = service::prepare_new_keyspace_announcement(sp.local_db(), ksm, ts); |
There was a problem hiding this comment.
This is a code style regression
There was a problem hiding this comment.
Sorry, I don't know what happened here, I'll fix it.
There was a problem hiding this comment.
More likely, a cat on my keyboard + me too lazy to read the patch before sending it :-(
| auto opts = get_network_topology_options(sp, gossiper, rf); | ||
|
|
||
| // Tablets are not yet enabled by default on Alternator tables, because of | ||
| // missing support for CDC (see issue #16313). Until then, allow |
There was a problem hiding this comment.
Because they're an experimental feature, too, and incomplete.
There was a problem hiding this comment.
When I wrote "not enabled by default", I meant not even if the tablets experimental features is turned on. Even with the tablets experimental feature, I didn't want to enable tablets on Alternator by default. By the way, the same is currently the case in CQL - it's not enough to enable the experimental feature, you need to enable it per table.
| tracing::trace(tr_state, "Failed to apply view update for {} and {} remote endpoints", | ||
| // The "delay_before_remote_view_update" injection point can be | ||
| // used to add a short delay (currently 0.5 seconds) before a base | ||
| // replica sends its update to the remove view replica. |
There was a problem hiding this comment.
Isn't using a delay asking for a failure on a busy system? In the case of tests, probably false negatives.
Probably, such a fault injection point should pause to coordinate on some named http endpoint. For example:
- test setup registers an http endpoint (provided by the test script) with the API server
- inject() calls the http endpoint with the input string
- the http server does something and returns a response
- based on the response, the injector continues normally or throws an exception
This allows the test system to inject any action, not just a delay.
We won't do this now, but maybe we'll find the time to reorganize this one day.
There was a problem hiding this comment.
You're absolutely right - a fixed delay may add false negatives - if the test machine is so slow that it can't perform a single read in 0.5 seconds, it may miss a bug. But if there is a bug, in some of the runs it won't be missed, and that's enough. Moreover, during the development process - when I wanted to reproduced the bug - I ran it on an idle machine.
Please note that this test really needs a finite delay - and not a very long delay or an infinite pause:
- When synchronous_update=false (the bug), the delay delays the view updates, and the immediate read doesn't see the data.
- When synchronous_update=true (the fix), the delay delays the view updates, but the write waits for it, and when the write returns, it sees the data.
Maybe there's a better way to do it, but I didn't think of a different approach that is reasonable. I also tried to work within what the injection framework (that is the first time I ever used...) gives me and not to reinvent it.
|
#16364 conflicts with this PR - after that other PR the way to enable or disable tablets will be "tablets: true" / "tablets: false", not via initial_tablets. So if that PR goes in first, I'll need to redo parts of this PR. |
wmitros
left a comment
There was a problem hiding this comment.
Looks good to me, however, reading this patch set has been mostly a learning experience for me as I'm not that familiar with the Alternator implementation, and the changes were good for that as they included an abundance of comments.
eliransin
left a comment
There was a problem hiding this comment.
I also think that eventually our injection should be more precise that just injecting some delays but it is not something for now 🙂
| # an async wrapper to the boto3 functions ourselves (e.g., run them in a | ||
| # separate thread) ourselves. | ||
| @pytest.mark.asyncio | ||
| async def test_tablet_alternator_lsi_consistency(manager: ManagerClient): |
There was a problem hiding this comment.
Is there any point running this on release mode?
In release mode the injection will not do anything so it is kind of pointless to run it (unless you are counting on accidentally catching this bug without the delay.
There was a problem hiding this comment.
I am not strongly opinionated about this so maybe we can mark this test (and probably should do it for all other tests that uses injection) somehow in a follow up.
There was a problem hiding this comment.
@eliransin I don't know that injection is ignored in release mode. Why?
It means that any test that uses injection cannot be run on release mode - just on dev mode. Are we fine with that? CC @avikivity
The way I would do this in the cql-pytest framework is to skip a test when it cannot be done on this build mode - the inject function can detect it's not supported and skip the calling test - with no need for any separate marks. I can do this here too but am still trying to figure out the conventions of the topology test framework. Note that my absolute last preference would be to manually list this test in some yaml file of excluded tests as I've seen people doing - I think this approach is a mistake.
There was a problem hiding this comment.
@nyh This is how you skip a test in release mode:
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_table_dropped_during_streaming(manager: ManagerClient):
If you skip the test before invoking the injection API we save some run time because we avoid cluster bootstrap.
There was a problem hiding this comment.
Thanks, I'll do this, though I don't love the need to remember to add this marker.
I still have an uneasy feeling about the principle though - just a few days we had a big discussion about needed more "testability", more injection points and so on that will make it easier to write reliable and shorter tests. But if we say that these injection points don't work in release mode, isn't this a problem? Doesn't it mean we should avoid using injection points in tests, or at least most tests, so that we'll still have most testings be able to run on "release mode" (which is the most important binary to be testing!!!!)
CC @avikivity I wonder what your thoughts on this are.
There was a problem hiding this comment.
I don't see a problem. We're testing the algorithm here.
There's a chance that there's an additional bug that is hidden by both dev and debug modes and is exposed by release mode, but the probability is low, and together with the low probability of the injected failure happening in production, I think it's okay.
It's possible to control when the injection should stop sleeping from the test, see 7d0f4c1. Injection points can wait on messages. |
I don't think it's relevant to this test, unfortunately. The whole point of this LSI consistency check is to verify that if the view update takes a long time (e.g., 0.5 seconds) then the base write waits for it (i.e., it is a "synchronous view update"). So we need the view update to actually wait for some time - I don't know if we should use 0.5 seconds or 50 seconds, but it needs to be some sort of fixed wait. It's not that the test needs to tell the server when to stop sleeping - the server doesn't know when it wants to stop sleeping. Again I want to emphasize that a short sleep is important for keeping the test short, and the only downside of a too-short sleep is that it can cause a false-negative - i.e., a failed test will not be recognized on a very slow and overloaded test machine because it took more than 0.5 seconds for the test to even try to read. Even in that case, if the code is buggy, it will sometimes fail (if the machine happens to not be super-slow), which may be good enough. I can increase 0.5 seconds to 30 seconds, but it means that a successful test will always take 30 seconds. In dtest I wouldn't think twice about this, but here I do. |
| view_ptr(view_builder.build()), ts, true, schema_mutations); | ||
| view, ts, true, schema_mutations); | ||
| // add_table_or_view_to_schema_mutation() is a low-level function that | ||
| // doesn't call the callbacks that prepare_new_view_announcement() |
There was a problem hiding this comment.
is there a reason to not use prepare_new_view_announcement here?
There was a problem hiding this comment.
Unfortunately, yes. Each of these functions makes some very specific assumptions - for example prepare_new_view_announcement works like CREATE VIEW - adding a view to an existing table. Alternator creates a keyspace, base-table and view all in one operation.
| // LSIs have no tags, but Scylla's "synchronous_updates" feature | ||
| // (which an LSIs need), is actually implemented as a tag so we | ||
| // need to add it here: | ||
| std::map<sstring, sstring> tags_map = {{db::SYNCHRONOUS_VIEW_UPDATES_TAG_KEY, "true"}}; |
There was a problem hiding this comment.
Eeek. You're right. I'll fix that.
| // (which an LSIs need), is actually implemented as a tag so we | ||
| // need to add it here: | ||
| std::map<sstring, sstring> tags_map = {{db::SYNCHRONOUS_VIEW_UPDATES_TAG_KEY, "true"}}; | ||
| view_builder.add_extension(db::tags_extension::NAME, ::make_shared<db::tags_extension>(tags_map)); |
There was a problem hiding this comment.
is tags_map copied here? maybe move or construct in place
There was a problem hiding this comment.
The tags_extension constructor takes a const std::map<sstring, sstring>& tags so it copies, indeed. I'll try to remember to follow this up with a fix for this class and all its callers to avoid this copy, although I think this is completely negligible - we're talking here about one of the slowest operations in Scylla - creating a table with a view (it's even worse in DynamoDB - taking many seconds), copying an in-memory map with one item isn't a large part of it.
|
|
||
| # Alternator convenience function for fetching the entire result set of a | ||
| # query into an array of items. | ||
| def full_query(table, ConsistentRead=True, **kwargs): |
There was a problem hiding this comment.
wouldn't be better to import those functions from alternator/util? deduplication later is harder, especially in case the code diverges
There was a problem hiding this comment.
I would vote no. We should minimize the number of convenience functions, and it's not always possible to share them between suites because of completely different assumptions and coding styles.
Sometimes copying 7 lines of code is better than turning the world up-side-down just to be able to reuse a 7 line function. It's also much easier for reviewers and future readers of this test to see in front of them the 7-line convenience function, instead of trying to inspect a huge library what this function really does (if you worked on dtest, you'll know what I mean and what I'm trying to avoid).
There was a problem hiding this comment.
Duplicating functions doesn't minimize it's number unless you count differently. My main point is that different suites should not be treated as separate projects, that's unfortunate. The coding style should converge instead of allowing more more differgence.
There was a problem hiding this comment.
- I meant we should minimize the number of global convenience functions. I don't know how many dtests you wrote, but their global "convenience" functions are everything that I want to avoid: I never understand them when I read a test, there are a hundred of them and each one takes a hundred different parameters. It's something I don't want to duplicate in my own tests.
- I don't know if "suites should be treated as separate projects" but in reality they are. And to be honest, if I need to choose between duplicating a 7-line function and enforcing irrelevant standards across the board (e.g., you must use async in test/alternator because we wanted to use it in test/topology!), I prefer duplicating a 7-line function.
Specifically, I noted above in the comment that according to the test/topology_experimental_raft coding standard, these Alternator tests will eventually need to be moved to using some sort of async library, not boto3. This full_query implementation will need to become async as well. The one in Alternator won't. I don't want to make 800 tests in test/alternator harder to read and write just to make 7 tests in test/topology_experimental_raft happy.
Finally, my philosophy of these convenience functions is that they should be very short and very simple - the exact opposite of dtest's functions that take 10 different parameters to suite 10 different needs. Such short and simple functions should not "diverge" over time. They shouldn't even change (or change very rarely) - my whole point is that I don't want people to start adding more and more features to the existing functions.
There was a problem hiding this comment.
The problem you're describing falls more into good library api design category than duplicate vs code reuse. Comparing bad design to different bad design doesn't prove much in absolute terms. Agree that one should not create functions like
def full_scan(table, ConsistentRead=True, Count=0, WithRestart=false, OnlyEven=true, **kwargs)
but something like
def full_scan(table)
def full_scan_and_count(table)
but that's orthogonal to code duplication. It's not very bad though as you said function is very small. Just for the future I'd expect alternator part of topology tests to grow - ideally alternator should have some more system testing not only single node api tests.
I am not fully grasping async thing in python, mostly I think it's a distraction currently. I don't see why we should have both async and non async functions (see coroutines vs continuations pattern). Either this or that, mixed state should be seen as temporary not the goal.
There was a problem hiding this comment.
Let's reconsider the library issue when it's clearer what the library should contain. My experience with dtest's collection of convenience functions is terrible (I don't think that this review is the place to discuss it), whereas in cql-pytest it mostly worked because the "library" is so small, deliberately.
Regarding async - the topology guys decided to use it, but it's not used in test/alternator. So I don't understand what you are proposing. Should we edit 800 Alternator test functions, almost 20,000 lines of code, to make everything async? Just so we could reuse a 7-line function (or maybe five such functions) between the projects? I'm not convinced.
|
so in theory there is no test checking if enabling tablets is doing anything in alternator? |
As explained in issue scylladb#16203, we cannot yet enable tablets on Alternator keyspaces by default, because support for some of the features that Alternator needs, such as CDC, is not yet available. Nevertheless, to start testing Alternator integration with tablets, we want to provide a way to enable tablets in Alternator for tests. In this patch we add support for a tag, 'experimental:initial_tablets', which if added on a table during creation, uses tablets for its keyspace. The value of this tag is a numeric string, and it is exactly analogous to the 'initial_tablets' property we have in CQL's NetworkTopologyStrategy. We name this tag with the "experimental:" prefix to emphesize that it is experimental, and the way to enable or disable tablets will probably change later. The new tag only has effect when added while *creating* a table. Adding, deleting or changing it later on an existing table will have no effect. A later patch will have tests that use this tag to test Alternator with tablets. Refs scylladb#16203. Signed-off-by: Nadav Har'El <[email protected]>
In commit 88a5dda, we fixed materialized view creation to support tablets. We added to the function called to create materialized views in CQL, prepare_new_view_announcement() a missing call to the on_before_create_column_family() notifier that creates tablets for this new view. We have the same problem in Alternator when creating a view (GSI or LSI). The Alternator code does not use prepare_new_view_announcement(), and instead uses the lower-level function add_table_or_view_to_schema_mutation() so it didn't get the call to the notifier, so we must add it here too. Before this patch, creating an Alternator table with tablets (which has become possible after the previous patch) fails with "Tablet map not found for table <uuid>". With this patch, it works. A test for materialized views in Alternator will come in a following patch, and will test everything together - the CreateTable tag to use tablets (from the previous patch), the LSI/GSI creation (fixed in this patch) and the correct consistency of the LSI (fixed in the next patch). Signed-off-by: Nadav Har'El <[email protected]>
Good question. In the later patches, I use the tablet-enabling tag to enable tablets and then (before fixing the bugs) obvious bugs appeared, so I know that this tag had an effect. However, you're right that we should have a regression test that the tablet-enabling tag actually does something and continues to do something (and isn't, say, just outright ignored). I'll add such a test in a followup PR. |
DynamoDB's *local* secondary index (LSI) allows strongly-consistent reads from the materialized view, which must be able to read what was previously written to the base. To support this, we need the view to use the "synchronous_updates". Previously, with vnodes, there was no need for using this option explicitly, because an LSI has the same partition key as the base table so the base and view replicas are the same, and the local writes are done synchronously. But with tablets, this changes - there is no longer a guarantee that the base and view tablets are located on the same node. So to restore the strong consistency of LSIs when tablets are enabled, this patch explicitly adds the "synchronous_updates" option to views created by Alternator LSIs. We do *not* add this option for GSIs - those do not support strongly-consistent reads. This fix was tested by a test that will be introduced in the following patches. The test showed that before this patch, it was possible that reading with ConsistentRead=True from an LSI right after the base was written would miss the new changes, but after this patch, it always sees the new data in the LSI. Fixes scylladb#16313. Signed-off-by: Nadav Har'El <[email protected]>
It's difficult to write a test (as we plan to do in to in the next patch) that verifies that synchronous view updates are indeed synchronous, i.e., that write with CL=QUORUM on the base-table write returns only after CL=QUORUM was also achieved in the view table. The difficulty is that in a fast test machine, even if the synchronous-view-update is completely buggy, it's likely that by the time the test reads from the view, all view updates will have been completed anyway. So in this patch we introduce an injection point, for testing, named "delay_before_remote_view_update", which adds a delay before the base replica sends its update to the remote view replica (in case the view replica is indeed remote). As usual, this injection point isn't configurable - when enabled it adds a fixed (0.5 second) delay, on all view updates on all tables. The existing code used continuation-style Seastar programming, and the addition of the injection point in this patch made it even uglier, so in the next patch we will coroutine-ize this code. Signed-off-by: Nadav Har'El <[email protected]>
In the previous patch we added a delay injection point (for testing) in the view update code. Because the code was using continuation style, this resulted in increased indentation and ugly repetition of captures. So in this patch we coroutinize the code that waits for remote view updates, making it simpler, shorter, and less indented. Note that this function still uses continuations in one place: The remote view update is still composed of two steps that need to happen one after another, but we don't necessarily need to wait for them to happen. This is easiest to do with chaining continuations, and then either waiting or not waiting for the resulting future. Signed-off-by: Nadav Har'El <[email protected]>
|
Pushed a new version of this PR, addressing a few minor comments but not all of them (see my replies to your comments). I already have a house of cards built on top of this PR (e.g., fixing secondary indexes on tablets) so would like to get this PR in to build on it, and will continue to improve using some more suggestions from this review. Please note that in particular, the test I have here for Alternator LSI consistency is awful, and depending on a 10-node cluster's random tablet placement to reproduce - sometimes - the inconsistency. In followup patches (for CQL LSIs) I developed a much better test setup, using just 2 nodes and deliberately moving tablets (so the base's tablets are on one node, the view's on the other nodes) for 100% reliable test failure, so later I'll send a followup to this PR to replace its test by a better one. But first I'd like to get this PR in before I get a stack overflow in my brain :-) |
🟢 CI State: SUCCESS✅ - Build Build Details:
|
In a followup patch f4b21d9 I rewrite the Alternator LSI test, and as part of it I verify that the test really sees that the Alternator table really uses tablets (because the test would trivially pass before tablets). |
|
Hi @scylladb/scylla-maint @tgrabiec there is a lot of discussion here but two positive reviews and no negative reviews so please consider merging this PR - I already have several newer PRs for tablets+secondary-index support waiting to be rebased on this one. |
The pull requests adds support for tablets in Alternator, and particularly focuses in getting Alternator's GSI and LSI (i.e., materialized views) to work. After this series support for tablets in Alternator _mostly_ work, but not completely: 1. CDC doesn't yet work with tablets, and Alternator needs to provide CDC (known as "DynamoDB Streams"). 2. Alternator's TTL feature was not tested with tablets, and probably doesn't work because it assumes the replication map belongs to a keyspace. Because of these reasons, Alternator does not yet use tablets by default and it needs to be enabled explicitly be adding an experimental tag to the new table. This will allow us to test Alternator with tablets even before it is ready for the limelight. Fixes #16203 Fixes #16313 Closes #16353 * github.com:scylladb/scylladb: mv, tablets, alternator: test for Alternator LSI with tablets mv: coroutinize wait code for remote view updates mv, test: add injection point to delay remove view update alternator: explicitly request synchronous updates for LSI alternator: fix view creation when using tablets alternator: add experimental method to create a table with tablets
|
Dequeued, causes next promotion failures It looks like the test simply timed out waiting to bootstrap 10 nodes. This node was booting for 17 minutes. This included waiting for other nodes to boot, which took 14 minutes. I'm not sure if we can afford having such long tests in test.py. Another problem is that this test file already has multiple long test cases. They should be in separate test files or they cannot be run concurrently. Regarding the failing test itself, the suggestion from comment makes sense to me maybe we could also use an error injection instead of API? |
We have an API to control tablet placement, you can move tablets with manager.api.move_tablet() |
|
@kbr-scylla pushed a new version with a last patch that adds the tablets+MV tests (the entire file...) to the skip-on-debug list, so we don't have the timeout on the arm+debug build. I hate this solution, but I can't fix everything in this PR - better solutions will have to come later. As I already noted, there is already a PR in the pipeline to switch this test to using just two nodes. But a 10 nodes test shouldn't have broken the world. |
|
@nyh why not use the BTW. the file should be probably split into multiple files anyway so the test cases can run concurrently. |
No, I think that's a mistake. There's elegance in putting related things in the same file (this isn't Java), and this is NOT a big test file, and not even the biggest test file (test_tablets.py is bigger). |
Because I didn't realize it works (I used it for skipping release - copying Tomek's use of it, but didn't realize I could use it for debug as well). Ok, I'll push a new version like that. |
This patch adds a test (in the topology test framework) for issue scylladb#16313 - the bug where Alternator LSI must use synchronous view updates but didn't. This test fails with high probability (around 50%) before the previous patch, which fixed this bug - and passes consistently after the patch (I ran it 100 times and it didn't fail even once). This is the first test in the topology framework that uses the DynamoDB API and not CQL. This required a couple of tiny convenience functions, which are introduced in the only test file that uses them - but if we want we can later move them out to a library file. Unfortunately, the standard AWS SDK for Python - boto3 - is *not* asynchronous, so this test is also not really asynchronous, and will block the event loop while making requests to Alternator. However, for now it doesn't matter (we do NOT run multiple tests in the same event loop), and if it ever matters, I mentioned a couple of options what we can do in a comment. Because this test uses a 10-node cluster, it is skipped in debug-mode runs. In a later patch we will replace it by a more efficent - and more reliable - 2-node test. Refs scylladb#16313 Signed-off-by: Nadav Har'El <[email protected]>
|
Pushed a new version that uses |
🔴 CI State: FAILURE✅ - Build Failed Tests (1/23482):Build Details:
|
A known flaky test: #16219 |
🟢 CI State: SUCCESS✅ - Build Build Details:
|
|
Hallelujah. @kbr-scylla please consider re-merging this PR. It now has a @skip_mode to skip the 10-node test on debug mode. And as I said, this 10-node test will soon disappear anyway (PR #16440 gets rid of it, and I can push these next PRs as soon as this one finally gets in) |
|
@scylladb/scylla-maint merge ping. |
|
Qd |
The pull requests adds support for tablets in Alternator, and particularly focuses in getting Alternator's GSI and LSI (i.e., materialized views) to work.
After this series support for tablets in Alternator mostly work, but not completely:
Because of these reasons, Alternator does not yet use tablets by default and it needs to be enabled explicitly be adding an experimental tag to the new table. This will allow us to test Alternator with tablets even before it is ready for the limelight.
Fixes #16203
Fixes #16313