Skip to content

Add support for tablets in Alternator#16353

Merged
scylladb-promoter merged 6 commits intoscylladb:masterfrom
nyh:fix-16313
Dec 20, 2023
Merged

Add support for tablets in Alternator#16353
scylladb-promoter merged 6 commits intoscylladb:masterfrom
nyh:fix-16313

Conversation

@nyh
Copy link
Copy Markdown
Contributor

@nyh nyh commented Dec 10, 2023

The pull requests adds support for tablets in Alternator, and particularly focuses in getting Alternator's GSI and LSI (i.e., materialized views) to work.

After this series support for tablets in Alternator mostly work, but not completely:

  1. CDC doesn't yet work with tablets, and Alternator needs to provide CDC (known as "DynamoDB Streams").
  2. Alternator's TTL feature was not tested with tablets, and probably doesn't work because it assumes the replication map belongs to a keyspace.

Because of these reasons, Alternator does not yet use tablets by default and it needs to be enabled explicitly be adding an experimental tag to the new table. This will allow us to test Alternator with tablets even before it is ready for the limelight.

Fixes #16203
Fixes #16313

@scylladb-promoter
Copy link
Copy Markdown
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 2 hr 27 min
  • Builder: spider4.cloudius-systems.com

Comment thread alternator/executor.cc Outdated
auto ksm = create_keyspace_metadata(keyspace_name, sp, gossiper, ts, tags_map);
try {
schema_mutations = service::prepare_new_keyspace_announcement(sp.local_db(), ksm, ts);
schema_mutations = service::prepare_new_keyspace_announcement(sp.local_db(), ksm, ts);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a code style regression

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't know what happened here, I'll fix it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disk corruption?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More likely, a cat on my keyboard + me too lazy to read the patch before sending it :-(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Comment thread alternator/executor.cc
auto opts = get_network_topology_options(sp, gossiper, rf);

// Tablets are not yet enabled by default on Alternator tables, because of
// missing support for CDC (see issue #16313). Until then, allow
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because they're an experimental feature, too, and incomplete.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I wrote "not enabled by default", I meant not even if the tablets experimental features is turned on. Even with the tablets experimental feature, I didn't want to enable tablets on Alternator by default. By the way, the same is currently the case in CQL - it's not enough to enable the experimental feature, you need to enable it per table.

Comment thread db/view/view.cc
tracing::trace(tr_state, "Failed to apply view update for {} and {} remote endpoints",
// The "delay_before_remote_view_update" injection point can be
// used to add a short delay (currently 0.5 seconds) before a base
// replica sends its update to the remove view replica.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't using a delay asking for a failure on a busy system? In the case of tests, probably false negatives.

Probably, such a fault injection point should pause to coordinate on some named http endpoint. For example:

  1. test setup registers an http endpoint (provided by the test script) with the API server
  2. inject() calls the http endpoint with the input string
  3. the http server does something and returns a response
  4. based on the response, the injector continues normally or throws an exception

This allows the test system to inject any action, not just a delay.

We won't do this now, but maybe we'll find the time to reorganize this one day.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right - a fixed delay may add false negatives - if the test machine is so slow that it can't perform a single read in 0.5 seconds, it may miss a bug. But if there is a bug, in some of the runs it won't be missed, and that's enough. Moreover, during the development process - when I wanted to reproduced the bug - I ran it on an idle machine.

Please note that this test really needs a finite delay - and not a very long delay or an infinite pause:

  1. When synchronous_update=false (the bug), the delay delays the view updates, and the immediate read doesn't see the data.
  2. When synchronous_update=true (the fix), the delay delays the view updates, but the write waits for it, and when the write returns, it sees the data.

Maybe there's a better way to do it, but I didn't think of a different approach that is reasonable. I also tried to work within what the injection framework (that is the first time I ever used...) gives me and not to reinvent it.

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 11, 2023

#16364 conflicts with this PR - after that other PR the way to enable or disable tablets will be "tablets: true" / "tablets: false", not via initial_tablets. So if that PR goes in first, I'll need to redo parts of this PR.

Copy link
Copy Markdown
Contributor

@wmitros wmitros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, however, reading this patch set has been mostly a learning experience for me as I'm not that familiar with the Alternator implementation, and the changes were good for that as they included an abundance of comments.

Copy link
Copy Markdown

@eliransin eliransin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that eventually our injection should be more precise that just injecting some delays but it is not something for now 🙂

# an async wrapper to the boto3 functions ourselves (e.g., run them in a
# separate thread) ourselves.
@pytest.mark.asyncio
async def test_tablet_alternator_lsi_consistency(manager: ManagerClient):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any point running this on release mode?
In release mode the injection will not do anything so it is kind of pointless to run it (unless you are counting on accidentally catching this bug without the delay.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not strongly opinionated about this so maybe we can mark this test (and probably should do it for all other tests that uses injection) somehow in a follow up.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eliransin I don't know that injection is ignored in release mode. Why?
It means that any test that uses injection cannot be run on release mode - just on dev mode. Are we fine with that? CC @avikivity

The way I would do this in the cql-pytest framework is to skip a test when it cannot be done on this build mode - the inject function can detect it's not supported and skip the calling test - with no need for any separate marks. I can do this here too but am still trying to figure out the conventions of the topology test framework. Note that my absolute last preference would be to manually list this test in some yaml file of excluded tests as I've seen people doing - I think this approach is a mistake.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nyh This is how you skip a test in release mode:

@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_table_dropped_during_streaming(manager: ManagerClient):

If you skip the test before invoking the injection API we save some run time because we avoid cluster bootstrap.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll do this, though I don't love the need to remember to add this marker.
I still have an uneasy feeling about the principle though - just a few days we had a big discussion about needed more "testability", more injection points and so on that will make it easier to write reliable and shorter tests. But if we say that these injection points don't work in release mode, isn't this a problem? Doesn't it mean we should avoid using injection points in tests, or at least most tests, so that we'll still have most testings be able to run on "release mode" (which is the most important binary to be testing!!!!)
CC @avikivity I wonder what your thoughts on this are.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a problem. We're testing the algorithm here.

There's a chance that there's an additional bug that is hidden by both dev and debug modes and is exposed by release mode, but the probability is low, and together with the low probability of the injected failure happening in production, I think it's okay.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do this skip.

@tgrabiec
Copy link
Copy Markdown
Contributor

I also think that eventually our injection should be more precise that just injecting some delays but it is not something for now 🙂

It's possible to control when the injection should stop sleeping from the test, see 7d0f4c1. Injection points can wait on messages.

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 13, 2023

I also think that eventually our injection should be more precise that just injecting some delays but it is not something for now 🙂

It's possible to control when the injection should stop sleeping from the test, see 7d0f4c1. Injection points can wait on messages.

I don't think it's relevant to this test, unfortunately. The whole point of this LSI consistency check is to verify that if the view update takes a long time (e.g., 0.5 seconds) then the base write waits for it (i.e., it is a "synchronous view update"). So we need the view update to actually wait for some time - I don't know if we should use 0.5 seconds or 50 seconds, but it needs to be some sort of fixed wait. It's not that the test needs to tell the server when to stop sleeping - the server doesn't know when it wants to stop sleeping.

Again I want to emphasize that a short sleep is important for keeping the test short, and the only downside of a too-short sleep is that it can cause a false-negative - i.e., a failed test will not be recognized on a very slow and overloaded test machine because it took more than 0.5 seconds for the test to even try to read. Even in that case, if the code is buggy, it will sometimes fail (if the machine happens to not be super-slow), which may be good enough. I can increase 0.5 seconds to 30 seconds, but it means that a successful test will always take 30 seconds. In dtest I wouldn't think twice about this, but here I do.

Comment thread alternator/executor.cc
view_ptr(view_builder.build()), ts, true, schema_mutations);
view, ts, true, schema_mutations);
// add_table_or_view_to_schema_mutation() is a low-level function that
// doesn't call the callbacks that prepare_new_view_announcement()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason to not use prepare_new_view_announcement here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, yes. Each of these functions makes some very specific assumptions - for example prepare_new_view_announcement works like CREATE VIEW - adding a view to an existing table. Alternator creates a keyspace, base-table and view all in one operation.

Comment thread alternator/executor.cc Outdated
// LSIs have no tags, but Scylla's "synchronous_updates" feature
// (which an LSIs need), is actually implemented as a tag so we
// need to add it here:
std::map<sstring, sstring> tags_map = {{db::SYNCHRONOUS_VIEW_UPDATES_TAG_KEY, "true"}};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra space before std?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eeek. You're right. I'll fix that.

Comment thread alternator/executor.cc
// (which an LSIs need), is actually implemented as a tag so we
// need to add it here:
std::map<sstring, sstring> tags_map = {{db::SYNCHRONOUS_VIEW_UPDATES_TAG_KEY, "true"}};
view_builder.add_extension(db::tags_extension::NAME, ::make_shared<db::tags_extension>(tags_map));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is tags_map copied here? maybe move or construct in place

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tags_extension constructor takes a const std::map<sstring, sstring>& tags so it copies, indeed. I'll try to remember to follow this up with a fix for this class and all its callers to avoid this copy, although I think this is completely negligible - we're talking here about one of the slowest operations in Scylla - creating a table with a view (it's even worse in DynamoDB - taking many seconds), copying an in-memory map with one item isn't a large part of it.


# Alternator convenience function for fetching the entire result set of a
# query into an array of items.
def full_query(table, ConsistentRead=True, **kwargs):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't be better to import those functions from alternator/util? deduplication later is harder, especially in case the code diverges

Copy link
Copy Markdown
Contributor Author

@nyh nyh Dec 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would vote no. We should minimize the number of convenience functions, and it's not always possible to share them between suites because of completely different assumptions and coding styles.
Sometimes copying 7 lines of code is better than turning the world up-side-down just to be able to reuse a 7 line function. It's also much easier for reviewers and future readers of this test to see in front of them the 7-line convenience function, instead of trying to inspect a huge library what this function really does (if you worked on dtest, you'll know what I mean and what I'm trying to avoid).

Copy link
Copy Markdown
Member

@nuivall nuivall Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicating functions doesn't minimize it's number unless you count differently. My main point is that different suites should not be treated as separate projects, that's unfortunate. The coding style should converge instead of allowing more more differgence.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I meant we should minimize the number of global convenience functions. I don't know how many dtests you wrote, but their global "convenience" functions are everything that I want to avoid: I never understand them when I read a test, there are a hundred of them and each one takes a hundred different parameters. It's something I don't want to duplicate in my own tests.
  2. I don't know if "suites should be treated as separate projects" but in reality they are. And to be honest, if I need to choose between duplicating a 7-line function and enforcing irrelevant standards across the board (e.g., you must use async in test/alternator because we wanted to use it in test/topology!), I prefer duplicating a 7-line function.

Specifically, I noted above in the comment that according to the test/topology_experimental_raft coding standard, these Alternator tests will eventually need to be moved to using some sort of async library, not boto3. This full_query implementation will need to become async as well. The one in Alternator won't. I don't want to make 800 tests in test/alternator harder to read and write just to make 7 tests in test/topology_experimental_raft happy.

Finally, my philosophy of these convenience functions is that they should be very short and very simple - the exact opposite of dtest's functions that take 10 different parameters to suite 10 different needs. Such short and simple functions should not "diverge" over time. They shouldn't even change (or change very rarely) - my whole point is that I don't want people to start adding more and more features to the existing functions.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem you're describing falls more into good library api design category than duplicate vs code reuse. Comparing bad design to different bad design doesn't prove much in absolute terms. Agree that one should not create functions like
def full_scan(table, ConsistentRead=True, Count=0, WithRestart=false, OnlyEven=true, **kwargs)
but something like
def full_scan(table)
def full_scan_and_count(table)
but that's orthogonal to code duplication. It's not very bad though as you said function is very small. Just for the future I'd expect alternator part of topology tests to grow - ideally alternator should have some more system testing not only single node api tests.

I am not fully grasping async thing in python, mostly I think it's a distraction currently. I don't see why we should have both async and non async functions (see coroutines vs continuations pattern). Either this or that, mixed state should be seen as temporary not the goal.

Copy link
Copy Markdown
Contributor Author

@nyh nyh Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's reconsider the library issue when it's clearer what the library should contain. My experience with dtest's collection of convenience functions is terrible (I don't think that this review is the place to discuss it), whereas in cql-pytest it mostly worked because the "library" is so small, deliberately.

Regarding async - the topology guys decided to use it, but it's not used in test/alternator. So I don't understand what you are proposing. Should we edit 800 Alternator test functions, almost 20,000 lines of code, to make everything async? Just so we could reuse a 7-line function (or maybe five such functions) between the projects? I'm not convinced.

@nuivall
Copy link
Copy Markdown
Member

nuivall commented Dec 13, 2023

so in theory there is no test checking if enabling tablets is doing anything in alternator?

Copy link
Copy Markdown
Member

@nuivall nuivall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (haven't looked closely at db/view/*)

nyh added 2 commits December 17, 2023 19:55
As explained in issue scylladb#16203, we cannot yet enable tablets on Alternator
keyspaces by default, because support for some of the features that
Alternator needs, such as CDC, is not yet available.
Nevertheless, to start testing Alternator integration with tablets,
we want to provide a way to enable tablets in Alternator for tests.

In this patch we add support for a tag, 'experimental:initial_tablets',
which if added on a table during creation, uses tablets for its keyspace.
The value of this tag is a numeric string, and it is exactly analogous
to the 'initial_tablets' property we have in CQL's NetworkTopologyStrategy.

We name this tag with the "experimental:" prefix to emphesize that it
is experimental, and the way to enable or disable tablets will probably
change later.

The new tag only has effect when added while *creating* a table.
Adding, deleting or changing it later on an existing table will have
no effect.

A later patch will have tests that use this tag to test Alternator with
tablets.

Refs scylladb#16203.

Signed-off-by: Nadav Har'El <[email protected]>
In commit 88a5dda, we fixed materialized
view creation to support tablets. We added to the function called to
create materialized views in CQL, prepare_new_view_announcement()
a missing call to the on_before_create_column_family() notifier that
creates tablets for this new view.

We have the same problem in Alternator when creating a view (GSI or LSI).
The Alternator code does not use prepare_new_view_announcement(), and
instead uses the lower-level function add_table_or_view_to_schema_mutation()
so it didn't get the call to the notifier, so we must add it here too.

Before this patch, creating an Alternator table with tablets (which has
become possible after the previous patch) fails with "Tablet map not found
for table <uuid>". With this patch, it works.

A test for materialized views in Alternator will come in a following
patch, and will test everything together - the CreateTable tag to use
tablets (from the previous patch), the LSI/GSI creation (fixed in this patch)
and the correct consistency of the LSI (fixed in the next patch).

Signed-off-by: Nadav Har'El <[email protected]>
@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 17, 2023

so in theory there is no test checking if enabling tablets is doing anything in alternator?

Good question. In the later patches, I use the tablet-enabling tag to enable tablets and then (before fixing the bugs) obvious bugs appeared, so I know that this tag had an effect. However, you're right that we should have a regression test that the tablet-enabling tag actually does something and continues to do something (and isn't, say, just outright ignored).

I'll add such a test in a followup PR.

nyh added 3 commits December 17, 2023 20:14
DynamoDB's *local* secondary index (LSI) allows strongly-consistent
reads from the materialized view, which must be able to read what was
previously written to the base. To support this, we need the view to
use the "synchronous_updates".

Previously, with vnodes, there was no need for using this option
explicitly, because an LSI has the same partition key as the base table
so the base and view replicas are the same, and the local writes are
done synchronously. But with tablets, this changes - there is no longer
a guarantee that the base and view tablets are located on the same node.
So to restore the strong consistency of LSIs when tablets are enabled,
this patch explicitly adds the "synchronous_updates" option to views
created by Alternator LSIs. We do *not* add this option for GSIs - those
do not support strongly-consistent reads.

This fix was tested by a test that will be introduced in the following
patches. The test showed that before this patch, it was possible that
reading with ConsistentRead=True from an LSI right after the base was
written would miss the new changes, but after this patch, it always
sees the new data in the LSI.

Fixes scylladb#16313.

Signed-off-by: Nadav Har'El <[email protected]>
It's difficult to write a test (as we plan to do in to in the next patch)
that verifies that synchronous view updates are indeed synchronous, i.e.,
that write with CL=QUORUM on the base-table write returns only after
CL=QUORUM was also achieved in the view table. The difficulty is that in a
fast test machine, even if the synchronous-view-update is completely buggy,
it's likely that by the time the test reads from the view, all view updates
will have been completed anyway.

So in this patch we introduce an injection point, for testing, named
"delay_before_remote_view_update", which adds a delay before the base
replica sends its update to the remote view replica (in case the view
replica is indeed remote). As usual, this injection point isn't
configurable - when enabled it adds a fixed (0.5 second) delay, on all
view updates on all tables.

The existing code used continuation-style Seastar programming, and the
addition of the injection point in this patch made it even uglier, so
in the next patch we will coroutine-ize this code.

Signed-off-by: Nadav Har'El <[email protected]>
In the previous patch we added a delay injection point (for testing)
in the view update code. Because the code was using continuation style,
this resulted in increased indentation and ugly repetition of captures.

So in this patch we coroutinize the code that waits for remote view
updates, making it simpler, shorter, and less indented.

Note that this function still uses continuations in one place:
The remote view update is still composed of two steps that need
to happen one after another, but we don't necessarily need to wait
for them to happen. This is easiest to do with chaining continuations,
and then either waiting or not waiting for the resulting future.

Signed-off-by: Nadav Har'El <[email protected]>
@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 17, 2023

Pushed a new version of this PR, addressing a few minor comments but not all of them (see my replies to your comments).

I already have a house of cards built on top of this PR (e.g., fixing secondary indexes on tablets) so would like to get this PR in to build on it, and will continue to improve using some more suggestions from this review.

Please note that in particular, the test I have here for Alternator LSI consistency is awful, and depending on a 10-node cluster's random tablet placement to reproduce - sometimes - the inconsistency. In followup patches (for CQL LSIs) I developed a much better test setup, using just 2 nodes and deliberately moving tablets (so the base's tablets are on one node, the view's on the other nodes) for 100% reliable test failure, so later I'll send a followup to this PR to replace its test by a better one. But first I'd like to get this PR in before I get a stack overflow in my brain :-)

@scylladb-promoter
Copy link
Copy Markdown
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 1 hr 52 min
  • Builder: spider5.cloudius-systems.com

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 18, 2023

so in theory there is no test checking if enabling tablets is doing anything in alternator?

Good question. In the later patches, I use the tablet-enabling tag to enable tablets and then (before fixing the bugs) obvious bugs appeared, so I know that this tag had an effect. However, you're right that we should have a regression test that the tablet-enabling tag actually does something and continues to do something (and isn't, say, just outright ignored).

I'll add such a test in a followup PR.

In a followup patch f4b21d9 I rewrite the Alternator LSI test, and as part of it I verify that the test really sees that the Alternator table really uses tablets (because the test would trivially pass before tablets).

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 18, 2023

Hi @scylladb/scylla-maint @tgrabiec there is a lot of discussion here but two positive reviews and no negative reviews so please consider merging this PR - I already have several newer PRs for tablets+secondary-index support waiting to be rebased on this one.

tgrabiec added a commit that referenced this pull request Dec 18, 2023
The pull requests adds support for tablets in Alternator, and particularly focuses in getting Alternator's GSI and LSI (i.e., materialized views)  to work.

After this series support for tablets in Alternator _mostly_ work, but not completely:
1. CDC doesn't yet work with tablets, and Alternator needs to provide CDC (known as "DynamoDB Streams").
2. Alternator's TTL feature was not tested with tablets, and probably doesn't work because it assumes the replication map belongs to a keyspace.

Because of these reasons, Alternator does not yet use tablets by default and it needs to be enabled explicitly be adding an experimental tag to the new table. This will allow us to test Alternator with tablets even before it is ready for the limelight.

Fixes #16203
Fixes #16313

Closes #16353

* github.com:scylladb/scylladb:
  mv, tablets, alternator: test for Alternator LSI with tablets
  mv: coroutinize wait code for remote view updates
  mv, test: add injection point to delay remove view update
  alternator: explicitly request synchronous updates for LSI
  alternator: fix view creation when using tablets
  alternator: add experimental method to create a table with tablets
@kbr-scylla
Copy link
Copy Markdown
Contributor

Dequeued, causes next promotion failures
https://jenkins.scylladb.com/job/scylla-master/job/next/6963/artifact/testlog/aarch64/debug/topology_experimental_raft.test_mv_tablets.1.log

=================================== FAILURES ===================================
____________________ test_tablet_alternator_lsi_consistency ____________________

self = <test.pylib.manager_client.ManagerClient object at 0xffff05231910>
servers_num = 10, cmdline = None
config = {'alternator_port': 8000, 'alternator_write_isolation': 'only_rmw_uses_lwt'}
property_file = None, start = True, expected_error = None

    async def servers_add(self, servers_num: int = 1,
                          cmdline: Optional[List[str]] = None,
                          config: Optional[dict[str, Any]] = None,
                          property_file: Optional[dict[str, Any]] = None,
                          start: bool = True,
                          expected_error: Optional[str] = None) -> [ServerInfo]:
        """Add new servers concurrently"""
        assert servers_num > 0, f"servers_add: cannot add {servers_num} servers, servers_num must be positive"
    
        try:
            data = self._create_server_add_data(None, cmdline, config, property_file, start, expected_error)
            data['servers_num'] = servers_num
>           server_infos = await self.client.put_json("/cluster/addservers", data, response_type="json",
                                                      timeout=ScyllaServer.TOPOLOGY_TIMEOUT * servers_num)

test/pylib/manager_client.py:266: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <test.pylib.rest_client.UnixRESTClient object at 0xffff057c1510>
resource_uri = '/cluster/addservers'
data = {'config': {'alternator_port': 8000, 'alternator_write_isolation': 'only_rmw_uses_lwt'}, 'servers_num': 10, 'start': True}
host = None, port = None, params = None, response_type = 'json', timeout = 10000

    async def put_json(self, resource_uri: str, data: Optional[Mapping] = None, host: Optional[str] = None,
                       port: Optional[int] = None, params: Optional[dict[str, str]] = None,
                       response_type: Optional[str] = None, timeout: Optional[float] = None) -> Any:
>       ret = await self._fetch("PUT", resource_uri, response_type = response_type, host = host,
                                port = port, params = params, json = data, timeout = timeout)

test/pylib/rest_client.py:104: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <test.pylib.rest_client.UnixRESTClient object at 0xffff057c1510>
method = 'PUT', resource = '/cluster/addservers', response_type = 'json'
host = None, port = None, params = None
json = {'config': {'alternator_port': 8000, 'alternator_write_isolation': 'only_rmw_uses_lwt'}, 'servers_num': 10, 'start': True}
timeout = 10000

    async def _fetch(self, method: str, resource: str, response_type: Optional[str] = None,
                     host: Optional[str] = None, port: Optional[int] = None,
                     params: Optional[Mapping[str, str]] = None,
                     json: Optional[Mapping] = None, timeout: Optional[float] = None) -> Any:
        # Can raise exception. See https://docs.aiohttp.org/en/latest/web_exceptions.html
        assert method in ["GET", "POST", "PUT", "DELETE"], f"Invalid HTTP request method {method}"
        assert response_type is None or response_type in ["text", "json"], \
                f"Invalid response type requested {response_type} (expected 'text' or 'json')"
        # Build the URI
        port = port if port else self.default_port if hasattr(self, "default_port") else None
        port_str = f":{port}" if port else ""
        assert host is not None or hasattr(self, "default_host"), "_fetch: missing host for " \
                "{method} {resource}"
        host_str = host if host is not None else self.default_host
        uri = self.uri_scheme + "://" + host_str + port_str + resource
        logging.debug(f"RESTClient fetching {method} {uri}")
    
        client_timeout = ClientTimeout(total = timeout if timeout is not None else 300)
        async with request(method, uri,
                           connector = self.connector if hasattr(self, "connector") else None,
                           params = params, json = json, timeout = client_timeout) as resp:
            if resp.status != 200:
                text = await resp.text()
>               raise HTTPError(uri, resp.status, params, json, text)
E               test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/addservers, params: None, json: {'start': True, 'config': {'alternator_port': 8000, 'alternator_write_isolation': 'only_rmw_uses_lwt'}, 'servers_num': 10}, body:
E               failed to start the node, timeout reached, server_id 498, IP 127.58.46.2, workdir scylla-498, host_id 26143419-27e5-4467-aa5a-041969135711, cql [not connected]
E               Check the log files:
E               /scylladir/testlog/aarch64/test.py.debug-release-dev.log
E               /scylladir/testlog/aarch64/debug/scylla-498.log

test/pylib/rest_client.py:69: HTTPError

The above exception was the direct cause of the following exception:

manager = <test.pylib.manager_client.ManagerClient object at 0xffff05231910>

    @pytest.mark.asyncio
    @skip_mode('release', 'error injections are not supported in release mode')
    async def test_tablet_alternator_lsi_consistency(manager: ManagerClient):
        """A reproducer for a bug where Alternator LSI was not using synchronous
           view updates when tablets are enabled, which could cause strongly-
           consistent read of the LSI to miss the data just written to the base.
           We use a fairly large cluster to increase the chance that the tablet
           randomization results in non-local pairing of base and view tablets.
           We could make this test fail more reliably and only need 4 nodes if
           we had an API to control the tablet placement.
           We also use the "delay_before_remote_view_update" injection point
           to add a delay to the view update - without it it's almost impossible
           to reproduce this issue on a fast machine.
           Reproduces #16313.
        """
>       servers = await manager.servers_add(10, config=alternator_config)

test/topology_experimental_raft/test_mv_tablets.py:134: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <test.pylib.manager_client.ManagerClient object at 0xffff05231910>
servers_num = 10, cmdline = None
config = {'alternator_port': 8000, 'alternator_write_isolation': 'only_rmw_uses_lwt'}
property_file = None, start = True, expected_error = None

    async def servers_add(self, servers_num: int = 1,
                          cmdline: Optional[List[str]] = None,
                          config: Optional[dict[str, Any]] = None,
                          property_file: Optional[dict[str, Any]] = None,
                          start: bool = True,
                          expected_error: Optional[str] = None) -> [ServerInfo]:
        """Add new servers concurrently"""
        assert servers_num > 0, f"servers_add: cannot add {servers_num} servers, servers_num must be positive"
    
        try:
            data = self._create_server_add_data(None, cmdline, config, property_file, start, expected_error)
            data['servers_num'] = servers_num
            server_infos = await self.client.put_json("/cluster/addservers", data, response_type="json",
                                                      timeout=ScyllaServer.TOPOLOGY_TIMEOUT * servers_num)
        except Exception as exc:
>           raise Exception("Failed to add servers") from exc
E           Exception: Failed to add servers

It looks like the test simply timed out waiting to bootstrap 10 nodes.
https://jenkins.scylladb.com/job/scylla-master/job/next/6963/artifact/testlog/aarch64/debug/scylla-498.log

WARN  2023-12-19 08:40:45,997 seastar - Seastar compiled with default allocator, --memory option won't take effect
...
INFO  2023-12-19 08:57:19,830 [shard 0: gms] storage_service - raft topology: streaming completed

This node was booting for 17 minutes. This included waiting for other nodes to boot, which took 14 minutes.

INFO  2023-12-19 08:41:48,118 [shard 0:strm] storage_service - raft topology: join: request to join placed, waiting for the response from the topology coordinator
INFO  2023-12-19 08:41:48,184 [shard 0:strm] raft_group0 - Server 26143419-27e5-4467-aa5a-041969135711 is starting group 0 with id 6042e7b0-9e4a-11ee-b539-40f2e8d7f197
INFO  2023-12-19 08:55:46,826 [shard 0:stmt] schema_tables - Creating keyspace system_auth

I'm not sure if we can afford having such long tests in test.py.

Another problem is that this test file already has multiple long test cases. They should be in separate test files or they cannot be run concurrently.

Regarding the failing test itself, the suggestion from comment makes sense to me

           We could make this test fail more reliably and only need 4 nodes if
           we had an API to control the tablet placement.

maybe we could also use an error injection instead of API?

@tgrabiec
Copy link
Copy Markdown
Contributor

We could make this test fail more reliably and only need 4 nodes if we had an API to control the tablet placement.

We have an API to control tablet placement, you can move tablets with manager.api.move_tablet()

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 19, 2023

@kbr-scylla pushed a new version with a last patch that adds the tablets+MV tests (the entire file...) to the skip-on-debug list, so we don't have the timeout on the arm+debug build.

I hate this solution, but I can't fix everything in this PR - better solutions will have to come later. As I already noted, there is already a PR in the pipeline to switch this test to using just two nodes. But a 10 nodes test shouldn't have broken the world.

@kbr-scylla
Copy link
Copy Markdown
Contributor

@nyh why not use the skip_mode marker which works per test case?

BTW. the file should be probably split into multiple files anyway so the test cases can run concurrently.

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 19, 2023

BTW. the file should be probably split into multiple files anyway so the test cases can run concurrently.

No, I think that's a mistake. There's elegance in putting related things in the same file (this isn't Java), and this is NOT a big test file, and not even the biggest test file (test_tablets.py is bigger).

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 19, 2023

@nyh why not use the skip_mode marker which works per test case?

Because I didn't realize it works (I used it for skipping release - copying Tomek's use of it, but didn't realize I could use it for debug as well).

Ok, I'll push a new version like that.

This patch adds a test (in the topology test framework) for issue scylladb#16313 -
the bug where Alternator LSI must use synchronous view updates but didn't.
This test fails with high probability (around 50%) before the previous patch,
which fixed this bug - and passes consistently after the patch (I ran it
100 times and it didn't fail even once).

This is the first test in the topology framework that uses the DynamoDB
API and not CQL. This required a couple of tiny convenience functions,
which are introduced in the only test file that uses them - but if we
want we can later move them out to a library file.

Unfortunately, the standard AWS SDK for Python - boto3 - is *not*
asynchronous, so this test is also not really asynchronous, and will
block the event loop while making requests to Alternator. However,
for now it doesn't matter (we do NOT run multiple tests in the same
event loop), and if it ever matters, I mentioned a couple of options
what we can do in a comment.

Because this test uses a 10-node cluster, it is skipped in debug-mode
runs. In a later patch we will replace it by a more efficent - and
more reliable - 2-node test.

Refs scylladb#16313

Signed-off-by: Nadav Har'El <[email protected]>
@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 19, 2023

Pushed a new version that uses @skip_mode instead of suite.yaml to skip the 10-node test.
As I said time and again, this 10-node test will soon disappear anyway (PR #16440 gets rid of it) so no need to continue discussing how awful a 10-node test is.

@scylladb-promoter
Copy link
Copy Markdown
Contributor

🔴 CI State: FAILURE

✅ - Build
❌ - dtest
✅ - Unit Tests

Failed Tests (1/23482):

Build Details:

  • Duration: 2 hr 6 min
  • Builder: spider6.cloudius-systems.com

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 19, 2023

Failed Tests (1/23482):

test_disablebinary_and_disablegossip 🔍

A known flaky test: #16219
I'll restart the CI

@scylladb-promoter
Copy link
Copy Markdown
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 2 hr 30 min
  • Builder: spider3.cloudius-systems.com

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 19, 2023

Hallelujah.

@kbr-scylla please consider re-merging this PR. It now has a @skip_mode to skip the 10-node test on debug mode. And as I said, this 10-node test will soon disappear anyway (PR #16440 gets rid of it, and I can push these next PRs as soon as this one finally gets in)

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Dec 20, 2023

@scylladb/scylla-maint merge ping.
To summarize yesterday events: This PR was already merged, but dequeued by @kbr-scylla because the 10-node test sometimes took more than the test.py timeout in the debug build. So in the new version of this PR, this test is marked for not running on debug build.

@kbr-scylla
Copy link
Copy Markdown
Contributor

Qd

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Alternator's LSI must use synchronous view updates - even on tablets Allow creating an Alternator table that uses tablets instead of vnodes

8 participants