
view: revert cleanup filter that doesn't work with tablets #16670

Merged
scylladb-promoter merged 2 commits into scylladb:master from nyh:fix-16598a
Jan 16, 2024


Conversation

@nyh
Contributor

@nyh nyh commented Jan 7, 2024

The goal of this PR is to fix Scylla so that the dtest test_mvs_populating_from_existing_data, which starts failing when tablets are enabled, passes.

The main fix (the second patch) is reverting code which doesn't work with tablets, and I explain why I think this code was not necessary in the first place.

Fixes #16598

@nyh nyh requested review from bhalevy and eliransin January 7, 2024 14:12
@scylladb-promoter
Contributor

🔴 CI State: FAILURE

❌ - Build

Build Details:

  • Duration: 26 min
  • Builder: spider1.cloudius-systems.com

@nyh
Contributor Author

nyh commented Jan 7, 2024

The self-contained-header check for dht/partition_filter.hh failed. I'll need to fix it and push a new version.

@scylladb-promoter
Contributor

@nyh
Contributor Author

nyh commented Jan 8, 2024

* [test_raft_remove_ignore_nodes](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/5597/testReport/junit/%28root%29/test_raft_ignore_nodes/Tests___Unit_Tests___test_raft_remove_ignore_nodes) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+test_raft_remove_ignore_nodes)

Known flaky test: #16420 :-(
I'll restart the CI.

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 3 hr 15 min
  • Builder: i-0170842cc1954f731 (m5ad.12xlarge)

@nyh
Contributor Author

nyh commented Jan 10, 2024

@bhalevy please review, as it is your code which I'm reverting here.

@bhalevy
Member

bhalevy commented Jan 10, 2024

This PR title is confusing since it does not fix the dtest code; rather, it fixes Scylla so the dtest will stop failing.
I think we'd better focus on the change to Scylla.
The dtest passing is a nice side effect of that :)

@denesb denesb (Contributor) left a comment


Maybe we need a test which confirms that MV update generation indeed ignores data not owned by the current node.

// have had other updates.
std::move(sstables.begin(), sstables.end(), std::back_inserter(_sstables_with_tables[t]));
// Sleep a bit, to avoid a tight loop repeatedly spamming the log with the same message.
seastar::sleep(std::chrono::seconds(1)).get();
Contributor


Maybe not in this PR, but we should make this an exponential back-off, with a ceiling of, say, 1 minute.
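
For illustration, a minimal sketch of such a back-off (a hypothetical helper, not part of this PR; it assumes the retry loop runs inside a seastar::thread, like the surrounding code, so the blocking .get() is allowed):

#include <seastar/core/abort_source.hh>
#include <seastar/core/sleep.hh>
#include <algorithm>
#include <chrono>

// Hypothetical sketch: retry the generator's work with an exponential
// back-off capped at one minute, instead of a fixed one-second sleep.
template <typename Work>
void retry_with_backoff(Work work, seastar::abort_source& as) {
    auto delay = std::chrono::milliseconds(100);               // initial delay
    const auto ceiling = std::chrono::milliseconds(std::chrono::minutes(1));
    while (!as.abort_requested()) {
        try {
            work();                                            // placeholder for processing staging sstables
            delay = std::chrono::milliseconds(100);            // success: reset the back-off
        } catch (...) {
            seastar::sleep(delay).get();                       // blocking wait, OK inside seastar::thread
            delay = std::min(delay * 2, ceiling);              // double the delay, up to the ceiling
        }
    }
}

With something like this the delay grows 100ms, 200ms, 400ms, ... up to one minute while the error persists, and resets as soon as an iteration succeeds.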

Contributor Author


I don't know if it's important. Basically, if the generator fails after a long time (e.g., after working for 10 minutes), it doesn't matter if we sleep for 1 second or 60 seconds. If it fails after a short time (in this case, it failed after a fraction of a millisecond), again it doesn't matter if we sleep for 1 or 60 seconds - both are negligible performance-wise, and in both cases the log spam is at a manageable level.

// have had other updates.
std::move(sstables.begin(), sstables.end(), std::back_inserter(_sstables_with_tables[t]));
// Sleep a bit, to avoid a tight loop repeatedly spamming the log with the same message.
seastar::sleep(std::chrono::seconds(1)).get();
Member


I think that @kostja wanted to standardize on 0.1 seconds (100ms) sleeps.
Regarding the log spam, shouldn't we use logger::rate_limit?

That said, if you think that a 1s sleep is more appropriate for this use case, I'm good with that.
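
For context, rate-limited logging in Seastar looks roughly like the sketch below (the rate_limit API details are recalled from memory and should be double-checked against seastar/util/log.hh; the logger name and function are made up for the example):

#include <seastar/util/log.hh>
#include <chrono>

static seastar::logger vug_log("view_update_generator");      // hypothetical logger name

void report_staging_failure() {
    // Emit at most one warning per second from this call site, no matter how
    // often it is reached, without making the caller sleep just to keep logs quiet.
    static thread_local seastar::logger::rate_limit rl{std::chrono::seconds(1)};
    vug_log.log(seastar::log_level::warn, rl,
                "error processing staging sstables, will retry");
}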

Contributor Author


Where did he want to standardize this?
In my response to Botond above, I tried to explain why I think it doesn't matter what the sleep is (within reason); I don't think it needs to be smart or even standardized. It's not something that a user will ever encounter (it's not a user-facing timeout or delay or anything). The only purpose of this sleep is to reduce log spam - from a million lines per second to just one per second - and 0.1 or 10 per second would be just as acceptable.

inject_failure("view_update_generator_consume_staging_sstable");
auto result = staging_sstable_reader.consume_in_thread(view_updating_consumer(*this, s, std::move(permit), *t, sstables, _as, staging_sstable_reader_handle),
dht::incremental_owned_ranges_checker::make_partition_filter(_db.get_keyspace_local_ranges(s->ks_name())));
auto result = staging_sstable_reader.consume_in_thread(view_updating_consumer(*this, s, std::move(permit), *t, sstables, _as, staging_sstable_reader_handle));
Member


nit: s/with tables/with tablets/ in commit message.

Member


Otherwise, LGTM

@nyh
Contributor Author

nyh commented Jan 14, 2024

Maybe we need a test which confirms that MV update generation indeed ignores data not owned by the current node.

We already have a bunch of dtests that check multi-node MV operations. The specific dtest that this PR fixes is one of these tests - it involves view building and concurrent range ownership movements (tablets moving around).

To be honest, if I thought we had a problem, I would write a new test to confirm it, but in this case I don't think we have a problem: if a node does not own a base range, it simply has no way to send the view update to anyone - if it's the Nth base replica it sends the update to the Nth view replica, but if it's not a base replica at all, it has no one to send anything to. So I don't think such a problem can even exist.
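
To make this concrete, here is a toy, self-contained model (not Scylla's actual get_view_natural_endpoint()) of the pairing described above: the Nth base replica targets the Nth view replica, and a node that is not a base replica finds no N and sends nothing.

#include <algorithm>
#include <cassert>
#include <optional>
#include <string>
#include <vector>

using endpoint = std::string;

// Toy model: the Nth base replica sends its view update to the Nth view
// replica; a node that is not a base replica has no target at all.
std::optional<endpoint> view_update_target(const endpoint& me,
                                           const std::vector<endpoint>& base_replicas,
                                           const std::vector<endpoint>& view_replicas) {
    auto it = std::find(base_replicas.begin(), base_replicas.end(), me);
    if (it == base_replicas.end()) {
        return std::nullopt;      // not a base replica for this key: no update to send
    }
    auto n = std::distance(base_replicas.begin(), it);
    return view_replicas.at(n);   // pair the Nth base replica with the Nth view replica
}

int main() {
    std::vector<endpoint> base = {"n1", "n2", "n3"};
    std::vector<endpoint> view = {"n4", "n5", "n6"};
    assert(view_update_target("n2", base, view) == endpoint("n5"));   // 2nd base -> 2nd view
    assert(!view_update_target("n7", base, view));                    // not an owner: nothing to send
}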

The "view update generator" is responsible for generating view updates
for staging sstables (such as coming from repair). If the processing
fails, the code retries - immediately. If there is some persistent bug,
such as issue scylladb#16598, we will have a tight loop of error messages,
potentially a gigabyte of identical messages every second.

In this patch we simply add a sleep of one second after view update
generation fails before retrying. We can still get many identical
error messages if there is some bug, but not more than one per second.

Refs scylladb#16598.

Signed-off-by: Nadav Har'El <[email protected]>
@nyh
Contributor Author

nyh commented Jan 14, 2024

This PR title is confusing since it does not fix the dtest code; rather, it fixes Scylla so the dtest will stop failing. I think we'd better focus on the change to Scylla. The dtest passing is a nice side effect of that :)

The title of the actual patch (the second patch) was exactly like you wanted: "view: revert cleanup filter that doesn't work with tablets". I just had a silly title for the merge commit. I'll change it.

@nyh nyh changed the title from "Fix dtest test_mvs_populating_from_existing_data with tablets" to "view: revert cleanup filter that doesn't work with tablets" on Jan 14, 2024
This patch reverts commit 10f8f13 from
November 2022. That commit added to the "view update generator" (the code
which builds view updates for staging sstables) a filter that ignores
ranges that do not belong to this node. However,

1. I believe this filter was never necessary, because the view update
   code already silently ignores base updates which do not belong to
   this replica (see get_view_natural_endpoint()). After all, the view
   update needs to know that this replica is the Nth owner of the base
   update to send its update to the Nth view replica, but if no such
   N exists, no view update is sent.

2. The code introduced for that filter used a per-keyspace replication
   map, which was ok for vnodes but no longer works for tablets, and
   causes the operation using it to fail.

3. The filter was used every time the "view update generator" was used,
   regardless of whether any cleanup is necessary or not, so every
   such operation would fail with tablets. So for example the dtest
   test_mvs_populating_from_existing_data fails with tablets:
     * This test has view building in parallel with automatic tablet
       movement.
     * Tablet movement is streaming.
     * When streaming happens before view building has finished, the
       streamed sstables get "view update generator" run on them.
       This causes the problematic code to be called.

Before this patch, the dtest test_mvs_populating_from_existing_data
fails when tablets are enabled. After this patch, it passes.

Fixes scylladb#16598

Signed-off-by: Nadav Har'El <[email protected]>
@nyh
Contributor Author

nyh commented Jan 14, 2024

Rebased, and fixed a typo in one commit message and the title of the pull request.

I did not change the 1-second sleep in the first patch. This first patch wasn't even the main point of this PR - I just had a hard time debugging this issue because, before the fix in the second patch, I would get millions of log messages which completely overwhelmed test.py. The first patch just makes this manageable - a failure in view_update_generator will now be reported once a second, not a million times a second.

@scylladb-promoter
Contributor

🔴 CI State: FAILURE

✅ - Build
❌ - Unit Tests
✅ - dtest

Failed Tests (2/23733):

Build Details:

  • Duration: 1 hr 45 min
  • Builder: spider1.cloudius-systems.com

@nyh
Contributor Author

nyh commented Jan 14, 2024

* [test_tablet_cql_lsi](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/5759/testReport/junit/%28root%29/test_mv_tablets/Tests___Unit_Tests___test_tablet_cql_lsi) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+test_tablet_cql_lsi)

This test didn't change in this PR. It seems to be flakiness in the topology test framework, in the concurrent server creation function servers_add(). I remember we saw this problem in the past but can't find the issue to update :-( I'll keep looking, but @kbr-scylla @kostja maybe you remember?

I'll retry the CI.

You can see the failure in https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/5759/testReport/junit/(root)/non-boost%20tests/Tests___Unit_Tests___topology_experimental_raft_test_mv_tablets_debug_1/ :

=================================== FAILURES ===================================
_____________________________ test_tablet_cql_lsi ______________________________

self = <test.pylib.manager_client.ManagerClient object at 0x7fdcbe8a2c90>
servers_num = 2, cmdline = None, config = None, property_file = None
start = True, expected_error = None

    async def servers_add(self, servers_num: int = 1,
                          cmdline: Optional[List[str]] = None,
                          config: Optional[dict[str, Any]] = None,
                          property_file: Optional[dict[str, Any]] = None,
                          start: bool = True,
                          expected_error: Optional[str] = None) -> [ServerInfo]:
        """Add new servers concurrently.
        This function can be called only if the cluster uses consistent topology changes, which support
        concurrent bootstraps. If your test does not fulfill this condition and you want to add multiple
        servers, you should use multiple server_add calls."""
        assert servers_num > 0, f"servers_add: cannot add {servers_num} servers, servers_num must be positive"
    
        try:
            data = self._create_server_add_data(None, cmdline, config, property_file, start, expected_error)
            data['servers_num'] = servers_num
>           server_infos = await self.client.put_json("/cluster/addservers", data, response_type="json",
                                                      timeout=ScyllaServer.TOPOLOGY_TIMEOUT * servers_num)

test/pylib/manager_client.py:270: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <test.pylib.rest_client.UnixRESTClient object at 0x7fdcbe8a1090>
resource_uri = '/cluster/addservers', data = {'servers_num': 2, 'start': True}
host = None, port = None, params = None, response_type = 'json', timeout = 2000

    async def put_json(self, resource_uri: str, data: Optional[Mapping] = None, host: Optional[str] = None,
                       port: Optional[int] = None, params: Optional[dict[str, str]] = None,
                       response_type: Optional[str] = None, timeout: Optional[float] = None) -> Any:
>       ret = await self._fetch("PUT", resource_uri, response_type = response_type, host = host,
                                port = port, params = params, json = data, timeout = timeout)

test/pylib/rest_client.py:104: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <test.pylib.rest_client.UnixRESTClient object at 0x7fdcbe8a1090>
method = 'PUT', resource = '/cluster/addservers', response_type = 'json'
host = None, port = None, params = None
json = {'servers_num': 2, 'start': True}, timeout = 2000

    async def _fetch(self, method: str, resource: str, response_type: Optional[str] = None,
                     host: Optional[str] = None, port: Optional[int] = None,
                     params: Optional[Mapping[str, str]] = None,
                     json: Optional[Mapping] = None, timeout: Optional[float] = None) -> Any:
        # Can raise exception. See https://docs.aiohttp.org/en/latest/web_exceptions.html
        assert method in ["GET", "POST", "PUT", "DELETE"], f"Invalid HTTP request method {method}"
        assert response_type is None or response_type in ["text", "json"], \
                f"Invalid response type requested {response_type} (expected 'text' or 'json')"
        # Build the URI
        port = port if port else self.default_port if hasattr(self, "default_port") else None
        port_str = f":{port}" if port else ""
        assert host is not None or hasattr(self, "default_host"), "_fetch: missing host for " \
                "{method} {resource}"
        host_str = host if host is not None else self.default_host
        uri = self.uri_scheme + "://" + host_str + port_str + resource
        logging.debug(f"RESTClient fetching {method} {uri}")
    
        client_timeout = ClientTimeout(total = timeout if timeout is not None else 300)
        async with request(method, uri,
                           connector = self.connector if hasattr(self, "connector") else None,
                           params = params, json = json, timeout = client_timeout) as resp:
            if resp.status != 200:
                text = await resp.text()
>               raise HTTPError(uri, resp.status, params, json, text)
E               test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/addservers, params: None, json: {'start': True, 'servers_num': 2}, body:
E               failed to start the node, server_id 1410, IP 127.54.45.49, workdir scylla-1410, host_id 993789d7-6c5b-4648-a699-cbb76d4255c5, cql [not connected]
E               Check the log files:
E               /scylladir/testlog/x86_64/test.py.debug-release-dev.log
E               /scylladir/testlog/x86_64/debug/scylla-1410.log

test/pylib/rest_client.py:69: HTTPError

The above exception was the direct cause of the following exception:

manager = <test.pylib.manager_client.ManagerClient object at 0x7fdcbe8a2c90>

    @pytest.mark.asyncio
    @skip_mode('release', 'error injections are not supported in release mode')
    async def test_tablet_cql_lsi(manager: ManagerClient):
        """A simple reproducer for issue #16371 where CQL LSI (local secondary
           index) was not using synchronous view updates when tablets are enabled,
           contrary to what the documentation for local SI says. In other words,
           we could write to a table with CL=QUORUM and then try to read with
           CL=QUORUM using the index - and not find the data.
    
           We use a cluster of just two nodes and RF=1, and control the tablets
           so all base tablets will be in node 0 and all view tablets will be
           in node 1, to ensure that the view update is remote and therefore
           not synchronous by default. To make the test failure even more
           likely on a fast machine, we use the "delay_before_remote_view_update"
           injection point to add a delay to the view update more than usual.
           Reproduces #16371.
        """
>       servers = await manager.servers_add(2)

test/topology_experimental_raft/test_mv_tablets.py:303: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <test.pylib.manager_client.ManagerClient object at 0x7fdcbe8a2c90>
servers_num = 2, cmdline = None, config = None, property_file = None
start = True, expected_error = None

    async def servers_add(self, servers_num: int = 1,
                          cmdline: Optional[List[str]] = None,
                          config: Optional[dict[str, Any]] = None,
                          property_file: Optional[dict[str, Any]] = None,
                          start: bool = True,
                          expected_error: Optional[str] = None) -> [ServerInfo]:
        """Add new servers concurrently.
        This function can be called only if the cluster uses consistent topology changes, which support
        concurrent bootstraps. If your test does not fulfill this condition and you want to add multiple
        servers, you should use multiple server_add calls."""
        assert servers_num > 0, f"servers_add: cannot add {servers_num} servers, servers_num must be positive"
    
        try:
            data = self._create_server_add_data(None, cmdline, config, property_file, start, expected_error)
            data['servers_num'] = servers_num
            server_infos = await self.client.put_json("/cluster/addservers", data, response_type="json",
                                                      timeout=ScyllaServer.TOPOLOGY_TIMEOUT * servers_num)
        except Exception as exc:
>           raise Exception("Failed to add servers") from exc
E           Exception: Failed to add servers

test/pylib/manager_client.py:273: Exception

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests
✅ - dtest

Build Details:

  • Duration: 1 hr 58 min
  • Builder: spider7.cloudius-systems.com

@kbr-scylla
Contributor

It seems to be flakiness in the topology test framework, in the concurrent server creation function servers_add().

There are no known issues with this that haven't been fixed already.

The failure report points to one of the servers failing to boot:

E               failed to start the node, server_id 1410, IP 127.54.45.49, workdir scylla-1410, host_id 993789d7-6c5b-4648-a699-cbb76d4255c5, cql [not connected]
E               Check the log files:
E               /scylladir/testlog/x86_64/test.py.debug-release-dev.log
E               /scylladir/testlog/x86_64/debug/scylla-1410.log

Logs of this server: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/5759/artifact/testlog/x86_64/debug/scylla-1410.log

WARN  2024-01-14 15:05:12,249 [shard 0: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: get_row_diff: got error from node=127.54.45.48, keyspace=system_auth, table=roles, range=(-648230420654190980,862260533620703931], error=std::out_of_range (regular column id 0 >= 0)
WARN  2024-01-14 15:05:12,250 [shard 0: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: shard=0, keyspace=system_auth, cf=roles, range=(-648230420654190980,862260533620703931], got error in row level repair: std::out_of_range (regular column id 0 >= 0)
INFO  2024-01-14 15:05:12,264 [shard 1: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: Started to repair 2 out of 3 tables in keyspace=system_auth, table=role_members, table_id=0ecdaa87-f8fb-3e60-88d1-74fb36fe5c0d, repair_reason=bootstrap
INFO  2024-01-14 15:05:12,265 [shard 0: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: stats: repair_reason=bootstrap, keyspace=system_auth, tables={roles, role_members, role_attributes}, ranges_nr=16, round_nr=16, round_nr_fast_path_already_synced=15, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=82, tx_hashes_nr=0, rx_hashes_nr=1, duration=0.062315915 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={{127.54.45.48, 178}, {127.54.45.49, 0}}, row_from_disk_nr={{127.54.45.48, 1}, {127.54.45.49, 0}}, row_from_disk_bytes_per_sec={{127.54.45.48, 0.00272409}, {127.54.45.49, 0}} MiB/s, row_from_disk_rows_per_sec={{127.54.45.48, 16.0473}, {127.54.45.49, 0}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
WARN  2024-01-14 15:05:12,266 [shard 0: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: 1 out of 48 ranges failed, keyspace=system_auth, tables={roles, role_members, role_attributes}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=false, failed_because=seastar::nested_exception: std::runtime_error (Failed to repair for keyspace=system_auth, cf=roles, range=(-648230420654190980,862260533620703931]) (while cleaning up after std::out_of_range (regular column id 0 >= 0))
INFO  2024-01-14 15:05:12,322 [shard 1: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: Started to repair 3 out of 3 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=bootstrap
INFO  2024-01-14 15:05:12,383 [shard 1: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: stats: repair_reason=bootstrap, keyspace=system_auth, tables={roles, role_members, role_attributes}, ranges_nr=16, round_nr=48, round_nr_fast_path_already_synced=48, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=240, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.17948368 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={{127.54.45.48, 0}, {127.54.45.49, 0}}, row_from_disk_nr={{127.54.45.48, 0}, {127.54.45.49, 0}}, row_from_disk_bytes_per_sec={{127.54.45.48, 0}, {127.54.45.49, 0}} MiB/s, row_from_disk_rows_per_sec={{127.54.45.48, 0}, {127.54.45.49, 0}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
INFO  2024-01-14 15:05:12,383 [shard 1: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: completed successfully, keyspace=system_auth
WARN  2024-01-14 15:05:12,384 [shard 0: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: sync data for keyspace=system_auth, status=failed: std::runtime_error ({shard 0: std::runtime_error (repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: 1 out of 48 ranges failed, keyspace=system_auth, tables={roles, role_members, role_attributes}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=false, failed_because=seastar::nested_exception: std::runtime_error (Failed to repair for keyspace=system_auth, cf=roles, range=(-648230420654190980,862260533620703931]) (while cleaning up after std::out_of_range (regular column id 0 >= 0)))})
ERROR 2024-01-14 15:05:12,384 [shard 0: gms] storage_service - raft topology: raft_topology_cmd failed with: std::runtime_error ({shard 0: std::runtime_error (repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: 1 out of 48 ranges failed, keyspace=system_auth, tables={roles, role_members, role_attributes}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=false, failed_because=seastar::nested_exception: std::runtime_error (Failed to repair for keyspace=system_auth, cf=roles, range=(-648230420654190980,862260533620703931]) (while cleaning up after std::out_of_range (regular column id 0 >= 0)))})

got error in row level repair: std::out_of_range (regular column id 0 >= 0)
I don't recall anything like this -- looks like a new issue.

Curious that it happened when repairing keyspace=system_auth, table=roles - and what even is that error...

Node 127.54.45.48 (https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/5759/artifact/testlog/x86_64/debug/scylla-1409.log) didn't report anything interesting in its logs during this repair time.

@kbr-scylla
Contributor

I also recommend, in general, doing a shallow 5-minute investigation (like I did above) before concluding that this is a known issue --- there can be trillions of different reasons why bootstrapping a server might fail, and different root causes should be different issues IMO ("adding servers fails" is too generic). I haven't seen this std::out_of_range during repair before.

@nyh
Contributor Author

nyh commented Jan 14, 2024

You're making a mountain out of a molehill. I was trying to babysit this PR, which contains a real fix that really is about work assigned to me, not to debug random unrelated issues in areas of the code I'm not familiar with. I actually did spend 5 minutes, found that add_servers failure, and "showed my work" above. This fit with what you told me about a known bug in add_servers (which I didn't know had already been fixed). The insistence of the topology tests developers on not putting all the evidence of failures in a single file (which I insisted on when I wrote test/cql-pytest/run) meant I obviously missed some of the evidence in other files, which it seems you are more experienced at finding than I was. I don't think I should be executed for that :-)

@scylladb-promoter scylladb-promoter merged commit a9844ed into scylladb:master Jan 16, 2024
@nyh
Contributor Author

nyh commented Jan 17, 2024

Opened a new issue #16821 on the CI failure that we saw above (and @kbr-scylla analyzed).



Development

Successfully merging this pull request may close these issues.

View-updates after streaming doesn't work with tablets

5 participants