INFO 2024-01-14 15:05:12,202 [shard 0: gms] repair - bootstrap_with_repair: started with keyspace=system_auth, nr_ranges=48
INFO 2024-01-14 15:05:12,203 [shard 0: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: sync data for keyspace=system_auth, status=started
INFO 2024-01-14 15:05:12,203 [shard 0: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: Started to repair 1 out of 3 tables in keyspace=system_auth, table=roles, table_id=5bc52802-de25-35ed-aeab-188eecebb090, repair_reason=bootstrap
INFO 2024-01-14 15:05:12,204 [shard 1: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: Started to repair 1 out of 3 tables in keyspace=system_auth, table=roles, table_id=5bc52802-de25-35ed-aeab-188eecebb090, repair_reason=bootstrap
WARN 2024-01-14 15:05:12,249 [shard 0: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: get_row_diff: got error from node=127.54.45.48, keyspace=system_auth, table=roles, range=(-648230420654190980,862260533620703931], error=std::out_of_range (regular column id 0 >= 0)
WARN 2024-01-14 15:05:12,250 [shard 0: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: shard=0, keyspace=system_auth, cf=roles, range=(-648230420654190980,862260533620703931], got error in row level repair: std::out_of_range (regular column id 0 >= 0)
...
WARN 2024-01-14 15:05:12,384 [shard 0: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: sync data for keyspace=system_auth, status=failed: std::runtime_error ({shard 0: std::runtime_error (repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: 1 out of 48 ranges failed, keyspace=system_auth, tables={roles, role_members, role_attributes}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=false, failed_because=seastar::nested_exception: std::runtime_error (Failed to repair for keyspace=system_auth, cf=roles, range=(-648230420654190980,862260533620703931]) (while cleaning up after std::out_of_range (regular column id 0 >= 0)))})
ERROR 2024-01-14 15:05:12,384 [shard 0: gms] storage_service - raft topology: raft_topology_cmd failed with: std::runtime_error ({shard 0: std::runtime_error (repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: 1 out of 48 ranges failed, keyspace=system_auth, tables={roles, role_members, role_attributes}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=false, failed_because=seastar::nested_exception: std::runtime_error (Failed to repair for keyspace=system_auth, cf=roles, range=(-648230420654190980,862260533620703931]) (while cleaning up after std::out_of_range (regular column id 0 >= 0)))})
INFO 2024-01-14 15:05:12,433 [shard 0:main] init - Shutting down group 0 usage in local storage
The log above is from CI run https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/5759/, where, while bringing up a Scylla cluster of two nodes more-or-less concurrently, one Scylla node failed to boot (full log: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/5759/artifact/testlog/x86_64/debug/scylla-1410.log).
Apparently, the topology operation of adding the second node caused a repair of system_auth.roles, a repair that failed because of a std::out_of_range (regular column id 0 >= 0). I don't know what this means; maybe it rings a bell for you, @asias?
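For what it's worth, here is my guess at what the message itself says. This is a minimal sketch (in Python for brevity; Scylla's actual check is C++ and this is not its code): "regular column id 0 >= 0" has the shape of a bounds check "<requested id> >= <regular column count>", i.e. the peer was asked for regular column 0 of a schema that currently has zero regular columns.

```python
def regular_column_at(regular_columns, cid):
    # Bounds check with the same message shape as the repair error.
    if cid >= len(regular_columns):
        raise IndexError(f"regular column id {cid} >= {len(regular_columns)}")
    return regular_columns[cid]

try:
    # A schema with no regular columns yet (e.g. the table's schema has
    # not reached this node) fails on the very first lookup:
    regular_column_at([], 0)
except IndexError as e:
    print(e)  # prints: regular column id 0 >= 0
```

If that reading is right, it would fit the schema-race guess below: the repairing node asks a peer about columns of a table whose schema the peer doesn't have yet.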
Please note that, as I said, the two nodes are coming up more-or-less concurrently, using the topology test suite's add_servers() function. I'm just guessing here, but could we have a race where node B thinks it should repair a table (system_auth.roles) from node A, but node A doesn't have this table's schema yet? A guessed sketch of the failing setup follows below.

This race or bug is probably rare, because we haven't seen these boot failures before (at least, I don't think we have).
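For reference, here is the shape of the test I have in mind. The `manager` fixture and the add_servers() signature are my assumptions about the topology suite's API, not verified against it:

```python
import pytest

# A guessed reproducer shape, not a verified one: add_servers() follows
# the topology test suite function mentioned above; the exact fixture
# name and signature are my assumption.
@pytest.mark.asyncio
async def test_concurrent_two_node_boot(manager):
    # Bring up a two-node cluster more-or-less concurrently, as the
    # failing CI test does. The suspected race: node B decides to run a
    # bootstrap repair of system_auth.roles against node A before node A
    # has that table's schema, so node B's boot fails as in the log above.
    await manager.add_servers(2)
```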
Thanks to @kbr-scylla for analyzing this failure in #16670 (comment).