INFO 2024-01-14 15:05:12,202 [shard 0: gms] repair - bootstrap_with_repair: started with keyspace=system_auth, nr_ranges=48
INFO 2024-01-14 15:05:12,203 [shard 0: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: sync data for keyspace=system_auth, status=started
INFO 2024-01-14 15:05:12,203 [shard 0: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: Started to repair 1 out of 3 tables in keyspace=system_auth, table=roles, table_id=5bc52802-de25-35ed-aeab-188eecebb090, repair_reason=bootstrap
INFO 2024-01-14 15:05:12,204 [shard 1: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: Started to repair 1 out of 3 tables in keyspace=system_auth, table=roles, table_id=5bc52802-de25-35ed-aeab-188eecebb090, repair_reason=bootstrap
WARN 2024-01-14 15:05:12,249 [shard 0: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: get_row_diff: got error from node=127.54.45.48, keyspace=system_auth, table=roles, range=(-648230420654190980,862260533620703931], error=std::out_of_range (regular column id 0 >= 0)
WARN 2024-01-14 15:05:12,250 [shard 0: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: shard=0, keyspace=system_auth, cf=roles, range=(-648230420654190980,862260533620703931], got error in row level repair: std::out_of_range (regular column id 0 >= 0)
...
WARN 2024-01-14 15:05:12,384 [shard 0: gms] repair - repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: sync data for keyspace=system_auth, status=failed: std::runtime_error ({shard 0: std::runtime_error (repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: 1 out of 48 ranges failed, keyspace=system_auth, tables={roles, role_members, role_attributes}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=false, failed_because=seastar::nested_exception: std::runtime_error (Failed to repair for keyspace=system_auth, cf=roles, range=(-648230420654190980,862260533620703931]) (while cleaning up after std::out_of_range (regular column id 0 >= 0)))})
ERROR 2024-01-14 15:05:12,384 [shard 0: gms] storage_service - raft topology: raft_topology_cmd failed with: std::runtime_error ({shard 0: std::runtime_error (repair[c9bd731e-73eb-4acc-a412-eb12841925e4]: 1 out of 48 ranges failed, keyspace=system_auth, tables={roles, role_members, role_attributes}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=false, failed_because=seastar::nested_exception: std::runtime_error (Failed to repair for keyspace=system_auth, cf=roles, range=(-648230420654190980,862260533620703931]) (while cleaning up after std::out_of_range (regular column id 0 >= 0)))})
INFO 2024-01-14 15:05:12,433 [shard 0:main] init - Shutting down group 0 usage in local storage
The log above is from CI run https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/5759/, where, while bringing up a Scylla cluster of two nodes more-or-less concurrently, one Scylla node failed to boot (full log: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/5759/artifact/testlog/x86_64/debug/scylla-1410.log).
Apparently, the topology operation of adding the second node caused a repair of system_auth.roles, a repair that failed because of a std::out_of_range (regular column id 0 >= 0). I don't know what this means; maybe it rings a bell for you, @asias?
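For what it's worth, here is my guess at what the message itself says. This is a minimal sketch (in Python for brevity; Scylla's actual check is C++ and this is not its code): "regular column id 0 >= 0" has the shape of a bounds check "<requested id> >= <regular column count>", i.e. the peer was asked for regular column 0 of a schema that currently has zero regular columns.

```python
def regular_column_at(regular_columns, cid):
    # Bounds check with the same message shape as the repair error.
    if cid >= len(regular_columns):
        raise IndexError(f"regular column id {cid} >= {len(regular_columns)}")
    return regular_columns[cid]

try:
    # A schema with no regular columns yet (e.g. the table's schema has
    # not reached this node) fails on the very first lookup:
    regular_column_at([], 0)
except IndexError as e:
    print(e)  # prints: regular column id 0 >= 0
```

If that reading is right, it would fit the schema-race guess below: the repairing node asks a peer about columns of a table whose schema the peer doesn't have yet.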
Please note that, as I said, the two nodes are coming up more-or-less concurrently, using the topology test suite's add_servers() function. I'm just guessing here, but could we have a race where node B thinks it should repair a table (system_auth.roles) from node A, but node A doesn't have this table's schema yet? A guessed sketch of the failing setup follows below.

This race or bug is probably rare, because we haven't seen these boot failures before (at least, I don't think we have).
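For reference, here is the shape of the test I have in mind. The `manager` fixture and the add_servers() signature are my assumptions about the topology suite's API, not verified against it:

```python
import pytest

# A guessed reproducer shape, not a verified one: add_servers() follows
# the topology test suite function mentioned above; the exact fixture
# name and signature are my assumption.
@pytest.mark.asyncio
async def test_concurrent_two_node_boot(manager):
    # Bring up a two-node cluster more-or-less concurrently, as the
    # failing CI test does. The suspected race: node B decides to run a
    # bootstrap repair of system_auth.roles against node A before node A
    # has that table's schema, so node B's boot fails as in the log above.
    await manager.add_servers(2)
```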
Thanks to @kbr-scylla for analyzing this failure in #16670 (comment).