I've seen this up-replication failure on a few roachtests. The easiest to reproduce so far is using acceptance/cli/node-status. A failure usually occurs within 5 runs:
~ COCKROACH_STORAGE_ENGINE=pebble roachtest run --local --artifacts artifacts.pebble --count 10 '^acceptance/cli/node-status$'
...
22:10:26 cluster.go:2490: still waiting for full replication
That then repeats forever and we never up-replicate. crdb_internal.ranges on this cluster shows that a number of ranges have only a single replica. The range debug page for one such range shows:
[n1,status] add - missing replica need=3, have=1, priority=10001.00
[n1,status] next replica action: add
[n1,status] allocate candidates: [ s2, valid:true, fulldisk:false, necessary:false, diversity:0.00, converges:0, balance:1, rangeCount:13, queriesPerSecond:1.57]
[n1,status] add target: s2, valid:true, fulldisk:false, necessary:false, diversity:0.00, converges:0, balance:1, rangeCount:13, queriesPerSecond:1.57
[n1,status] allocate candidates: []
[n1,status] error simulating allocator on replica [n1,s1,r1/1:/{Min-System/NodeL…}]: avoid up-replicating to fragile quorum: 0 of 2 live stores are able to take a new replica for the range (2 already have a replica); likely not enough nodes in cluster
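If I'm reading this right, the "avoid up-replicating to fragile quorum" error comes from a guard that refuses to move to an even replica count unless the allocator can simulate reaching the next odd count. The arithmetic can be sketched roughly like this (a hypothetical sketch; canUpReplicate and its parameters are made up for illustration, not CockroachDB's actual allocator code):

```go
package main

import "fmt"

// canUpReplicate sketches the "fragile quorum" guard suggested by the
// error above. Moving from have to have+1 replicas is fine when have+1
// is odd; when have+1 is even (losing any single node would stall the
// range), we additionally require enough live, replica-free stores to
// continue on to the next odd count.
// Hypothetical sketch, not CockroachDB's actual allocator code.
func canUpReplicate(have, need, liveStores int) bool {
	willHave := have + 1
	if willHave >= need || willHave%2 != 0 {
		return true
	}
	// This add plus the simulated follow-up add each need a live store
	// that does not already hold a replica of the range.
	available := liveStores - have
	return available >= 2
}

func main() {
	// The situation in the log: have=1, need=3, only 2 stores deemed live.
	fmt.Println(canUpReplicate(1, 3, 2)) // false: up-replication blocked
	// With all 3 nodes live, the add can proceed.
	fmt.Println(canUpReplicate(1, 3, 3)) // true
}
```

With only two stores considered live and one of them already holding a replica, the simulated second add finds no candidate, which matches the empty `allocate candidates: []` line.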
I think this means that n1 is not considering n2 or n3 to be live. Need to dig in further to figure out what is going on, and why this is specific to Pebble. The same roachtest, when run on RocksDB, passes 10 out of 10 times.
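For reference, the single-replica ranges mentioned above can be listed with a query along these lines (a sketch against crdb_internal.ranges, which I believe exposes a replicas array column):

```sql
-- List ranges that currently have only one replica.
SELECT range_id, start_pretty, end_pretty, replicas
  FROM crdb_internal.ranges
 WHERE array_length(replicas, 1) = 1;
```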