Problem
Recently we started seeing lots of "unhealthy" instances on the distributor ring in Loki clusters during normal rollouts. These "unhealthy" members are the ones that have already left the ring but failed to unregister themselves from the underlying KV store (Consul in this case).
I did some investigation and would like to record the behavior here.
Investigation
So what happens when a member of the ring leaves?
It needs to `unregister()` itself from the underlying KV store. This `unregister` basically means removing its instance ID from the shared key value called `collectors/distributor` (or another key, depending on the component).
This key value `collectors/distributor` is shared among all the members of the ring, and every member needs to update it before it can successfully leave the ring.
Shared access to `collectors/distributor` is provided by something called CAS access, which gives us a synchronization mechanism for a key value shared by multiple members of the ring.
CAS (Check-And-Set) works based on the current `ModifyIndex`, a piece of metadata about the key value you are trying to set (or update).
This index is a monotonically increasing counter that gets bumped every time any member updates this particular KV.
The CAS operation is similar to a `PUT` operation, except that it takes an additional parameter called `cas`, an integer representing the `ModifyIndex` of the key value you are trying to update.
Two cases:
- If `cas` is set to 0 (or left empty), the KV is updated only if it doesn't already exist.
- If `cas` is set to a non-zero value, the KV is updated only if that value matches the current `ModifyIndex` of the KV.
For example (I ran this in our clusters):
curl -s http://127.0.0.1:8500/v1/kv/collectors/distributor | jq .[0].ModifyIndex
8149486
Here the `ModifyIndex` is 8149486. If I try to CAS with a lower value, it returns false:
```
curl -s --request PUT --data 10 http://127.0.0.1:8500/v1/kv/collectors/distributor?cas=100 && echo
false
```
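Roughly the same read-then-CAS flow expressed with the Consul Go client, as a minimal sketch (the new value is a placeholder; this is not the actual dskit code):

```go
package casdemo

import (
	"fmt"

	"github.com/hashicorp/consul/api"
)

// casUpdate reads the current value of collectors/distributor, then writes it
// back using the ModifyIndex we read as the cas value. If any other member
// updated the key in between, Consul rejects the write and ok is false.
func casUpdate(kv *api.KV) error {
	pair, _, err := kv.Get("collectors/distributor", nil)
	if err != nil {
		return err
	}
	if pair == nil {
		return fmt.Errorf("key not found")
	}

	pair.Value = []byte("updated ring descriptor goes here") // placeholder payload

	ok, _, err := kv.CAS(pair, nil) // CAS uses pair.ModifyIndex as the cas parameter
	if err != nil {
		return err
	}
	fmt.Println("CAS succeeded:", ok)
	return nil
}
```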
This is what is happening in our distributor ring. A distributor in the "LEAVING" state tries a maximum of 10 times to set the value with the `ModifyIndex` it fetched previously. But in the meantime some other distributor may have updated the key, making that `ModifyIndex` out of date. That is how these distributors run out of retries.
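A rough sketch of that unregister retry loop (identifiers like `maxCasRetries` and `removeInstance` are illustrative placeholders, not the actual dskit names):

```go
package ringsketch

import (
	"fmt"

	"github.com/hashicorp/consul/api"
)

const maxCasRetries = 10 // the current hard-coded retry limit

// removeInstance stands in for "drop this instance from the ring descriptor".
func removeInstance(desc []byte, instanceID string) []byte { return desc }

// unregister re-reads the key, removes this instance, and tries a CAS write.
// Every time another member updates the key first, the CAS returns false and
// one retry is burned; with enough concurrent leavers, all 10 attempts fail.
func unregister(kv *api.KV, instanceID string) error {
	for i := 0; i < maxCasRetries; i++ {
		pair, _, err := kv.Get("collectors/distributor", nil)
		if err != nil {
			return err
		}
		if pair == nil {
			return fmt.Errorf("key not found")
		}

		pair.Value = removeInstance(pair.Value, instanceID)

		ok, _, err := kv.CAS(pair, nil)
		if err != nil {
			return err
		}
		if ok {
			return nil // successfully left the ring
		}
		// CAS returned false: our ModifyIndex is stale, retry with a fresh read.
	}
	return fmt.Errorf("failed to unregister %s after %d CAS attempts", instanceID, maxCasRetries)
}
```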
Proposal
The ideal solution is a bit tricky, since we need some kind of synchronization on the same key value `collectors/distributor` shared by all the members of the ring, which is exactly what `ModifyIndex` gives us anyway.
So tweaking the `MaxCasRetry` (and/or `CasRetryDelay`) configs would be a good option IMHO. Unfortunately, dskit currently doesn't allow tweaking these retry arguments, but I think we can make them configurable. Setting the retry count to the number of replicas always resulted in no unhealthy members (tested with even 100 distributor replicas).
So I think it's better to:
- Make these flags configurable (a rough sketch of what that could look like follows this list)
- Document this behavior in the `ring` and `kv` packages
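A minimal sketch of what making these knobs configurable could look like (the struct, flag names, and the delay default are hypothetical; dskit does not expose such flags today):

```go
package consulsketch

import (
	"flag"
	"time"
)

// CasConfig is a hypothetical config block for the CAS retry behavior.
type CasConfig struct {
	MaxCasRetries int
	CasRetryDelay time.Duration
}

// RegisterFlags registers the hypothetical knobs, keeping today's
// hard-coded 10 retries as the default.
func (cfg *CasConfig) RegisterFlags(f *flag.FlagSet, prefix string) {
	f.IntVar(&cfg.MaxCasRetries, prefix+"consul.max-cas-retries", 10,
		"Maximum number of CAS attempts when updating a ring key.")
	f.DurationVar(&cfg.CasRetryDelay, prefix+"consul.cas-retry-delay", time.Second,
		"Delay between CAS attempts (illustrative default).")
}
```

This would let operators set the retry count to roughly the number of replicas, which is what eliminated the unhealthy members in the tests above.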
NOTES
- If CAS returns `false` (because of an unmatched `ModifyIndex`), it is treated as a non-error case; e.g. this is handled in the dskit code here.
- The default retry of 10 times may work for the ingester (it has an expensive and time-consuming graceful shutdown process), so most likely one of the CAS retries would succeed. That's not the case for the distributor: it can shut down almost instantly, so there is a high chance of lots of distributors leaving the ring and accessing the shared key value at the same time.
- While investigating, I also found that the error returned by the unregister operation is lost if the ring is wrapped in a `BasicService` and the service doesn't have any listeners.
- CAS doc from the Consul website.
Any thoughts, suggestions or criticisms are welcome!