Solr pods graceful shutdown #322

### Context
In Kubernetes, several things happen between the moment you issue `kubectl delete pod` and the moment the Pod is actually deleted:
1. the Pod's endpoint is removed from the Endpoints object;
2. if the Pod has a `preStop` hook, the hook runs before `SIGTERM` is sent;
3. the control plane notifies kube-proxy, CoreDNS, and the ingress controller so that they deregister the Pod's IP address and stop sending traffic to it.
   **In parallel, once the `preStop` hook finishes,** the app receives `SIGTERM`. If it handles the signal, it starts a graceful shutdown; otherwise it keeps running until `terminationGracePeriodSeconds` (default 30s, counted from the start of termination, so it includes the time spent in `preStop`) expires, at which point `SIGKILL` is sent;
4. the Pod is deleted.

### Graceful shutdown
A graceful shutdown requires that no traffic is sent to the IP of a Pod that has already been deleted. Step 3 above is supposed to ensure this, but kube-proxy, CoreDNS, and the ingress controller may be busy with other work, so **there is no guarantee that the IP will be removed from their state before the Pod is gone**. How long deregistration takes varies: some components react in under a second, others take a bit longer.

**Race condition**
As mentioned, deregistering the IP from kube-proxy, CoreDNS, and the ingress controller happens in parallel with `SIGTERM` being sent to the app, which opens the door to race conditions. One of them: what if the Pod is deleted before its IP is deregistered? In that case, traffic may be sent to a non-existent IP.


### Issue statement
The issue is the graceful shutdown of SolrCloud. Currently, we use a `preStop` hook that runs `solr stop -p 8983` (which, behind the scenes, sends `SIGQUIT` to the process) to stop the Solr instance running in the background on port 8983.
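
For reference, the current hook would sit in the Pod spec roughly as sketched below. This is only an illustration: the Pod name, container name, and image tag are hypothetical, and the manifest actually generated for SolrCloud may look different. It also assumes the `solr` script is on the container's `PATH`.

```yaml
# Hypothetical Pod spec; metadata.name, the container name and the image tag are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: solr-example
spec:
  # Kubernetes default; the countdown starts when termination begins,
  # so it also covers the time spent in the preStop hook.
  terminationGracePeriodSeconds: 30
  containers:
    - name: solr
      image: solr:8            # assumed image, for illustration only
      ports:
        - containerPort: 8983
      lifecycle:
        preStop:
          exec:
            # Current behaviour described in this issue: stop Solr immediately.
            command: ["solr", "stop", "-p", "8983"]
```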

But, as described above, the `preStop` hook (step 2) runs before kube-proxy, CoreDNS, and the ingress controller have received the event to deregister the IP address from their local state (step 3). The hook therefore stops the Solr instance before its IP is deregistered, so traffic may still be routed to a Solr instance that has already stopped.

A few ways to handle that:
- allow the user to pass a custom `lifecycle.preStop` value that overrides the default `solr stop -p 8983`;
- allow the user to pass an additional command that runs before `solr stop -p 8983` in the `preStop` hook, e.g. `lifecycle.preStop.cmd: sleep 30`, which after merging yields something like `sleep 30 && solr stop -p 8983`;
- a bit harsh: do nothing and wait for `SIGKILL`. This improves the chances that kube-proxy, CoreDNS, and the ingress controller deregister the IP within `terminationGracePeriodSeconds` (default 30s), so that no traffic is sent to the Pod by the time `SIGKILL` fires and the Pod is forcefully deleted.
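
As a rough sketch of the second option, the merged hook could end up looking like the fragment below. The container name and the 60-second grace period are assumptions chosen for illustration. The command is wrapped in `sh -c` because an `exec` hook does not run through a shell, so `&&` would not be interpreted otherwise.

```yaml
# Hypothetical fragment of a Pod spec; values are illustrative, not a recommendation.
spec:
  # Must be longer than the sleep plus the time Solr needs to stop,
  # because the grace period also covers the preStop hook.
  terminationGracePeriodSeconds: 60
  containers:
    - name: solr
      lifecycle:
        preStop:
          exec:
            # Wait for the IP to be deregistered, then stop Solr.
            command: ["sh", "-c", "sleep 30 && solr stop -p 8983"]
```

Note that with the default `terminationGracePeriodSeconds` of 30s, a 30-second sleep would use up the whole grace period and `SIGKILL` would arrive before `solr stop` gets a chance to finish, so the two values need to be tuned together.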

What other options are available?

cc @giannisbetas

