Use smarter probes for SolrCloud and SolrPrometheusExporter #511
Conversation
Would love some feedback on what users have been setting for their liveness/readiness/startup probes!
This ticket is important since the e2e tests will time out with the unoptimized probes currently in use. We need to merge this before we can merge the e2e tests.
At the moment we're not tuning the liveness/readiness/startup probes from the defaults the operator provides.
Just some brain-dump on my gut reaction, very much open to discussion and/or correction. :)
Thanks for reaching out about considerations. As Josh mentioned, we're not deviating from the current defaults in 0.6.0; I just wanted to add on to what Josh said above.

This application is a little different from other apps because all the nodes are clustered and route traffic among themselves based on what data they hold for which collections. This is important because it changes the contextual reason behind readiness probes. Readiness probes mostly affect the "status" of the endpoint associating the pod with any services it's a part of. Since traffic can still hit nodes even when they're pulled from the service, the readiness probe doesn't really have much effect on incoming requests. I'd have to think a little more about the intended value of readiness with how Solr works. To be candid, I'm not sure whether the operator configures communication between nodes via services or directly with pod names; if it's configured with services, then readiness probes could impact communication between nodes in the SolrCloud.

As for liveness probes, the main value I see is when restarting the Java process would actually resolve a problem. Liveness should really only trigger when the node cannot perform the most basic tasks but the process still appears to be running. Things that come to mind are an inability to read from disk, causing many 500s while still technically "running", or "runaway threads" that bomb the process and should be terminated to recover service availability. That said, it should be a blend of a service-critical KPI with how long you'd be comfortable having the pod unavailable. All of the above are just considerations for the original intent of these tools and should definitely be weighed against how SolrCloud is intended to operate.

One more consideration we commonly see in the wild is negative feedback loops for liveness probes. This one is pretty tricky, but the simplest example is an application experiencing too much load triggering its liveness probe, which then restarts the app. While the app is restarting, it pushes more load onto the remaining pods, causing them to trigger their own liveness probes and cascade. These are usually the result of aggressive settings on liveness probes. In Java, the largest contributing factor is generally resource starvation, like under-allotting CPU or memory, which can lead to GC issues. Anyway, not sure if the last paragraph is very actionable, but it's something to keep in mind when choosing between tuning for aggressive restarts and accepting more time to wait before restarting a service.
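To make the feedback-loop point concrete, here's a minimal Go sketch (using k8s.io/api types from a recent release, where the embedded probe field is ProbeHandler) of a deliberately conservative liveness probe paired with explicit resource requests. The image tag, probe path, port, and timing values are illustrative assumptions for discussion, not settings from this PR or the operator's defaults.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// conservativeSolrContainer sketches a Solr container with generous liveness
// timing and explicit resource requests, the combination discussed above for
// avoiding GC pressure turning into a restart cascade. All values are
// placeholders, not operator defaults.
func conservativeSolrContainer() corev1.Container {
	return corev1.Container{
		Name:  "solrcloud-node",
		Image: "solr:8.11", // placeholder image tag
		Resources: corev1.ResourceRequirements{
			// Sizing the heap/CPU properly is the first defense against
			// liveness failures caused by resource starvation.
			Requests: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("2"),
				corev1.ResourceMemory: resource.MustParse("4Gi"),
			},
			Limits: corev1.ResourceList{
				corev1.ResourceMemory: resource.MustParse("4Gi"),
			},
		},
		LivenessProbe: &corev1.Probe{
			ProbeHandler: corev1.ProbeHandler{
				HTTPGet: &corev1.HTTPGetAction{
					Path: "/solr/admin/info/system", // assumed probe path
					Port: intstr.FromInt(8983),      // assumed Solr port
				},
			},
			// Generous timing: tolerate roughly two minutes of slow or
			// failed responses under load before restarting a busy node.
			PeriodSeconds:    30,
			TimeoutSeconds:   10,
			FailureThreshold: 4,
		},
	}
}

func main() {
	c := conservativeSolrContainer()
	fmt.Println(c.Name, c.LivenessProbe.FailureThreshold, c.LivenessProbe.PeriodSeconds)
}
```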
Thanks for both of your thoughts! As for the question of whether the liveness/readiness endpoints should be the same, I do not think they should be, eventually. I like
I think this is hard to do since a lot of the request handling could be updates and queries for specific collections, which we can't know... But definitely agree it would be great to get to this in the end.
(This is the actual bulk of the changes happening in this PR.) Yeah, I've upped the number of checks in the startup probe to 10, giving the pod 1 minute to become healthy. I think that should be enough for the Solr server to start. For the others I agree: I have the liveness probe set to 3 checks at 20s intervals, giving us 40s-1m of "downtime" before taking down the pod. Readiness is set to 2 checks at 10s intervals, so if ZK isn't available for 10-20 seconds, requests won't be routed to that node. But if it's a blip, one good readiness check and it's back in the list.
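For reference, here's roughly what those numbers look like expressed as corev1.Probe values (again assuming a recent k8s.io/api, where the handler field is ProbeHandler). The HTTP path and port are assumptions based on Solr's usual /solr/admin/info/system endpoint on 8983, and the 6-second startup period is inferred from "10 checks, 1 minute"; the actual values in the PR may differ.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// solrProbes sketches the probe settings described in the comment above.
// Details marked as assumptions may not match what the PR ships.
func solrProbes() (startup, liveness, readiness *corev1.Probe) {
	handler := corev1.ProbeHandler{
		HTTPGet: &corev1.HTTPGetAction{
			Scheme: corev1.URISchemeHTTP,
			Path:   "/solr/admin/info/system", // assumed probe path
			Port:   intstr.FromInt(8983),      // assumed Solr port
		},
	}

	// Startup: up to 10 failed checks before the pod is killed; with an
	// assumed 6s period that gives Solr roughly a minute to start.
	startup = &corev1.Probe{
		ProbeHandler:     handler,
		PeriodSeconds:    6, // assumed; the comment only says "10 checks, 1 minute"
		FailureThreshold: 10,
	}

	// Liveness: 3 checks at 20s intervals, i.e. ~40-60s of failures
	// before the node is restarted.
	liveness = &corev1.Probe{
		ProbeHandler:     handler,
		PeriodSeconds:    20,
		FailureThreshold: 3,
	}

	// Readiness: 2 checks at 10s intervals, so a node is pulled from the
	// service after 10-20s of failures and returns after one good check.
	readiness = &corev1.Probe{
		ProbeHandler:     handler,
		PeriodSeconds:    10,
		FailureThreshold: 2,
		SuccessThreshold: 1,
	}
	return startup, liveness, readiness
}

func main() {
	s, l, r := solrProbes()
	fmt.Printf("startup: %d x %ds, liveness: %d x %ds, readiness: %d x %ds\n",
		s.FailureThreshold, s.PeriodSeconds,
		l.FailureThreshold, l.PeriodSeconds,
		r.FailureThreshold, r.PeriodSeconds)
}
```

The key design point is the split of responsibilities: the startup probe absorbs slow boots so the liveness probe can stay relatively aggressive about genuinely stuck processes, while the readiness probe reacts quickly to blips without ever restarting the node.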
So for node-level endpoints (headless service and individual pod services for ingresses), the readiness check is not used for routing, since we use the
Yeah, this is definitely not something we want to take lightly. We only want to restart Solr nodes when absolutely necessary.
Resolves #510
This is a WIP but the goal is to have smarter probes that work better for most installations of Solr and the Prometheus Exporter.