Use smarter probes for SolrCloud and SolrPrometheusExporter #511
Conversation
Would love some feedback on what users have been setting for their liveness/readiness/startup probes!
This ticket is important since the e2e tests will time out with the unoptimized probes currently in use. We need to merge this before we can merge the e2e tests.
At the moment we're not tuning the liveness/readiness/startup probes from the defaults the operator provides.
Just some brain-dump on my gut reaction, very much open to discussion and/or correction. :)
Thanks for reaching out about considerations. As Josh mentioned, we're not deviating from the current defaults in 0.6.0; I just wanted to add on to what Josh said above.

This application is a little different from other apps because all the nodes are clustered and route traffic among themselves based on what data they hold for which collections. This is important because it changes the contextual reason behind readiness probes. Readiness probes mostly affect the "status" of the endpoint associating the pod with any services it's a part of. Since traffic can still hit nodes even when they're pulled from the service, the readiness probe doesn't really have much effect on incoming requests. I'd have to think a little more about the intended value of readiness with how Solr works. To be candid, I'm not sure whether the operator configures communication between nodes via services or directly with pod names; if it's configured with services, then readiness probes could impact communication between nodes in the SolrCloud.

As for liveness probes, the main value I see is when restarting the Java process would actually resolve a problem. Liveness should really only trigger when the node cannot perform the most basic tasks but the process still appears to be running. Things that come to mind are an inability to read from disk, causing many 500s while still technically "running", or "runaway threads" that bomb the process and should be terminated to recover service availability. That said, it should be a blend of a service-critical KPI with how long you'd be comfortable having the pod unavailable. All of the above are just considerations for the original intent of these tools and should definitely be weighed against how SolrCloud is intended to operate.

One more consideration we commonly see in the wild is negative feedback loops for liveness probes. This one is pretty tricky, but the simplest example is an application experiencing too much load triggering its liveness probe, which then restarts the app. While the app is restarting, it pushes more load onto the remaining pods, causing them to trigger their own liveness probes and cascade. These are usually the result of aggressive settings on liveness probes. In Java, the largest contributing factor is generally resource starvation, like under-allotting CPU or memory, which can lead to GC issues. Anyway, not sure if the last paragraph is very actionable, but it's something to keep in mind when choosing between tuning for aggressive restarts and accepting more time to wait before restarting a service.
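To make the feedback-loop point concrete, here's a minimal Go sketch (using k8s.io/api types from a recent release, where the embedded probe field is ProbeHandler) of a deliberately conservative liveness probe paired with explicit resource requests. The image tag, probe path, port, and timing values are illustrative assumptions for discussion, not settings from this PR or the operator's defaults.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// conservativeSolrContainer sketches a Solr container with generous liveness
// timing and explicit resource requests, the combination discussed above for
// avoiding GC pressure turning into a restart cascade. All values are
// placeholders, not operator defaults.
func conservativeSolrContainer() corev1.Container {
	return corev1.Container{
		Name:  "solrcloud-node",
		Image: "solr:8.11", // placeholder image tag
		Resources: corev1.ResourceRequirements{
			// Sizing the heap/CPU properly is the first defense against
			// liveness failures caused by resource starvation.
			Requests: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("2"),
				corev1.ResourceMemory: resource.MustParse("4Gi"),
			},
			Limits: corev1.ResourceList{
				corev1.ResourceMemory: resource.MustParse("4Gi"),
			},
		},
		LivenessProbe: &corev1.Probe{
			ProbeHandler: corev1.ProbeHandler{
				HTTPGet: &corev1.HTTPGetAction{
					Path: "/solr/admin/info/system", // assumed probe path
					Port: intstr.FromInt(8983),      // assumed Solr port
				},
			},
			// Generous timing: tolerate roughly two minutes of slow or
			// failed responses under load before restarting a busy node.
			PeriodSeconds:    30,
			TimeoutSeconds:   10,
			FailureThreshold: 4,
		},
	}
}

func main() {
	c := conservativeSolrContainer()
	fmt.Println(c.Name, c.LivenessProbe.FailureThreshold, c.LivenessProbe.PeriodSeconds)
}
```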
Thanks for both of your thoughts! As for the question of whether the liveness/readiness endpoints should be the same, I do not think they should be, eventually. I like
I think this is hard to do since a lot of the request handling could be updates and queries for specific collections, which we can't know... But definitely agree it would be great to get to this in the end.
(This is the actual bulk of the changes happening in this PR.) Yeah, I've upped the number of checks in the startup probe to 10, giving the pod 1 minute to become healthy. I think that should be enough for the Solr server to start. For the others I agree: I have the liveness probe set to 3 checks at 20s intervals, giving us 40s-1m of "downtime" before taking down the pod. Readiness is set to 2 checks at 10s intervals, so if ZK isn't available for 10-20 seconds, requests won't be routed to that node. But if it's a blip, one good readiness check and it's back in the list.
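For reference, here's roughly what those numbers look like expressed as corev1.Probe values (again assuming a recent k8s.io/api, where the handler field is ProbeHandler). The HTTP path and port are assumptions based on Solr's usual /solr/admin/info/system endpoint on 8983, and the 6-second startup period is inferred from "10 checks, 1 minute"; the actual values in the PR may differ.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// solrProbes sketches the probe settings described in the comment above.
// Details marked as assumptions may not match what the PR ships.
func solrProbes() (startup, liveness, readiness *corev1.Probe) {
	handler := corev1.ProbeHandler{
		HTTPGet: &corev1.HTTPGetAction{
			Scheme: corev1.URISchemeHTTP,
			Path:   "/solr/admin/info/system", // assumed probe path
			Port:   intstr.FromInt(8983),      // assumed Solr port
		},
	}

	// Startup: up to 10 failed checks before the pod is killed; with an
	// assumed 6s period that gives Solr roughly a minute to start.
	startup = &corev1.Probe{
		ProbeHandler:     handler,
		PeriodSeconds:    6, // assumed; the comment only says "10 checks, 1 minute"
		FailureThreshold: 10,
	}

	// Liveness: 3 checks at 20s intervals, i.e. ~40-60s of failures
	// before the node is restarted.
	liveness = &corev1.Probe{
		ProbeHandler:     handler,
		PeriodSeconds:    20,
		FailureThreshold: 3,
	}

	// Readiness: 2 checks at 10s intervals, so a node is pulled from the
	// service after 10-20s of failures and returns after one good check.
	readiness = &corev1.Probe{
		ProbeHandler:     handler,
		PeriodSeconds:    10,
		FailureThreshold: 2,
		SuccessThreshold: 1,
	}
	return startup, liveness, readiness
}

func main() {
	s, l, r := solrProbes()
	fmt.Printf("startup: %d x %ds, liveness: %d x %ds, readiness: %d x %ds\n",
		s.FailureThreshold, s.PeriodSeconds,
		l.FailureThreshold, l.PeriodSeconds,
		r.FailureThreshold, r.PeriodSeconds)
}
```

The key design point is the split of responsibilities: the startup probe absorbs slow boots so the liveness probe can stay relatively aggressive about genuinely stuck processes, while the readiness probe reacts quickly to blips without ever restarting the node.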
So for node-level endpoints (headless service and individual pod services for ingresses), the readiness check is not used for routing, since we use the
Yeah, this is definitely not something we want to take lightly. We only want to restart Solr nodes when absolutely necessary.
Resolves #510
This is a WIP but the goal is to have smarter probes that work better for most installations of Solr and the Prometheus Exporter.