@HoustonPutman (Contributor) commented Aug 17, 2020

Issue number of the reported bug or feature request: Resolves #66

Describe your changes
There has been much discussion around the way in which rolling upgrades should occur for Solr within Kubernetes.

Currently, the operator lets the StatefulSet controller manage upgrades with a rolling, one-by-one strategy.
In the ticket linked above, there was discussion around adding a parameter to the Solr healthcheck handler to make sure that all Solr cores in a pod are active before issuing a restart command. This handler would make the default StatefulSet rolling upgrade strategy safer for SolrClouds, but it doesn't actually solve the problem of optimally ordering pods for restarts, or of batching all pods that can be restarted at the same time.

This PR gives the Solr Operator control over which pods are deleted and when. The operator sets the StatefulSet update strategy to OnDelete and, whenever pods in a SolrCloud's StatefulSet are not up to date, chooses a set of pods to delete at any given time. Currently, pods are upgraded in the following order (first to upgrade → last to upgrade):

1. Pods that are already offline
2. Pods hosting no replicas
3. Pods ordered by number of leader replicas, then by total number of replicas (fewest first)
4. Pods that are not live, then pods that are live
5. The Overseer node, last

This feature needs good documentation so that users are well aware of the expected behavior and how this mechanism works.

The following options have been added to the SolrCloud CRD:

```go
type SolrUpdateStrategy struct {
	// Method defines the way in which SolrClouds should be updated when the podSpec changes.
	// +optional
	Method SolrUpdateMethod `json:"method,omitempty"`

	// Options for Solr Operator Managed rolling updates.
	// +optional
	ManagedUpdateOptions ManagedUpdateOptions `json:"managed,omitempty"`
}

// SolrUpdateMethod is a string enumeration type that enumerates
// all possible ways that a SolrCloud can have rolling updates managed.
// +kubebuilder:validation:Enum=Managed;StatefulSet;Manual
type SolrUpdateMethod string

const (
	// Let the Solr Operator manage rolling updates to keep collections/shards available while updating pods in parallel.
	// This is the default option.
	ManagedUpdate SolrUpdateMethod = "Managed"

	// Use the default StatefulSet rolling update logic. One pod at a time, starting with the highest ordinal.
	StatefulSetUpdate SolrUpdateMethod = "StatefulSet"

	// The Solr Operator and Kubernetes will not delete pods for updates. The user is responsible for this.
	ManualUpdate SolrUpdateMethod = "Manual"
)

func (opts *SolrUpdateStrategy) withDefaults() (changed bool) {
	// Default to the Managed update method when none is specified.
	if opts.Method == "" {
		changed = true
		opts.Method = ManagedUpdate
	}

	return changed
}

// ManagedUpdateOptions controls the desired behavior of a managed rolling update.
type ManagedUpdateOptions struct {

	// The maximum number of pods that can be unavailable during the update.
	// Value can be an absolute number (ex: 5) or a percentage of the desired number of pods (ex: 10%).
	// An absolute number is calculated from a percentage by rounding down.
	// If the provided number is 0 or negative, then all pods will be allowed to be updated in unison.
	//
	// Defaults to 25%.
	//
	// +optional
	MaxPodsUnavailable *intstr.IntOrString `json:"maxPodsUnavailable,omitempty"`

	// The maximum number of replicas of each shard that can be unavailable during the update.
	// Value can be an absolute number (ex: 5) or a percentage of the replicas in a shard (ex: 25%).
	// An absolute number is calculated from a percentage by rounding down.
	// If the provided number is 0 or negative, then all replicas will be allowed to be updated in unison.
	//
	// Defaults to 1.
	//
	// +optional
	MaxShardReplicasUnavailable *intstr.IntOrString `json:"maxShardReplicasUnavailable,omitempty"`
}
```
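To make the `maxPodsUnavailable` semantics concrete, here is a standalone, illustrative version of how such an int-or-percentage value could be resolved into a pod count. The real CRD uses `intstr.IntOrString` from `k8s.io/apimachinery`; the helper below is a hypothetical re-implementation for this sketch, not the operator's actual code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resolveMaxUnavailable is an illustrative helper (not the operator's real
// code) that resolves a maxPodsUnavailable-style value against a total pod
// count. Percentages round down, and a resolved value <= 0 means "no limit",
// i.e. all pods may be updated in unison.
func resolveMaxUnavailable(value string, total int) (int, error) {
	var n int
	if strings.HasSuffix(value, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(value, "%"))
		if err != nil {
			return 0, fmt.Errorf("invalid percentage %q: %v", value, err)
		}
		n = total * pct / 100 // integer division rounds down
	} else {
		var err error
		n, err = strconv.Atoi(value)
		if err != nil {
			return 0, fmt.Errorf("invalid number %q: %v", value, err)
		}
	}
	if n <= 0 {
		return total, nil // 0 or negative: all pods may be updated at once
	}
	return n, nil
}

func main() {
	// With the default of 25% and a 5-pod cluster, 5*25/100 rounds down to 1,
	// so only one pod may be taken down at a time.
	n, _ := resolveMaxUnavailable("25%", 5)
	fmt.Println(n) // 1
}
```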

Testing performed

Unit tests are included that aim to ensure the pod selection and ordering logic is correct, as well as the parsing of the CRD options.

I have tested this manually, but haven't exhausted all possible cases.
Before this is merged, it likely needs an integration test suite that ensures future changes will not break this functionality.

@thelabdude (Contributor) commented:
LGTM +1 ... ran through various scenarios with a 5-node cluster and multiple collections. Also tried with no collections and a single-node cluster just to verify there are no weird edge cases.

@HoustonPutman (Contributor, Author) commented:

Also tested some edge cases, like using an image that doesn't exist and then reverting back. Works even better than the StatefulSet logic! 😄

@sepulworld (Contributor) commented:

Need any help testing this out? Looks awesome. We would probably start using ManagedUpdate in production right from the get-go! :)

@HoustonPutman (Contributor, Author) commented:

It'd be great if y'all could test it out. I want to add some information to the status about whether a node is up to date with the spec. (Right now it only looks at the image tag, not other parts of the pod spec.) Other than that, I think this is close to ready to merge.

Obviously every use case we have to test it out makes me even more confident! 😄

@HoustonPutman (Contributor, Author) commented:

@sepulworld The status changes are in. Let me know when you get a chance to test this out!

@HoustonPutman HoustonPutman changed the title Comprehensive restart strategy for SolrClouds Introduce managed SolrCloud update strategy Oct 28, 2020
HoustonPutman added a commit to HoustonPutman/solr-operator that referenced this pull request Oct 28, 2020
HoustonPutman added a commit to HoustonPutman/solr-operator that referenced this pull request Nov 30, 2020
HoustonPutman added a commit to HoustonPutman/solr-operator that referenced this pull request Nov 30, 2020
@thelabdude (Contributor) left a review comment:
Great work as usual, Houston! I kicked the tires on this PR in GKE over the past few days and it looks great to me.

HoustonPutman and others added 9 commits December 7, 2020 09:48
- Issue with passing an array to a method and assigning a value to the array
- Deep dive into Ingress rules instead of an overall DeepEquals

Signed-off-by: Houston Putman <[email protected]>
Also improves the documentation for the change to the ordering of live nodes, as well as the change for non-started, out-of-date pods (which are updated automatically).

Signed-off-by: Houston Putman <[email protected]>
Solr 8.6 introduced a new security feature, where the backup & restore
features could not be used with arbitrary paths. Only the SOLR_HOME, the
SOLR_DATA_HOME or explicitly defined paths could be provided.

Instead of using the "allowedPaths" input to specify additional paths,
the solr operator now puts backup & restore data within SOLR_HOME (the
data volume). This allows us to support both 8.5- and 8.6+. Additionally,
the data PVCs will not retain the backup information, because volumes do
not persist data that is mounted to other volumes within their
directories.

Signed-off-by: Houston Putman <[email protected]>
@bsankara bsankara merged commit 6024117 into apache:master Dec 8, 2020
bsankara pushed a commit that referenced this pull request Dec 8, 2020
@HoustonPutman HoustonPutman deleted the manual-pod-deletion branch December 8, 2020 20:16

Successfully merging this pull request may close these issues.

Complications around readinessChecks and pod restarts