@HoustonPutman (Contributor) commented Aug 17, 2020

Issue number of the reported bug or feature request: Resolves #66

Describe your changes
There has been much discussion around the way in which rolling upgrades should occur for Solr within Kubernetes.

Currently, the operator lets the StatefulSet controller manage upgrades with a rolling, one-by-one strategy.
In the ticket linked above, there was discussion around adding a parameter to the Solr healthcheck handler to make sure that all Solr cores in a pod are active before issuing a restart command. This handler would make the default StatefulSet rolling upgrade strategy safer for SolrClouds, but it doesn't actually solve the problem of optimally ordering pods for restarts, or of batching all pods that can be restarted at the same time.

This PR gives the Solr Operator control over which pods are deleted and when. The operator sets the StatefulSet update strategy to OnDelete and, whenever pods in a SolrCloud's StatefulSet are not up to date, chooses a set of pods to delete at any given time. Currently, pods are upgraded in the following order (first to upgrade → last to upgrade):

1. Pods that are already offline
2. Pods hosting no replicas
3. Pods ordered by number of leader replicas, then by total number of replicas (fewest first)
4. Pods that are not live, then pods that are live
5. The Overseer node, last

This feature needs good documentation so that users are well aware of the expected behavior and how this mechanism works.

The following options have been added to the SolrCloud CRD:

```go
type SolrUpdateStrategy struct {
	// Method defines the way in which SolrClouds should be updated when the podSpec changes.
	// +optional
	Method SolrUpdateMethod `json:"method,omitempty"`

	// Options for Solr Operator Managed rolling updates.
	// +optional
	ManagedUpdateOptions ManagedUpdateOptions `json:"managed,omitempty"`
}

// SolrUpdateMethod is a string enumeration type that enumerates
// all possible ways that a SolrCloud can have rolling updates managed.
// +kubebuilder:validation:Enum=Managed;StatefulSet;Manual
type SolrUpdateMethod string

const (
	// Let the Solr Operator manage rolling updates to keep collections/shards available while updating pods in parallel.
	// This is the default option.
	ManagedUpdate SolrUpdateMethod = "Managed"

	// Use the default StatefulSet rolling update logic. One pod at a time, starting with the highest ordinal.
	StatefulSetUpdate SolrUpdateMethod = "StatefulSet"

	// The Solr Operator and Kubernetes will not delete pods for updates. The user is responsible for this.
	ManualUpdate SolrUpdateMethod = "Manual"
)

func (opts *SolrUpdateStrategy) withDefaults() (changed bool) {
	// Default to the Managed update method when none is specified.
	if opts.Method == "" {
		changed = true
		opts.Method = ManagedUpdate
	}

	return changed
}

// ManagedUpdateOptions controls the desired behavior of a managed rolling update.
type ManagedUpdateOptions struct {

	// The maximum number of pods that can be unavailable during the update.
	// Value can be an absolute number (ex: 5) or a percentage of the desired number of pods (ex: 10%).
	// An absolute number is calculated from a percentage by rounding down.
	// If the provided number is 0 or negative, then all pods will be allowed to be updated in unison.
	//
	// Defaults to 25%.
	//
	// +optional
	MaxPodsUnavailable *intstr.IntOrString `json:"maxPodsUnavailable,omitempty"`

	// The maximum number of replicas of each shard that can be unavailable during the update.
	// Value can be an absolute number (ex: 5) or a percentage of the replicas in a shard (ex: 25%).
	// An absolute number is calculated from a percentage by rounding down.
	// If the provided number is 0 or negative, then all replicas will be allowed to be updated in unison.
	//
	// Defaults to 1.
	//
	// +optional
	MaxShardReplicasUnavailable *intstr.IntOrString `json:"maxShardReplicasUnavailable,omitempty"`
}
```
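To make the `maxPodsUnavailable` semantics concrete, here is a standalone, illustrative version of how such an int-or-percentage value could be resolved into a pod count. The real CRD uses `intstr.IntOrString` from `k8s.io/apimachinery`; the helper below is a hypothetical re-implementation for this sketch, not the operator's actual code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resolveMaxUnavailable is an illustrative helper (not the operator's real
// code) that resolves a maxPodsUnavailable-style value against a total pod
// count. Percentages round down, and a resolved value <= 0 means "no limit",
// i.e. all pods may be updated in unison.
func resolveMaxUnavailable(value string, total int) (int, error) {
	var n int
	if strings.HasSuffix(value, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(value, "%"))
		if err != nil {
			return 0, fmt.Errorf("invalid percentage %q: %v", value, err)
		}
		n = total * pct / 100 // integer division rounds down
	} else {
		var err error
		n, err = strconv.Atoi(value)
		if err != nil {
			return 0, fmt.Errorf("invalid number %q: %v", value, err)
		}
	}
	if n <= 0 {
		return total, nil // 0 or negative: all pods may be updated at once
	}
	return n, nil
}

func main() {
	// With the default of 25% and a 5-pod cluster, 5*25/100 rounds down to 1,
	// so only one pod may be taken down at a time.
	n, _ := resolveMaxUnavailable("25%", 5)
	fmt.Println(n) // 1
}
```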

Testing performed

Unit tests are included that aim to ensure the pod selection and ordering logic is correct, as well as the parsing of the CRD options.

I have tested this manually, but haven't exhausted all possible cases.
Before this is merged, it likely needs an integration test suite that ensures future changes will not break this functionality.

@thelabdude (Contributor) commented:
LGTM +1 ... ran through various scenarios with a 5-node cluster and multiple collections. Also tried with no collections and a single-node cluster just to verify there are no weird edge cases.

@HoustonPutman (Contributor, Author) commented:

Also tested some edge cases, like using an image that doesn't exist and then reverting back. Works even better than the StatefulSet logic! 😄

@sepulworld (Contributor) commented:

Need any help testing this out? Looks awesome. We would probably start using ManagedUpdate in production right from the get-go! :)

@HoustonPutman (Contributor, Author) commented:

It'd be great if y'all could test it out. I want to add some information to the status about whether a node is up to date with the spec. (Right now it only looks at the image tag, not other parts of the pod spec.) Other than that, I think this is close to ready to merge.

Obviously every use case we have to test it out makes me even more confident! 😄

@HoustonPutman (Contributor, Author) commented:

@sepulworld The status changes are in. Let me know when you get a chance to test this out!

@HoustonPutman HoustonPutman changed the title Comprehensive restart strategy for SolrClouds Introduce managed SolrCloud update strategy Oct 28, 2020
HoustonPutman added a commit to HoustonPutman/solr-operator that referenced this pull request Oct 28, 2020
HoustonPutman added a commit to HoustonPutman/solr-operator that referenced this pull request Nov 30, 2020
HoustonPutman added a commit to HoustonPutman/solr-operator that referenced this pull request Nov 30, 2020
@thelabdude (Contributor) left a review comment:
Great work as usual, Houston! I kicked the tires on this PR in GKE over the past few days and it looks great to me.

HoustonPutman and others added 9 commits December 7, 2020 09:48
- Issue with passing an array to a method and assigning a value to the array
- Deep dive into Ingress rules instead of an overall DeepEquals

Signed-off-by: Houston Putman <[email protected]>
Also improves the documentation for the change to the ordering of live nodes, as well as the change for non-started, out-of-date pods (which are updated automatically).

Signed-off-by: Houston Putman <[email protected]>
Solr 8.6 introduced a new security feature, where the backup & restore
features could not be used with arbitrary paths. Only the SOLR_HOME, the
SOLR_DATA_HOME or explicitly defined paths could be provided.

Instead of using the "allowedPaths" input to specify additional paths,
the solr operator now puts backup & restore data within SOLR_HOME (the
data volume). This allows us to support both 8.5- and 8.6+. Additionally,
the data PVCs will not retain the backup information, because volumes do
not persist data that is mounted to other volumes within their
directories.

Signed-off-by: Houston Putman <[email protected]>
@bsankara bsankara merged commit 6024117 into apache:master Dec 8, 2020
bsankara pushed a commit that referenced this pull request Dec 8, 2020
@HoustonPutman HoustonPutman deleted the manual-pod-deletion branch December 8, 2020 20:16

Successfully merging this pull request may close these issues.

Complications around readinessChecks and pod restarts