Rolling update failure thresholds and rollback#1380

Merged
aluzzardi merged 3 commits into moby:master from aaronlehmann:rollback-protobufs
Sep 16, 2016
Conversation

@aaronlehmann (Collaborator)

This adds a previous spec to the Service object to support rollback.
It also adds new failure actions and update states corresponding to
rollback situations. Finally, it adds a threshold determining whether to
invoke the failure action.

I also thought about adding a timeout on task startup, so that if a
task does not reach RUNNING within the given amount of time, it
counts as a failure and is taken into account when evaluating whether
the update is above or below its failure threshold. However, a timeout
on task startup shouldn't be specific to rolling updates, and is
somewhat difficult to handle correctly across all the possible
situations (new tasks, updated tasks, restarted tasks, leader
failover...). I decided to leave this for the future.

cc @dongluochen @stevvooe @aluzzardi

@codecov-io
codecov-io commented Aug 16, 2016

Current coverage is 55.77% (diff: 85.57%)

Merging #1380 into master will increase coverage by 0.38%

@@             master      #1380   diff @@
==========================================
  Files            82         82          
  Lines         12885      12971    +86   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           7138       7235    +97   
+ Misses         4757       4739    -18   
- Partials        990        997     +7   


Powered by Codecov. Last update 27fbaef...4bf8072

//
// If the failure action is CONTINUE, there is no effect.
// If the failure action is PAUSE, no more tasks will be updated until
// another update is started.
Contributor:

Does the next update start with the failing task?

Collaborator Author:

I'm not sure I understand the question. What I mean here is that no more tasks will be updated until the user changes the ServiceSpec.

Contributor:

When the user changes the ServiceSpec again, i.e., another update is started, does the updater start with the failing tasks first?

Collaborator Author:

If the ServiceSpec changes, it's considered a fresh update. Only tasks with the new spec will count towards the number of failures.

Contributor:

I think we should keep the service as healthy as possible. If the previous rollout PAUSED because of failures, the next rollout should resolve those failures first.

Collaborator Author:

I think we could do this by sorting the instances by health and doing the updates in that order. It shouldn't have an impact on the protobufs, though.

Contributor:

Agree.

@stevvooe (Contributor)

I also thought about adding a timeout on task startup, so that if a
task does not reach RUNNING within the given amount of time, it
counts as a failure and is taken into account when evaluating whether
the update is above or below its failure threshold.

This should be taken care of by the executor. If the task doesn't start in a certain amount of time we report failure.

@dongluochen (Contributor)

LGTM

@aaronlehmann (Collaborator Author)

Updated to record the service version (from Meta) in the UpdateStatus, and propagate this version information to tasks that are created by the update. This will allow us to match up tasks with the update that created them.

ServiceAnnotations: service.Spec.Annotations,
Spec: service.Spec.Task,
ServiceID: service.ID,
ServiceVersion: service.Meta.Version,
Contributor:

Does this need to be set from UpdateStatus?

Collaborator Author:

Yeah, that was the intent but I messed this up. Fixing now.

Collaborator Author:

Does this need to be set from UpdateStatus?

Fixed.

@aaronlehmann force-pushed the rollback-protobufs branch 2 times, most recently from fbf5f32 to 2f47546 on August 19, 2016 23:13
@aaronlehmann (Collaborator Author)

Rebased. ping @aluzzardi

@aaronlehmann (Collaborator Author)

Discussed rollbacks with @aluzzardi and came up with a few conclusions:

  • Counting any failures that happen during the update period is too non-deterministic. We're thinking of supporting a configurable monitoring period that applies on a per-task basis. In other words, if this is set to 30 seconds, a task failing during the first 30 seconds after it's updated will count as a failure, and after that it will not. This is similar to only watching for failures during the "update delay" that optionally happens after a task is updated, but it lets us decouple the pause between updates from the monitoring period, and put in place a more useful default for the monitoring period (as opposed to the 0s default for pausing between updates).
  • If a task fails during the update, and then continues to fail as it's restarted, that should only count as a single failure.
  • Looking at failed tasks in the data store isn't a good way of counting failures. This is too sensitive to task reaper settings, among other things. We should either count failures with local state, a counter in UpdateStatus on the service object, or some other approach that doesn't rely on historic task objects.
  • In the model where we have a configurable monitoring period that applies to each task separately, it's not clear how to make this work seamlessly when a manager fails over. Keeping a failure count in the service object is one approach, but even this is not bulletproof (a manager can fail while attempting to update this counter). Maintaining this extra state also increases complexity. For now, it may be best not to make any guarantees about accurate failure accounting across leaders, and consider adding this as a future enhancement. The worst case scenario is that a leader failover resets the failure count during an update, and inhibits a rollback that otherwise would have happened. This is a corner case that I don't know if many people care about (however, as a counterpoint, I don't really like adding special cases where things behave differently depending whether the leader changes).

@aaronlehmann (Collaborator Author)

I've updated this according to the comments above. I removed the ServiceVersion stuff, because I don't think it will be necessary under the model where we monitor tasks individually for specific time periods, instead of watching for failures of any tasks associated with the current version of the service. I also added MonitoringPeriod and clarified some comments to more closely match the mode of operation discussed above.

@aluzzardi PTAL

@aaronlehmann (Collaborator Author)

Added a commit to implement AllowedFailureFraction and MonitoringPeriod in the updater. Currently this uses only local state so the failure count will reset if there's a leader failover. However, I made it evaluate the failure fraction as totalFailures / len(dirtySlots), so if another leader takes over partway through the update, the correct proportion of task failures will trigger the failure action. I'm wondering whether this is a good idea or not, since it means that if the leader changes before updating the last task of many, and that last task fails to update, it could trigger a rollback.

Currently this change just deals with pausing the update because rollbacks aren't implemented yet. Implementing rollbacks as a next step is relatively straightforward.

@aaronlehmann (Collaborator Author)

Added a commit that implements rollback and adds a unit test for it. PTAL.

@aaronlehmann changed the title from "Protobuf changes for automatic rolling update rollback" to "Rolling update failure thresholds and rollback" on Aug 24, 2016
@@ -0,0 +1,468 @@
mode: set
Member:

This file was probably added by mistake

@aluzzardi (Member)

I was wondering about this use case:

  • Start a rolling update. spec=2 previous_spec=1
  • In the middle of it, do another update. spec=3 previous_spec=2
  • 3 is broken - we'll rollback to 2, no questions asked
  • turns out 2 was broken too - we just didn't notice it because we didn't finish the rollout

To mitigate this problem, what if we update previous_spec ONLY if an update is not already in progress? In the previous example, we would have kept spec=3 previous_spec=1
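
The proposed mitigation can be sketched with stand-in types (the real ServiceSpec is a protobuf message, and the field names here are illustrative): previous_spec is snapshotted only when no update is already in progress, so an interrupted rollout keeps the last spec that completed a rollout as the rollback target.

```go
package main

import "fmt"

// service is a stand-in for the swarmkit Service object; Spec is an
// int here purely for illustration.
type service struct {
	Spec         int
	PreviousSpec *int // rollback target
	Updating     bool
}

func (s *service) update(newSpec int) {
	if !s.Updating {
		// Only snapshot the old spec when no update is in flight, per
		// the mitigation above.
		old := s.Spec
		s.PreviousSpec = &old
	}
	s.Spec = newSpec
	s.Updating = true
}

func main() {
	s := &service{Spec: 1}
	s.update(2) // start rolling update: spec=2, previous_spec=1
	s.update(3) // second update mid-rollout: spec=3, previous_spec stays 1
	fmt.Println(s.Spec, *s.PreviousSpec) // 3 1
}
```

With this rule, rolling back from the broken spec 3 lands on spec 1 rather than the never-fully-rolled-out spec 2.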

@aaronlehmann (Collaborator Author)

That seems reasonable. I don't know which behavior would be least surprising though.

api/types.proto Outdated
// from its creation, this counts as a failure. If it fails after
// MonitoringPeriod, it does not count as a failure. If
// MonitoringPeriod is unspecified, a default value will be used.
Duration monitoring_period = 4;
Member:

nit: We use Duration delay for the other duration and the two work hand in hand, yet they have a different naming "scheme".

What about monitor or watch?

I'm thinking of consistency also for flag names, e.g. service update --update-delay vs --update-watch --update-monitor, --update-monitoring-period (this PR), ...

Collaborator Author:

Changed to monitor.

Feel like this might be a little terse, but I see the point about flag names.

api/types.proto Outdated

// RollbackStartedAt is the time at which the update started rolling
// back to the previous version.
Timestamp rollback_started_at = 5;
Member:

Do we need paused_at for consistency?

Member:

Also, we had this conversation a while ago (at the time of adding pause); I don't remember the outcome, but: does it make sense to add timestamp fields every time we add a new state?

Wouldn't it help to have a list of UpdateStatuses and each time we transition to a new state, we add another item in order to keep history?

The same would apply to task status, node status, and so on.

For instance, if we had this, UX wise instead of displaying events like this:

Started At: 2016-09-01T00:19:32
Rollback At: 2016-09-01T00:19:34
Completed At: 2016-09-01T00:19:36

We could do something like this:

Events:
TIME                       EVENT               ERROR
2016-09-01T00:19:32        Update Started      
2016-09-01T00:19:34        Rollback Started    Unable to find image "foo"
2016-09-01T00:19:36        Rollback Completed      

Collaborator Author:

Yeah, kind of wish we had done this.

I think we can't remove the existing fields. We could add a new repeated field to supersede them, but we'd still have to populate the existing timestamp fields when applicable.

So we could do this, but we'd end up with a weird set of fields in the protobuf because we'd have started_at and completed_at for backward compatibility, and also a repeated field that lists generic events. Do you think it's worth it?

Collaborator Author:

@aluzzardi: I removed RollbackStartedAt. PTAL.

@aaronlehmann force-pushed the rollback-protobufs branch 2 times, most recently from 0869263 to dd91e23 on September 1, 2016 16:42
@aluzzardi (Member)

Overall LGTM.

I have a few points I'd like to make regarding some preferences. I don't have strong arguments for them, except perhaps #2: I really do think we shouldn't add more timestamps.

  1. Threshold: Should this be an absolute number? I don't have a strong opinion; absolute numbers seem more "predictable" to me (from a user's POV), but I can see the appeal of a fractional threshold.

  2. Status: I'd avoid adding new timestamps into UpdateStatus. Instead, ideally, a Service should have a history (service created, service scaled, service updated, ...). This allows the user to see what happened when. UpdateStatus is only there to provide real time update information (see Rolling update failure thresholds and rollback #1380 (comment)). I'd go as far as saying that there could be a single timestamp for UpdateStatus.

  3. Auto rollback vs manual rollback. The bulk of the PR design LGTM (thresholds, previous spec, ...). The one thing I'm wondering about is: Should we do automatic rollback at all or should we just provide a CLI based service update --rollback? These changes allow the implementation client side (by having the client inspect UpdateStatus and push previous_spec back). A few reasons for this:

3.a) It's "more" user-predictable: A rollback will look to the user as if nothing happened. Either they'll think the update didn't run, or they won't notice that we rolled back (it's tough to notice the service is running one image hash rather than another, and at some point the update was in progress).

3.b) It's more configurable. Going 'back' manually provides the same options as going forward: failure action, parallelism, delay ... Maybe if a deployment goes bad you want to go back in lockstep (parallelism = 0) with no delay.

3.c) If we have manual rollback, we can always add auto rollback later. We can't take it away though

@aaronlehmann (Collaborator Author)

  3. Auto rollback vs manual rollback. The bulk of the PR design LGTM (thresholds, previous spec, ...). The one thing I'm wondering about is: Should we do automatic rollback at all or should we just provide a CLI based service update --rollback? These changes allow the implementation client side (by having the client inspect UpdateStatus and push previous_spec back). A few reasons for this:

3.a) It's "more" user-predictable: A rollback will look to the user as if nothing happened. Either they'll think the update didn't run, or they won't notice that we rolled back (it's tough to notice the service is running one image hash rather than another, and at some point the update was in progress).

3.b) It's more configurable. Going 'back' manually provides the same options as going forward: failure action, parallelism, delay ... Maybe if a deployment goes bad you want to go back in lockstep (parallelism = 0) with no delay.

3.c) If we have manual rollback, we can always add auto rollback later. We can't take it away though

I'm almost convinced but I have a few questions/comments about manual rollback.

  1. Implementing rollback on the client side would mean that "previous spec" becomes the spec from the aborted update, so running --rollback multiple times would alternate between the two specs. Is that the behavior we want?

  2. After a successful rollback, the update status would show "update complete". This seems slightly confusing because from the user's perspective the update wasn't completed - it was rolled back.

If we decide to go with client-side rollback, I'm wondering about the possibility of leaving the server-side rollback parts of this PR in place so we can continue to experiment with it, but not exposing the feature through engine-api, so we have the option in the future to remove it. What do you think?
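
The alternation raised in point 1 can be sketched as a simple swap (illustrative types, not swarmkit's API): a client-side rollback pushes PreviousSpec back as Spec, which makes the aborted spec the new rollback target.

```go
package main

import "fmt"

// service is a stand-in for the Service object; the int Specs are
// placeholders for ServiceSpec messages.
type service struct {
	Spec, PreviousSpec int
}

// rollback models the client pushing previous_spec back: the current
// spec becomes the new previous_spec.
func (s *service) rollback() {
	s.Spec, s.PreviousSpec = s.PreviousSpec, s.Spec
}

func main() {
	s := &service{Spec: 3, PreviousSpec: 2} // mid-update: 3 is broken
	s.rollback()
	fmt.Println(s.Spec, s.PreviousSpec) // 2 3
	s.rollback()
	fmt.Println(s.Spec, s.PreviousSpec) // 3 2 -- repeated rollbacks alternate
}
```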

@aaronlehmann (Collaborator Author)

Updated PR to mark automatic rollback as experimental. The integration won't expose it for now. It will provide a way to start a rollback manually, instead (--rollback).

Aaron Lehmann added 3 commits September 6, 2016 11:46
This adds a previous spec to the Service object to support rollback.
It also adds new failure actions and update states corresponding to
rollback situations. Finally, it adds a threshold determining whether to
invoke the failure action.

Signed-off-by: Aaron Lehmann <[email protected]>
Add a unit test

Signed-off-by: Aaron Lehmann <[email protected]>
@aluzzardi (Member) left a review comment:

LGTM

Some (same) comments but I don't want to delay this further:

  • I think # of failures might be better than fraction: It's a little more obvious and easier to "introspect" (by looking at the number of tasks that failed). I also assume that 99% of services will have less than 100 replicas and, given a small number, absolute thresholds might be easier to reason about.
  • allowed_failure_fraction is not a great name - e.g. --update-allowed-failure-fraction?

@aluzzardi aluzzardi merged commit 1f4a6e9 into moby:master Sep 16, 2016
@aaronlehmann aaronlehmann deleted the rollback-protobufs branch September 16, 2016 23:55

6 participants