Rolling update failure thresholds and rollback#1380

Merged
aluzzardi merged 3 commits into moby:master from aaronlehmann:rollback-protobufs
Sep 16, 2016
Conversation

@aaronlehmann (Collaborator)

This adds a previous spec to the Service object to support rollback.
It also adds new failure actions and update states corresponding to
rollback situations. Finally, it adds a threshold determining whether to
invoke the failure action.

I also thought about adding a timeout on task startup, so that if a
task does not reach RUNNING within the given amount of time, it
counts as a failure and is taken into account when evaluating whether
the update is above or below its failure threshold. However, a timeout
on task startup shouldn't be specific to rolling updates, and is
somewhat difficult to handle correctly across all the possible
situations (new tasks, updated tasks, restarted tasks, leader
failover...). I decided to leave this for the future.

cc @dongluochen @stevvooe @aluzzardi

@codecov-io
codecov-io commented Aug 16, 2016

Current coverage is 55.77% (diff: 85.57%)

Merging #1380 into master will increase coverage by 0.38%

@@             master      #1380   diff @@
==========================================
  Files            82         82          
  Lines         12885      12971    +86   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           7138       7235    +97   
+ Misses         4757       4739    -18   
- Partials        990        997     +7   


Powered by Codecov. Last update 27fbaef...4bf8072

//
// If the failure action is CONTINUE, there is no effect.
// If the failure action is PAUSE, no more tasks will be updated until
// another update is started.
Contributor:

Does the next update start with the failing task?

Collaborator Author:

I'm not sure I understand the question. What I mean here is that no more tasks will be updated until the user changes the ServiceSpec.

Contributor:

When the user changes the ServiceSpec again, i.e., another update is started, does the updater start with the failing tasks first?

Collaborator Author:

If the ServiceSpec changes, it's considered a fresh update. Only tasks with the new spec will count towards the number of failures.

Contributor:

I think we should keep the service as healthy as possible. If the previous rollout PAUSED because of failures, the next rollout should resolve those failures first.

Collaborator Author:

I think we could do this by sorting the instances by health and doing the updates in that order. It shouldn't have an impact on the protobufs, though.

Contributor:

Agree.

@stevvooe (Contributor)

I also thought about adding a timeout on task startup, so that if a
task does not reach RUNNING within the given amount of time, it
counts as a failure and is taken into account when evaluating whether
the update is above or below its failure threshold.

This should be taken care of by the executor. If the task doesn't start in a certain amount of time we report failure.

@dongluochen (Contributor)

LGTM

@aaronlehmann (Collaborator Author)

Updated to record the service version (from Meta) in the UpdateStatus, and propagate this version information to tasks that are created by the update. This will allow us to match up tasks with the update that created them.

ServiceAnnotations: service.Spec.Annotations,
Spec: service.Spec.Task,
ServiceID: service.ID,
ServiceVersion: service.Meta.Version,
Contributor:

Does this need to be set from UpdateStatus?

Collaborator Author:

Yeah, that was the intent but I messed this up. Fixing now.

Collaborator Author:

Does this need to be set from UpdateStatus?

Fixed.

@aaronlehmann force-pushed the rollback-protobufs branch 2 times, most recently from fbf5f32 to 2f47546 on August 19, 2016 23:13
@aaronlehmann (Collaborator Author)

Rebased. ping @aluzzardi

@aaronlehmann (Collaborator Author)

Discussed rollbacks with @aluzzardi and came up with a few conclusions:

  • Counting any failures that happen during the update period is too non-deterministic. We're thinking of supporting a configurable monitoring period that applies on a per-task basis. In other words, if this is set to 30 seconds, a task failing during the first 30 seconds after it's updated will count as a failure, and after that it will not. This is similar to only watching for failures during the "update delay" that optionally happens after a task is updated, but it lets us decouple the pause between updates from the monitoring period, and put in place a more useful default for the monitoring period (as opposed to the 0s default for pausing between updates).
  • If a task fails during the update, and then continues to fail as it's restarted, that should only count as a single failure.
  • Looking at failed tasks in the data store isn't a good way of counting failures. This is too sensitive to task reaper settings, among other things. We should either count failures with local state, a counter in UpdateStatus on the service object, or some other approach that doesn't rely on historic task objects.
  • In the model where we have a configurable monitoring period that applies to each task separately, it's not clear how to make this work seamlessly when a manager fails over. Keeping a failure count in the service object is one approach, but even this is not bulletproof (a manager can fail while attempting to update this counter). Maintaining this extra state also increases complexity. For now, it may be best not to make any guarantees about accurate failure accounting across leaders, and consider adding this as a future enhancement. The worst case scenario is that a leader failover resets the failure count during an update, and inhibits a rollback that otherwise would have happened. This is a corner case that I don't know if many people care about (however, as a counterpoint, I don't really like adding special cases where things behave differently depending whether the leader changes).

@aaronlehmann (Collaborator Author)

I've updated this according to the comments above. I removed the ServiceVersion stuff, because I don't think it will be necessary under the model where we monitor tasks individually for specific time periods, instead of watching for failures of any tasks associated with the current version of the service. I also added MonitoringPeriod and clarified some comments to more closely match the mode of operation discussed above.

@aluzzardi PTAL

@aaronlehmann (Collaborator Author)

Added a commit to implement AllowedFailureFraction and MonitoringPeriod in the updater. Currently this uses only local state so the failure count will reset if there's a leader failover. However, I made it evaluate the failure fraction as totalFailures / len(dirtySlots), so if another leader takes over partway through the update, the correct proportion of task failures will trigger the failure action. I'm wondering whether this is a good idea or not, since it means that if the leader changes before updating the last task of many, and that last task fails to update, it could trigger a rollback.

Currently this change just deals with pausing the update because rollbacks aren't implemented yet. Implementing rollbacks as a next step is relatively straightforward.

@aaronlehmann (Collaborator Author)

Added a commit that implements rollback and adds a unit test for it. PTAL.

@aaronlehmann changed the title from "Protobuf changes for automatic rolling update rollback" to "Rolling update failure thresholds and rollback" on Aug 24, 2016
@@ -0,0 +1,468 @@
mode: set
Member:

This file was probably added by mistake

@aluzzardi (Member)

I was wondering about this use case:

  • Start a rolling update. spec=2 previous_spec=1
  • In the middle of it, do another update. spec=3 previous_spec=2
  • 3 is broken - we'll rollback to 2, no questions asked
  • turns out 2 was broken too - we just didn't notice it because we didn't finish the rollout

To mitigate this problem, what if we update previous_spec ONLY if an update is not already in progress? In the previous example, we would have kept spec=3 previous_spec=1
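
The proposed mitigation can be sketched with stand-in types (the real ServiceSpec is a protobuf message, and the field names here are illustrative): previous_spec is snapshotted only when no update is already in progress, so an interrupted rollout keeps the last spec that completed a rollout as the rollback target.

```go
package main

import "fmt"

// service is a stand-in for the swarmkit Service object; Spec is an
// int here purely for illustration.
type service struct {
	Spec         int
	PreviousSpec *int // rollback target
	Updating     bool
}

func (s *service) update(newSpec int) {
	if !s.Updating {
		// Only snapshot the old spec when no update is in flight, per
		// the mitigation above.
		old := s.Spec
		s.PreviousSpec = &old
	}
	s.Spec = newSpec
	s.Updating = true
}

func main() {
	s := &service{Spec: 1}
	s.update(2) // start rolling update: spec=2, previous_spec=1
	s.update(3) // second update mid-rollout: spec=3, previous_spec stays 1
	fmt.Println(s.Spec, *s.PreviousSpec) // 3 1
}
```

With this rule, rolling back from the broken spec 3 lands on spec 1 rather than the never-fully-rolled-out spec 2.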

@aaronlehmann (Collaborator Author)

That seems reasonable. I don't know which behavior would be least surprising though.

api/types.proto Outdated
// from its creation, this counts as a failure. If it fails after
// MonitoringPeriod, it does not count as a failure. If
// MonitoringPeriod is unspecified, a default value will be used.
Duration monitoring_period = 4;
Member:

nit: We use Duration delay for the other duration and the two work hand in hand, yet they have a different naming "scheme".

What about monitor or watch?

I'm thinking of consistency also for flag names, e.g. service update --update-delay vs --update-watch --update-monitor, --update-monitoring-period (this PR), ...

Collaborator Author:

Changed to monitor.

Feel like this might be a little terse, but I see the point about flag names.

api/types.proto Outdated

// RollbackStartedAt is the time at which the update started rolling
// back to the previous version.
Timestamp rollback_started_at = 5;
Member:

Do we need paused_at for consistency?

Member:

Also, we had this conversation a while ago (at the time of adding pause); I don't remember the outcome, but: does it make sense to add timestamp fields every time we add a new state?

Wouldn't it help to have a list of UpdateStatuses and each time we transition to a new state, we add another item in order to keep history?

The same would apply to task status, node status, and so on.

For instance, if we had this, UX wise instead of displaying events like this:

Started At: 2016-09-01T00:19:32
Rollback At: 2016-09-01T00:19:34
Completed At: 2016-09-01T00:19:36

We could do something like this:

Events:
TIME                       EVENT               ERROR
2016-09-01T00:19:32        Update Started      
2016-09-01T00:19:34        Rollback Started    Unable to find image "foo"
2016-09-01T00:19:36        Rollback Completed      

Collaborator Author:

Yeah, kind of wish we had done this.

I think we can't remove the existing fields. We could add a new repeated field to supersede them, but we'd still have to populate the existing timestamp fields when applicable.

So we could do this, but we'd end up with a weird set of fields in the protobuf because we'd have started_at and completed_at for backward compatibility, and also a repeated field that lists generic events. Do you think it's worth it?

Collaborator Author:

@aluzzardi: I removed RollbackStartedAt. PTAL.

@aaronlehmann force-pushed the rollback-protobufs branch 2 times, most recently from 0869263 to dd91e23 on September 1, 2016 16:42
@aluzzardi (Member)

Overall LGTM.

I have a few points I'd like to make regarding some preferences. I don't have strong arguments for them, except perhaps #2: I really do think we shouldn't add more timestamps.

  1. Threshold: Should this be an absolute number? I don't have a strong opinion; absolute numbers seem more "predictable" to me (from a user's POV), but I can see the appeal of a fractional threshold.

  2. Status: I'd avoid adding new timestamps into UpdateStatus. Instead, ideally, a Service should have a history (service created, service scaled, service updated, ...). This allows the user to see what happened when. UpdateStatus is only there to provide real time update information (see Rolling update failure thresholds and rollback #1380 (comment)). I'd go as far as saying that there could be a single timestamp for UpdateStatus.

  3. Auto rollback vs manual rollback. The bulk of the PR design LGTM (thresholds, previous spec, ...). The one thing I'm wondering about is: Should we do automatic rollback at all or should we just provide a CLI based service update --rollback? These changes allow the implementation client side (by having the client inspect UpdateStatus and push previous_spec back). A few reasons for this:

3.a) It's "more" user-predictable: A rollback will look to the user as if nothing happened. Either they'll think the update didn't run, or they won't notice that we rolled back (it's tough to notice the service is running one image hash rather than another, and at some point the update was in progress).

3.b) It's more configurable. Going 'back' manually provides the same options as going forward: failure action, parallelism, delay ... Maybe if a deployment goes bad you want to go back in lockstep (parallelism = 0) with no delay.

3.c) If we have manual rollback, we can always add auto rollback later. We can't take it away though

@aaronlehmann (Collaborator Author)

  3. Auto rollback vs manual rollback. The bulk of the PR design LGTM (thresholds, previous spec, ...). The one thing I'm wondering about is: Should we do automatic rollback at all or should we just provide a CLI based service update --rollback? These changes allow the implementation client side (by having the client inspect UpdateStatus and push previous_spec back). A few reasons for this:

3.a) It's "more" user-predictable: A rollback will look to the user as if nothing happened. Either they'll think the update didn't run, or they won't notice that we rolled back (it's tough to notice the service is running one image hash rather than another, and at some point the update was in progress).

3.b) It's more configurable. Going 'back' manually provides the same options as going forward: failure action, parallelism, delay ... Maybe if a deployment goes bad you want to go back in lockstep (parallelism = 0) with no delay.

3.c) If we have manual rollback, we can always add auto rollback later. We can't take it away though

I'm almost convinced but I have a few questions/comments about manual rollback.

  1. Implementing rollback on the client side would mean that "previous spec" becomes the spec from the aborted update, so running --rollback multiple times would alternate between the two specs. Is that the behavior we want?

  2. After a successful rollback, the update status would show "update complete". This seems slightly confusing because from the user's perspective the update wasn't completed - it was rolled back.

If we decide to go with client-side rollback, I'm wondering about the possibility of leaving the server-side rollback parts of this PR in place so we can continue to experiment with it, but not exposing the feature through engine-api, so we have the option in the future to remove it. What do you think?
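
The alternation raised in point 1 can be sketched as a simple swap (illustrative types, not swarmkit's API): a client-side rollback pushes PreviousSpec back as Spec, which makes the aborted spec the new rollback target.

```go
package main

import "fmt"

// service is a stand-in for the Service object; the int Specs are
// placeholders for ServiceSpec messages.
type service struct {
	Spec, PreviousSpec int
}

// rollback models the client pushing previous_spec back: the current
// spec becomes the new previous_spec.
func (s *service) rollback() {
	s.Spec, s.PreviousSpec = s.PreviousSpec, s.Spec
}

func main() {
	s := &service{Spec: 3, PreviousSpec: 2} // mid-update: 3 is broken
	s.rollback()
	fmt.Println(s.Spec, s.PreviousSpec) // 2 3
	s.rollback()
	fmt.Println(s.Spec, s.PreviousSpec) // 3 2 -- repeated rollbacks alternate
}
```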

@aaronlehmann (Collaborator Author)

Updated PR to mark automatic rollback as experimental. The integration won't expose it for now. It will provide a way to start a rollback manually, instead (--rollback).

Aaron Lehmann added 3 commits September 6, 2016 11:46
This adds a previous spec to the Service object to support rollback.
It also adds new failure actions and update states corresponding to
rollback situations. Finally, it adds a threshold determining whether to
invoke the failure action.

Signed-off-by: Aaron Lehmann <[email protected]>
Add a unit test

Signed-off-by: Aaron Lehmann <[email protected]>
@aluzzardi (Member) left a review comment:

LGTM

Some (same) comments but I don't want to delay this further:

  • I think # of failures might be better than fraction: It's a little more obvious and easier to "introspect" (by looking at the number of tasks that failed). I also assume that 99% of services will have less than 100 replicas and, given a small number, absolute thresholds might be easier to reason about.
  • allowed_failure_fraction is not a great name - e.g. --update-allowed-failure-fraction?

@aluzzardi aluzzardi merged commit 1f4a6e9 into moby:master Sep 16, 2016
@aaronlehmann aaronlehmann deleted the rollback-protobufs branch September 16, 2016 23:55

6 participants