Rolling update failure thresholds and rollback (#1380)
Current coverage is 55.77% (diff: 85.57%)

@@            master    #1380    diff @@
==========================================
  Files           82       82
  Lines        12885    12971     +86
  Methods          0        0
  Messages         0        0
  Branches         0        0
==========================================
+ Hits          7138     7235     +97
+ Misses        4757     4739     -18
- Partials       990      997      +7
//
// If the failure action is CONTINUE, there is no effect.
// If the failure action is PAUSE, no more tasks will be updated until
// another update is started.
Does the next update start with failing task?
I'm not sure I understand the question. What I mean here is that no more tasks will be updated until the user changes the ServiceSpec.
When user changes the ServiceSpec again, i.e., another update is started, does updater start with the failing fraction first?
If the ServiceSpec changes, it's considered a fresh update. Only tasks with the new spec will count towards the number of failures.
I think we should keep service healthy as much as possible. If previous rollout PAUSED because of failure, the next rollout should resolve those failures first.
I think we could do this by sorting the instances by health and doing the updates in that order. It shouldn't have an impact on the protobufs, though.
This should be taken care of by the executor. If the task doesn't start within a certain amount of time, we report failure.
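The health-first ordering suggested above could be sketched roughly as follows. This is an illustration only, not swarmkit's actual updater code; the `Task` type and `orderForUpdate` function are hypothetical simplifications:

```go
package main

import (
	"fmt"
	"sort"
)

// Task is a hypothetical, simplified stand-in for swarmkit's task type.
type Task struct {
	ID      string
	Healthy bool
}

// orderForUpdate sketches the idea discussed above: sort instances by
// health so that failing tasks are updated first on the next rollout,
// while the relative order of equally healthy tasks is preserved.
func orderForUpdate(tasks []Task) []Task {
	sorted := append([]Task(nil), tasks...)
	sort.SliceStable(sorted, func(i, j int) bool {
		// Unhealthy tasks sort first.
		return !sorted[i].Healthy && sorted[j].Healthy
	})
	return sorted
}

func main() {
	tasks := []Task{{"t1", true}, {"t2", false}, {"t3", true}}
	for _, t := range orderForUpdate(tasks) {
		fmt.Println(t.ID) // t2 first, since it is unhealthy
	}
}
```

As noted in the thread, an ordering change like this would not affect the protobufs at all; it is purely an updater-side scheduling decision.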
Force-pushed a0f7506 to 026557d
LGTM
Force-pushed 026557d to 159f169
Updated to record the service version (from
manager/orchestrator/replicated.go
Outdated
ServiceAnnotations: service.Spec.Annotations,
Spec:               service.Spec.Task,
ServiceID:          service.ID,
ServiceVersion:     service.Meta.Version,
Does this need to be set from UpdateStatus?
Yeah, that was the intent but I messed this up. Fixing now.
> Does this need to be set from UpdateStatus?

Fixed.
Force-pushed fbf5f32 to 2f47546
Rebased. ping @aluzzardi
Discussed rollbacks with @aluzzardi and came up with a few conclusions:
Force-pushed 2f47546 to dcf8ff7
I've updated this according to the comments above. I removed the @aluzzardi PTAL
Added a commit to implement Currently this change just deals with pausing the update because rollbacks aren't implemented yet. Implementing rollbacks as a next step is relatively straightforward.
Added a commit that implements rollback and adds a unit test for it. PTAL. |
manager/orchestrator/coverage.out
Outdated
@@ -0,0 +1,468 @@
mode: set
This file was probably added by mistake
I was wondering about this use case:

To mitigate this problem, what if we update
That seems reasonable. I don't know which behavior would be least surprising, though.
Force-pushed fc6f4b9 to ef05269
api/types.proto
Outdated
// from its creation, this counts as a failure. If it fails after
// MonitoringPeriod, it does not count as a failure. If
// MonitoringPeriod is unspecified, a default value will be used.
Duration monitoring_period = 4;
nit: We use Duration delay for the other duration, and those two work hand in hand, yet they have a different naming "scheme".
What about monitor or watch?
I'm thinking of consistency also for flag names, e.g. service update --update-delay vs --update-watch --update-monitor, --update-monitoring-period (this PR), ...
Changed to monitor.
Feel like this might be a little terse, but I see the point about flag names.
api/types.proto
Outdated
// RollbackStartedAt is the time at which the update started rolling
// back to the previous version.
Timestamp rollback_started_at = 5;
Do we need paused_at for consistency?
Also, we had this conversation a while ago (at the time of adding pause); I don't remember the outcome, but: does it make sense to add timestamp fields every time we add a new state?
Wouldn't it help to have a list of UpdateStatuses and each time we transition to a new state, we add another item in order to keep history?
The same would apply to task status, node status, and so on.
For instance, if we had this, UX wise instead of displaying events like this:
Started At: 2016-09-01T00:19:32
Rollback At: 2016-09-01T00:19:34
Completed At: 2016-09-01T00:19:36
We could do something like this:
Events:
TIME EVENT ERROR
2016-09-01T00:19:32 Update Started
2016-09-01T00:19:34 Rollback Started Unable to find image "foo"
2016-09-01T00:19:36 Rollback Completed
Yeah, kind of wish we had done this.
I think we can't remove the existing fields. We could add a new repeated field to supersede them, but we'd still have to populate the existing timestamp fields when applicable.
So we could do this, but we'd end up with a weird set of fields in the protobuf because we'd have started_at and completed_at for backward compatibility, and also a repeated field that lists generic events. Do you think it's worth it?
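For illustration, the repeated-events idea being discussed might look roughly like this in the protobuf. Everything here is hypothetical (message names, field names, and field numbers are invented for the sketch), and, as noted, the existing timestamp fields would have to remain for backward compatibility:

```proto
// Hypothetical sketch of the events idea discussed above; not part of this PR.
message UpdateEvent {
  // The state transitioned to, e.g. update started, rollback started,
  // rollback completed.
  UpdateStatus.UpdateState state = 1;
  Timestamp timestamp = 2;
  // Optional detail, e.g. `Unable to find image "foo"`.
  string message = 3;
}

message UpdateStatus {
  // ... existing fields (state, started_at, completed_at, message)
  // kept for backward compatibility ...

  // Full transition history; hypothetical field number.
  repeated UpdateEvent events = 10;
}
```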
@aluzzardi: I removed RollbackStartedAt. PTAL.
Force-pushed 0869263 to dd91e23
Overall LGTM. I have a few points I'd like to make regarding some preferences. I don't have strong arguments for them, except perhaps no. 2: I really do think we shouldn't add more timestamps.
3.a) It's "more" user-predictable: a rollback will look to the user as if nothing happened. Either they'll think the update didn't run, or maybe they won't notice that we rolled back (it's tough to notice the service is still running one image hash rather than another, and that at some point the update was in progress).
3.b) It's more configurable. Going 'back' manually provides the same options as going forward: failure action, parallelism, delay... Maybe if a deployment goes bad you want to go back in lockstep (parallelism = 0) with no delay.
3.c) If we have manual rollback, we can always add auto rollback later. We can't take it away, though.
I'm almost convinced but I have a few questions/comments about manual rollback.
If we decide to go with client-side rollback, I'm wondering about the possibility of leaving the server-side rollback parts of this PR in place so we can continue to experiment with it, but not exposing the feature through engine-api, so we have the option in the future to remove it. What do you think?
Force-pushed dd91e23 to f180c39
Updated PR to mark automatic rollback as experimental. The integration won't expose it for now. It will provide a way to start a rollback manually, instead (
This adds a previous spec to the Service object to support rollback. It also adds new failure actions and update states corresponding to rollback situations. Finally, it adds a threshold determining whether to invoke the failure action. Signed-off-by: Aaron Lehmann <[email protected]>
Signed-off-by: Aaron Lehmann <[email protected]>
Add a unit test Signed-off-by: Aaron Lehmann <[email protected]>
Force-pushed f180c39 to 4bf8072
aluzzardi left a comment:
LGTM
Some (same) comments but I don't want to delay this further:
- I think # of failures might be better than fraction: It's a little more obvious and easier to "introspect" (by looking at the number of tasks that failed). I also assume that 99% of services will have less than 100 replicas and, given a small number, absolute thresholds might be easier to reason about.
- allowed_failure_fraction is not a great name (e.g. --update-allowed-failure-fraction?)
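The fraction-versus-count trade-off raised above can be made concrete with a small sketch. The `allowed_failure_fraction` semantics are taken from this discussion, but the function name and exact comparison are hypothetical. The same fractional setting trips at very different absolute failure counts depending on scale, which is part of why absolute counts can be easier to reason about for small services:

```go
package main

import "fmt"

// exceedsThreshold sketches a fractional failure threshold like the
// allowed_failure_fraction discussed above: the failure action fires
// once the fraction of failed tasks among updated tasks exceeds the
// allowed fraction. Illustrative only.
func exceedsThreshold(failed, updated int, allowedFraction float64) bool {
	if updated == 0 {
		return false // nothing updated yet, nothing to evaluate
	}
	return float64(failed)/float64(updated) > allowedFraction
}

func main() {
	// With a 10% threshold, 1 failure out of 5 updated tasks (20%)
	// already trips the failure action...
	fmt.Println(exceedsThreshold(1, 5, 0.1)) // true
	// ...while 9 failures out of 100 updated tasks (9%) do not.
	fmt.Println(exceedsThreshold(9, 100, 0.1)) // false
}
```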
This adds a previous spec to the Service object to support rollback.
It also adds new failure actions and update states corresponding to
rollback situations. Finally, it adds a threshold determining whether to
invoke the failure action.
I also thought about adding a timeout on task startup, so that if a
task does not reach RUNNING within the given amount of time, it
counts as a failure and is taken into account when evaluating whether
the update is above or below its failure threshold. However, a timeout
on task startup shouldn't be specific to rolling updates, and is
somewhat difficult to handle correctly across all the possible
situations (new tasks, updated tasks, restarted tasks, leader
failover...). I decided to leave this for the future.
cc @dongluochen @stevvooe @aluzzardi