
[doc] update description about the task slot #1741

Merged
aaronlehmann merged 1 commit into moby:master from AkihiroSuda:update-design-doc on Dec 28, 2016

Conversation

@AkihiroSuda
Member

@AkihiroSuda AkihiroSuda commented Nov 10, 2016

This PR updates design/task_model.md to clarify that multiple containers can share a single slot number because of a network partition between nodes.

Note that people may abuse the {{.Task.Slot}} template string without being aware of this behavior. (#1650, moby/moby#28025)

The PR for the "introspection mount" is related to this behavior as well. (moby/moby#26331, #1642)


Below is a repro for the behavior I mentioned in this PR.
IMO we don't need to include this repro in design/task_model.md itself.

Repro

Consider that we have 1 manager node (dm01) and 2 worker nodes (dm02 and dm03).

dm01$ docker node ls
ID                           HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
kohz31yrlbkk9xplcmvyjvhew *  dm01      Ready   Active        Leader
midxgj1vyg6bwmmsa9l27dfkx    dm03      Ready   Active        
zgq4jimlom374s98nspy6flb3    dm02      Ready   Active

Step 1.

Start a service with 3 replicas. We use the {{.Task.Slot}} template string to inject the slot number into each container.

dm01$ docker service create --hostname 'slot-{{.Task.Slot}}' --replicas 3 --name foo busybox top
wt3vzqd5szjytzpwrfr6gf786

Now the container for slot 1 is running on dm01, slot 2 on dm03, and slot 3 on dm02.

dm01$ docker service ps foo
NAME                IMAGE    NODE  DESIRED STATE  CURRENT STATE                   ERROR
foo.1.4tuobs6y6bf1  busybox  dm01  Running        Running 1 second ago            
foo.2.52fda7b7ohpv  busybox  dm03  Running        Running less than a second ago  
foo.3.dbikhhbeo6mp  busybox  dm02  Running        Running 1 second ago
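The `slot-{{.Task.Slot}}` hostname above is expanded with Go's text/template mechanism. A minimal sketch of that expansion, assuming a simplified `taskContext` type (illustrative only, not swarmkit's actual template data structure):

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// taskContext loosely mirrors the data exposed to container-spec templates.
// Only the Slot field is modeled here; the type name is an assumption.
type taskContext struct {
	Task struct{ Slot int }
}

// renderHostname expands a template string such as "slot-{{.Task.Slot}}"
// against a task's slot number, the same templating mechanism that
// `docker service create --hostname` relies on.
func renderHostname(tmpl string, slot int) string {
	var ctx taskContext
	ctx.Task.Slot = slot
	t := template.Must(template.New("hostname").Parse(tmpl))
	var buf bytes.Buffer
	if err := t.Execute(&buf, ctx); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	// Each replica gets its slot number baked into the hostname.
	for slot := 1; slot <= 3; slot++ {
		fmt.Println(renderHostname("slot-{{.Task.Slot}}", slot))
	}
}
```

Note that nothing in this expansion step is aware of duplicate slots: whichever task carries slot 3, old or new, renders the same hostname.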

Step 2.

Split dm02 from dm01, but keep dm02 itself still running.

dm01$ sudo iptables -I INPUT -s $(docker node inspect -f '{{.Status.Addr}}' dm02) -j DROP

Then a new container for slot 3 is automatically created and started on another node, e.g. dm01.

dm01$ docker service ps foo
NAME                IMAGE    NODE  DESIRED STATE  CURRENT STATE               ERROR
foo.1.4tuobs6y6bf1  busybox  dm01  Running        Running 5 minutes ago       
foo.2.52fda7b7ohpv  busybox  dm03  Running        Running 5 minutes ago       
foo.3.l7i6r4a3t0vf  busybox  dm01  Running        Running about a minute ago
dm01$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
ca1ade6ab083        busybox:latest      "top"               10 minutes ago      Up 10 minutes                           foo.3.l7i6r4a3t0vfxqd79xdfe84h0
8554d36a2832        busybox:latest      "top"               13 minutes ago      Up 13 minutes                           foo.1.4tuobs6y6bf1j7ojnjmjzt62d
dm01$ docker exec foo.3.l7i6r4a3t0vfxqd79xdfe84h0 hostname
slot-3

However, the old container for slot 3 is still running on dm02, which is split from dm01.

dm02$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
1347974c00d4        busybox:latest      "top"               4 minutes ago       Up 4 minutes                            foo.3.dbikhhbeo6mplqs2vb3v2v78w
dm02$ docker exec foo.3.dbikhhbeo6mplqs2vb3v2v78w hostname
slot-3

Now we have two containers running simultaneously with the same slot number. 🐧


Signed-off-by: Akihiro Suda [email protected]

@AkihiroSuda
Member Author

cc @aaronlehmann @stevvooe

@codecov-io

codecov-io commented Nov 10, 2016

Current coverage is 55.08% (diff: 100%)

Merging #1741 into master will increase coverage by 0.17%

@@             master      #1741   diff @@
==========================================
  Files           102        102          
  Lines         16930      16930          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           9297       9326    +29   
+ Misses         6467       6441    -26   
+ Partials       1166       1163     -3   


Powered by Codecov. Last update ff342cb...1a7f5b5

@AkihiroSuda
Member Author

Can we plan a new feature that guarantees no two containers can simultaneously have the same slot number?

@aaronlehmann
Collaborator

Change LGTM

> Can we plan a new feature that guarantees no two containers can simultaneously have the same slot number?

That isn't really how our system works. When we update a task, we simultaneously set the old task's DesiredState to Shutdown and create a new task with the same slot number. If the node where the old task is running is slow or unresponsive, the old task could stay in a running state for a while. But there will only be one task per slot number with DesiredState <= Running.
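The invariant described here can be sketched in Go. The types below are simplified stand-ins, not swarmkit's actual Task model (which has many more states and fields):

```go
package main

import "fmt"

// state is a simplified stand-in for swarmkit's TaskState ordering:
// a task with DesiredState <= Running is still meant to be running.
type state int

const (
	Running state = iota
	Shutdown
)

type task struct {
	Slot         int
	DesiredState state
}

// updateSlot models the updater's move: set the old task's DesiredState to
// Shutdown and append a replacement task with the same slot number. The old
// task may keep actually running on a slow or partitioned node, but only one
// task per slot ever has DesiredState <= Running.
func updateSlot(tasks []task, slot int) []task {
	for i := range tasks {
		if tasks[i].Slot == slot && tasks[i].DesiredState == Running {
			tasks[i].DesiredState = Shutdown
		}
	}
	return append(tasks, task{Slot: slot, DesiredState: Running})
}

// desiredRunning counts tasks in a slot whose DesiredState is <= Running.
func desiredRunning(tasks []task, slot int) int {
	n := 0
	for _, t := range tasks {
		if t.Slot == slot && t.DesiredState <= Running {
			n++
		}
	}
	return n
}

func main() {
	tasks := []task{{Slot: 3, DesiredState: Running}}
	tasks = updateSlot(tasks, 3)
	// Two task records exist for slot 3, but only one is desired-running.
	fmt.Println(len(tasks), desiredRunning(tasks, 3))
}
```

The point of the sketch: the duplicate observed in the repro is a divergence between desired state and actual state on a partitioned node, not two tasks both desired to run.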

(review thread on design/task_model.md)

> The updater takes care of making sure that each slot converges to having a
> single running task.
>
> Also, multiple containers can share the single slot number because of a network
Contributor

s/containers/tasks/

Member Author

done

> partition between nodes. If a node is split from manager nodes, the tasks that
> were running on the node will be recreated on another node. However, the task
> containers on the split node can still continue running. So the old task
> containers and the new ones can share identical slot numbers.
Contributor

These tasks may be considered "orphaned" by the manager, after some time. Upon recovering the split, these tasks will be killed.

Member Author

done

@stevvooe
Contributor

> Can we plan a new feature that guarantees no two containers can simultaneously have the same slot number?

Also, this behavior is already provided by task ids.

@AkihiroSuda
Member Author

AkihiroSuda commented Nov 11, 2016

OK.

IIUC, one of the motivations for having the template/introspection feature was to support applications that depend on a "myid", e.g. ZooKeeper (docker service create -e MYID={{.Task.Slot}} --replicas 3 zk, moby/moby#24110).

However, considering that multiple tasks can share an identical slot number, this MYID={{.Task.Slot}} approach is likely to cause a bunch of problems.

An alternative idea is to just create separate services: docker service create --name zk1 -e MYID=1 zk; docker service create --name zk2 -e MYID=2 zk; docker service create --name zk3 -e MYID=3 zk.
But the issue is still unresolved: there can be multiple containers running simultaneously for a certain single service (i.e. multiple containers with slot=1).

Can I hear your opinion about how/whether we can set up such applications on Swarm?

I know k8s supports such applications as "PetSet", but I haven't looked into its design yet.

p.s. Maybe this discussion should be done in another issue 😅

@AkihiroSuda
Member Author

AkihiroSuda commented Nov 11, 2016

A naive solution I can come up with is to make a node kill its own tasks when it is split from the managers.
When the manager detects the split, it defers task creation until some "grace period" passes.
If the split worker couldn't kill the tasks within that grace period, there can still be multiple tasks with the identical slot 😢
And worse, we can't guarantee that the tasks can be killed within the grace period 😢

However, we can tell the living nodes to "drop all the packets from/to the tasks on the split node" during the grace period 😄

@stevvooe
Contributor

@AkihiroSuda Ideally, we want the application to continue working under such failures. Unfortunately, for a single worker node, determining whether there is a split or a temporary lapse in availability is impossible. If we kill the tasks when a worker loses connectivity, we risk bringing down the applications if there is a blip in the orchestrator.

Slots are likely enough to provide quorum bootstrap but the application must double check this work after establishing its own quorum.

@AkihiroSuda
Member Author

@stevvooe

Thank you for the response; I have three questions.

1.

Please let me know if this PR itself is LGTY 😃

2.

> If we kill the tasks when a worker loses connectivity, we risk bringing down the applications if there is a blip in the orchestrator.

If we let workers kill their tasks after some duration (say, 3 or 5 minutes), we can avoid bringing down apps when there is a blip in the orchestrator.
However, the manager cannot guarantee that the tasks are actually killed. Also, the manager cannot recreate the tasks until the duration passes.

WDYT?
I haven't looked into it deeply, but in my experiment, Kubernetes 1.4 seems to do so.
If it SGTY, I'd like to try to implement this approach (after the release of 1.13).

3.

> Slots are likely enough to provide quorum bootstrap but the application must double check this work after establishing its own quorum.

Can you show a concrete example for "double check"?

@stevvooe
Contributor

I think this is okay. I don't have a suggestion for updates here, but it might be better to make it clear that these are implications of the slot design and the minimum guarantees provided by them, rather than making it look like this was a design decision that can be changed.

> If we let workers kill the tasks after some duration (say, 3 minutes or 5 minutes), we can avoid bringing down apps when there is a blip in orchestrator.
> However, the manager cannot guarantee that the tasks are actually killed. Also, the manager cannot recreate the tasks until the duration passes.
>
> WDYT?
> I haven't looked into deeper, but in my experiment, Kubernetes 1.4 seems doing so.
> If it SGTY, I'd like to try to implement the approach (after the release of 1.13)

This seems like we'd be introducing a ticking time bomb for operators to recover from in the case of a quorum loss or other orchestrator downtime. I don't understand in detail what Kubernetes is doing, but I would be largely disappointed if my application's reliability was gated on my orchestrator's availability.

The fact is, there are no guarantees that the rogue tasks are killed, either way, even with tasks that may be marked as such.

@aaronlehmann @aluzzardi PTAL

> Can you show a concrete example for "double check"?

They need to implement their own quorum algorithm to provide the required guarantees. The set of nodes in the quorum need to be consistent for the bootstrap period, which slots provide. If you expect this "slot quorum" to be provided throughout the lifetime of the service, there is going to be disappointment.

@AkihiroSuda
Member Author

OK, added the words "in the current implementation" in the last commit:

diff --git a/design/task_model.md b/design/task_model.md
index f2c1967..5509f65 100644
--- a/design/task_model.md
+++ b/design/task_model.md
@@ -169,13 +169,13 @@ satisfied by at least one running task, not the detailed makeup of those slots.
 The updater takes care of making sure that each slot converges to having a
 single running task.
 
-Also, multiple tasks can share the single slot number because of a network
-partition between nodes. If a node is split from manager nodes, the tasks that
-were running on the node will be recreated on another node.  However, the tasks
-on the split node can still continue running. So the old tasks and the new ones
-can share identical slot numbers. These tasks may be considered "orphaned" by
-the manager, after some time. Upon recovering the split, these tasks will be
-killed.
+Also, in the current implementation, multiple tasks can share the single slot
+number because of a network partition between nodes. If a node is split from
+manager nodes, the tasks that were running on the node will be recreated on
+another node.  However, the tasks on the split node can still continue
+running. So the old tasks and the new ones can share identical slot
+numbers. These tasks may be considered "orphaned" by the manager, after some
+time. Upon recovering the split, these tasks will be killed.
 
 Global tasks do not have slot numbers, but the concept is similar. Each node in
 the system should have a single running task associated with it. If this is not

Please let me know if this is OK.

@stevvooe
Contributor

@AkihiroSuda After reading back through this change and my commentary, I think I missed something key here. I am not sure if adding "in the current implementation" gets at the issue. Having this paragraph implies that this is a solvable limitation.

We have traded off slot-consistency for application availability, in the CAP sense. This is because we could have much better slot-consistency if we killed tasks that lose manager contact, at the cost of application availability. Basically, adding something to the effect of "we have sacrificed slot consistency for application availability" may be enough to get this into shape.

Sorry for re-hashing this so much.

This PR updates `design/task_model.md` for clarifying that multiple containers
can share the single slot number because of a network partition between nodes.

Signed-off-by: Akihiro Suda <[email protected]>
@AkihiroSuda
Member Author

OK, updated the PR:

diff --git a/design/task_model.md b/design/task_model.md
index 5509f65..106e77e 100644
--- a/design/task_model.md
+++ b/design/task_model.md
@@ -169,8 +169,8 @@ satisfied by at least one running task, not the detailed makeup of those slots.
 The updater takes care of making sure that each slot converges to having a
 single running task.
 
-Also, in the current implementation, multiple tasks can share the single slot
-number because of a network partition between nodes. If a node is split from
+Also, for application availability, multiple tasks can share the single slot
+number when a network partition occurs between nodes. If a node is split from
 manager nodes, the tasks that were running on the node will be recreated on
 another node.  However, the tasks on the split node can still continue
 running. So the old tasks and the new ones can share identical slot

@AkihiroSuda
Member Author

Linking the similar issue #1743 to this PR.

@stevvooe
Contributor

stevvooe commented Dec 8, 2016

LGTM

@AkihiroSuda
Member Author

@aaronlehmann PTAL?

@aaronlehmann
Collaborator

LGTM, sorry for missing this one.
