
[doc] update description about the task slot #1741

Merged
aaronlehmann merged 1 commit into moby:master from AkihiroSuda:update-design-doc on Dec 28, 2016

Conversation

@AkihiroSuda
Member

@AkihiroSuda AkihiroSuda commented Nov 10, 2016

This PR updates design/task_model.md to clarify that multiple containers can share a single slot number because of a network partition between nodes.

Note that people may abuse the {{.Task.Slot}} template string without being aware of this behavior. (#1650, moby/moby#28025)

The PR for the "introspection mount" is related to this behavior as well. (moby/moby#26331, #1642)


Below is a repro for the behavior I mentioned in this PR.
IMO we don't need to include this repro in design/task_model.md itself.

Repro

Consider that we have 1 manager node (dm01) and 2 worker nodes (dm02 and dm03).

dm01$ docker node ls
ID                           HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
kohz31yrlbkk9xplcmvyjvhew *  dm01      Ready   Active        Leader
midxgj1vyg6bwmmsa9l27dfkx    dm03      Ready   Active        
zgq4jimlom374s98nspy6flb3    dm02      Ready   Active

Step 1.

Start a service with 3 replicas. We use the {{.Task.Slot}} template string to inject the slot number into each container.

dm01$ docker service create --hostname 'slot-{{.Task.Slot}}' --replicas 3 --name foo busybox top
wt3vzqd5szjytzpwrfr6gf786

Now the container for slot 1 is running on dm01, slot 2 on dm03, and slot 3 on dm02.

dm01$ docker service ps foo
NAME                IMAGE    NODE  DESIRED STATE  CURRENT STATE                   ERROR
foo.1.4tuobs6y6bf1  busybox  dm01  Running        Running 1 second ago            
foo.2.52fda7b7ohpv  busybox  dm03  Running        Running less than a second ago  
foo.3.dbikhhbeo6mp  busybox  dm02  Running        Running 1 second ago
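The `slot-{{.Task.Slot}}` hostname above is expanded with Go's text/template mechanism. A minimal sketch of that expansion, assuming a simplified `taskContext` type (illustrative only, not swarmkit's actual template data structure):

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// taskContext loosely mirrors the data exposed to container-spec templates.
// Only the Slot field is modeled here; the type name is an assumption.
type taskContext struct {
	Task struct{ Slot int }
}

// renderHostname expands a template string such as "slot-{{.Task.Slot}}"
// against a task's slot number, the same templating mechanism that
// `docker service create --hostname` relies on.
func renderHostname(tmpl string, slot int) string {
	var ctx taskContext
	ctx.Task.Slot = slot
	t := template.Must(template.New("hostname").Parse(tmpl))
	var buf bytes.Buffer
	if err := t.Execute(&buf, ctx); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	// Each replica gets its slot number baked into the hostname.
	for slot := 1; slot <= 3; slot++ {
		fmt.Println(renderHostname("slot-{{.Task.Slot}}", slot))
	}
}
```

Note that nothing in this expansion step is aware of duplicate slots: whichever task carries slot 3, old or new, renders the same hostname.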

Step 2.

Split dm02 from dm01, but keep dm02 itself still running.

dm01$ sudo iptables -I INPUT -s $(docker node inspect -f '{{.Status.Addr}}' dm02) -j DROP

Then a new container for slot 3 is automatically created and started on another node, e.g. dm01.

dm01$ docker service ps foo
NAME                IMAGE    NODE  DESIRED STATE  CURRENT STATE               ERROR
foo.1.4tuobs6y6bf1  busybox  dm01  Running        Running 5 minutes ago       
foo.2.52fda7b7ohpv  busybox  dm03  Running        Running 5 minutes ago       
foo.3.l7i6r4a3t0vf  busybox  dm01  Running        Running about a minute ago
dm01$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
ca1ade6ab083        busybox:latest      "top"               10 minutes ago      Up 10 minutes                           foo.3.l7i6r4a3t0vfxqd79xdfe84h0
8554d36a2832        busybox:latest      "top"               13 minutes ago      Up 13 minutes                           foo.1.4tuobs6y6bf1j7ojnjmjzt62d
dm01$ docker exec foo.3.l7i6r4a3t0vfxqd79xdfe84h0 hostname
slot-3

However, the old container for slot 3 is still running on dm02, which is split from dm01.

dm02$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
1347974c00d4        busybox:latest      "top"               4 minutes ago       Up 4 minutes                            foo.3.dbikhhbeo6mplqs2vb3v2v78w
dm02$ docker exec foo.3.dbikhhbeo6mplqs2vb3v2v78w hostname
slot-3

Now we have two containers running simultaneously with the same slot number. 🐧


Signed-off-by: Akihiro Suda [email protected]

@AkihiroSuda
Member Author

cc @aaronlehmann @stevvooe

@codecov-io

codecov-io commented Nov 10, 2016

Current coverage is 55.08% (diff: 100%)

Merging #1741 into master will increase coverage by 0.17%

@@             master      #1741   diff @@
==========================================
  Files           102        102          
  Lines         16930      16930          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           9297       9326    +29   
+ Misses         6467       6441    -26   
+ Partials       1166       1163     -3   


Powered by Codecov. Last update ff342cb...1a7f5b5

@AkihiroSuda
Member Author

Can we plan a new feature that guarantees no two containers can simultaneously have the same slot number?

@aaronlehmann
Collaborator

Change LGTM

> Can we plan a new feature that guarantees no two containers can simultaneously have the same slot number?

That isn't really how our system works. When we update a task, we simultaneously set the old task's DesiredState to Shutdown and create a new task with the same slot number. If the node where the old task is running is slow or unresponsive, the old task could stay in a running state for a while. But there will only be one task per slot number with DesiredState <= Running.
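The invariant described here can be sketched in Go. The types below are simplified stand-ins, not swarmkit's actual Task model (which has many more states and fields):

```go
package main

import "fmt"

// state is a simplified stand-in for swarmkit's TaskState ordering:
// a task with DesiredState <= Running is still meant to be running.
type state int

const (
	Running state = iota
	Shutdown
)

type task struct {
	Slot         int
	DesiredState state
}

// updateSlot models the updater's move: set the old task's DesiredState to
// Shutdown and append a replacement task with the same slot number. The old
// task may keep actually running on a slow or partitioned node, but only one
// task per slot ever has DesiredState <= Running.
func updateSlot(tasks []task, slot int) []task {
	for i := range tasks {
		if tasks[i].Slot == slot && tasks[i].DesiredState == Running {
			tasks[i].DesiredState = Shutdown
		}
	}
	return append(tasks, task{Slot: slot, DesiredState: Running})
}

// desiredRunning counts tasks in a slot whose DesiredState is <= Running.
func desiredRunning(tasks []task, slot int) int {
	n := 0
	for _, t := range tasks {
		if t.Slot == slot && t.DesiredState <= Running {
			n++
		}
	}
	return n
}

func main() {
	tasks := []task{{Slot: 3, DesiredState: Running}}
	tasks = updateSlot(tasks, 3)
	// Two task records exist for slot 3, but only one is desired-running.
	fmt.Println(len(tasks), desiredRunning(tasks, 3))
}
```

The point of the sketch: the duplicate observed in the repro is a divergence between desired state and actual state on a partitioned node, not two tasks both desired to run.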

(review thread on design/task_model.md)

> The updater takes care of making sure that each slot converges to having a
> single running task.
>
> Also, multiple containers can share the single slot number because of a network
Contributor

s/containers/tasks/

Member Author

done

> partition between nodes. If a node is split from manager nodes, the tasks that
> were running on the node will be recreated on another node. However, the task
> containers on the split node can still continue running. So the old task
> containers and the new ones can share identical slot numbers.
Contributor

These tasks may be considered "orphaned" by the manager, after some time. Upon recovering the split, these tasks will be killed.

Member Author

done

@stevvooe
Contributor

> Can we plan a new feature that guarantees no two containers can simultaneously have the same slot number?

Also, this behavior is already provided by task ids.

@AkihiroSuda
Member Author

AkihiroSuda commented Nov 11, 2016

OK.

IIUC, one of the motivations for having the template/introspection feature was to support applications that depend on a "myid", e.g. ZooKeeper (docker service create -e MYID={{.Task.Slot}} --replicas 3 zk, moby/moby#24110).

However, considering that multiple tasks can share an identical slot number, this MYID={{.Task.Slot}} approach is likely to cause a bunch of problems.

An alternative idea is to just create separate services: docker service create --name zk1 -e MYID=1 zk; docker service create --name zk2 -e MYID=2 zk; docker service create --name zk3 -e MYID=3 zk.
But the issue is still unresolved: there can be multiple containers running simultaneously for a certain single service (i.e. multiple containers with slot=1).

Can I hear your opinion about how/whether we can set up such applications on Swarm?

I know k8s supports such applications as "PetSet", but I haven't looked into its design yet.

p.s. Maybe this discussion should be done in another issue 😅

@AkihiroSuda
Member Author

AkihiroSuda commented Nov 11, 2016

A naive solution I can come up with is to make a node kill its own tasks when it is split from the managers.
When the manager detects the split, it defers task creation until some "grace period" passes.
If the split worker couldn't kill the tasks within that grace period, there can still be multiple tasks with the identical slot 😢
And worse, we can't guarantee that the tasks can be killed within the grace period 😢

However, we can tell the living nodes to "drop all the packets from/to the tasks on the split node" during the grace period 😄

@stevvooe
Contributor

@AkihiroSuda Ideally, we want the application to continue working under such failures. Unfortunately, for a single worker node, determining whether there is a split or a temporary lapse in availability is impossible. If we kill the tasks when a worker loses connectivity, we risk bringing down the applications if there is a blip in the orchestrator.

Slots are likely enough to provide quorum bootstrap but the application must double check this work after establishing its own quorum.

@AkihiroSuda
Member Author

@stevvooe

Thank you for the response; I have three questions.

1.

Please let me know if this PR itself is LGTY 😃

2.

> If we kill the tasks when a worker loses connectivity, we risk bringing down the applications if there is a blip in the orchestrator.

If we let workers kill their tasks after some duration (say, 3 or 5 minutes), we can avoid bringing down apps when there is a blip in the orchestrator.
However, the manager cannot guarantee that the tasks are actually killed. Also, the manager cannot recreate the tasks until the duration passes.

WDYT?
I haven't looked into it deeply, but in my experiment, Kubernetes 1.4 seems to do so.
If it SGTY, I'd like to try to implement this approach (after the release of 1.13).

3.

> Slots are likely enough to provide quorum bootstrap but the application must double check this work after establishing its own quorum.

Can you show a concrete example for "double check"?

@stevvooe
Contributor

I think this is okay. I don't have a suggestion for updates here, but it might be better to make it clear that these are implications of the slot design and the minimum guarantees provided by them, rather than making it look like this was a design decision that can be changed.

> If we let workers kill the tasks after some duration (say, 3 minutes or 5 minutes), we can avoid bringing down apps when there is a blip in orchestrator.
> However, the manager cannot guarantee that the tasks are actually killed. Also, the manager cannot recreate the tasks until the duration passes.
>
> WDYT?
> I haven't looked into deeper, but in my experiment, Kubernetes 1.4 seems doing so.
> If it SGTY, I'd like to try to implement the approach (after the release of 1.13)

This seems like we'd be introducing a ticking time bomb for operators to recover from in the case of a quorum loss or other orchestrator downtime. I don't understand in detail what Kubernetes is doing, but I would be largely disappointed if my application's reliability was gated on my orchestrator's availability.

The fact is, there are no guarantees that the rogue tasks are killed, either way, even with tasks that may be marked as such.

@aaronlehmann @aluzzardi PTAL

> Can you show a concrete example for "double check"?

They need to implement their own quorum algorithm to provide the required guarantees. The set of nodes in the quorum need to be consistent for the bootstrap period, which slots provide. If you expect this "slot quorum" to be provided throughout the lifetime of the service, there is going to be disappointment.

@AkihiroSuda
Member Author

OK, added the words "in the current implementation" in the last commit:

diff --git a/design/task_model.md b/design/task_model.md
index f2c1967..5509f65 100644
--- a/design/task_model.md
+++ b/design/task_model.md
@@ -169,13 +169,13 @@ satisfied by at least one running task, not the detailed makeup of those slots.
 The updater takes care of making sure that each slot converges to having a
 single running task.
 
-Also, multiple tasks can share the single slot number because of a network
-partition between nodes. If a node is split from manager nodes, the tasks that
-were running on the node will be recreated on another node.  However, the tasks
-on the split node can still continue running. So the old tasks and the new ones
-can share identical slot numbers. These tasks may be considered "orphaned" by
-the manager, after some time. Upon recovering the split, these tasks will be
-killed.
+Also, in the current implementation, multiple tasks can share the single slot
+number because of a network partition between nodes. If a node is split from
+manager nodes, the tasks that were running on the node will be recreated on
+another node.  However, the tasks on the split node can still continue
+running. So the old tasks and the new ones can share identical slot
+numbers. These tasks may be considered "orphaned" by the manager, after some
+time. Upon recovering the split, these tasks will be killed.
 
 Global tasks do not have slot numbers, but the concept is similar. Each node in
 the system should have a single running task associated with it. If this is not

Please let me know if this is OK.

@stevvooe
Contributor

@AkihiroSuda After reading back through this change and my commentary, I think I missed something key here. I am not sure if adding "in the current implementation" gets at the issue. Having this paragraph implies that this is a solvable limitation.

We have traded off slot-consistency for application availability, in the CAP sense. This is because we could have much better slot-consistency if we killed tasks that lose manager contact, at the cost of application availability. Basically, adding something to the effect of "we have sacrificed slot consistency for application availability" may be enough to get this into shape.

Sorry for re-hashing this so much.

This PR updates `design/task_model.md` for clarifying that multiple containers
can share the single slot number because of a network partition between nodes.

Signed-off-by: Akihiro Suda <[email protected]>
@AkihiroSuda
Member Author

OK, updated the PR:

diff --git a/design/task_model.md b/design/task_model.md
index 5509f65..106e77e 100644
--- a/design/task_model.md
+++ b/design/task_model.md
@@ -169,8 +169,8 @@ satisfied by at least one running task, not the detailed makeup of those slots.
 The updater takes care of making sure that each slot converges to having a
 single running task.
 
-Also, in the current implementation, multiple tasks can share the single slot
-number because of a network partition between nodes. If a node is split from
+Also, for application availability, multiple tasks can share the single slot
+number when a network partition occurs between nodes. If a node is split from
 manager nodes, the tasks that were running on the node will be recreated on
 another node.  However, the tasks on the split node can still continue
 running. So the old tasks and the new ones can share identical slot

@AkihiroSuda
Member Author

Linking the similar issue #1743 to this PR.

@stevvooe
Contributor

stevvooe commented Dec 8, 2016

LGTM

@AkihiroSuda
Member Author

@aaronlehmann PTAL?

@aaronlehmann
Collaborator

LGTM, sorry for missing this one.
