[doc] update description about the task slot #1741
Conversation
Current coverage is 55.08% (diff: 100%)

```diff
@@            master    #1741    diff @@
========================================
  Files          102      102
  Lines        16930    16930
  Methods          0        0
  Messages         0        0
  Branches         0        0
========================================
+ Hits          9297     9326     +29
+ Misses        6467     6441     -26
+ Partials      1166     1163      -3
```
Can we plan a new feature that guarantees no two containers can simultaneously have the same slot number?
Change LGTM
That isn't really how our system works. When we update a task, we simultaneously set the old task's …
design/task_model.md (outdated):

> The updater takes care of making sure that each slot converges to having a
> single running task.
>
> Also, multiple containers can share the single slot number because of a network
> partition between nodes. If a node is split from manager nodes, the tasks that
> were running on the node will be recreated on another node. However, the task
> containers on the split node can still continue running. So the old task
> containers and the new ones can share identical slot numbers.
These tasks may be considered "orphaned" by the manager, after some time. Upon recovering the split, these tasks will be killed.
Also, this behavior is already provided by task ids.
Force-pushed 39aa2ba to 8de4bb3
OK. IIUC, one of the motivations for having the template/introspection feature was to support applications that depend on "myid", e.g. ZooKeeper. However, considering that multiple tasks can share an identical slot number, this seems problematic.

An alternative idea is to just create separate services.

Can I hear your opinion about how/whether we can set up such applications on Swarm? I know k8s supports such applications as "petset", but I haven't looked into its design yet.

p.s. Maybe this discussion should be done in another issue 😅
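To illustrate the concern above, here is a minimal sketch (not code from this repo; the task/slot records are hypothetical) of why a ZooKeeper-style `myid` derived from `{{.Task.Slot}}` stops being unique once a partition leaves two running tasks on the same slot:

```python
# Hypothetical sketch: deriving a ZooKeeper-style "myid" from the task slot.
# If a partition leaves two running tasks with the same slot, the derived
# IDs collide, breaking the uniqueness that ZooKeeper requires.

def myids(tasks):
    """Map each running task's unique ID to a myid derived from its slot."""
    return {t["id"]: t["slot"] for t in tasks if t["state"] == "running"}

def has_collision(tasks):
    ids = myids(tasks)
    return len(set(ids.values())) < len(ids)

# Normal case: slots 1..3, one running task each -> unique myids.
healthy = [
    {"id": "task-a", "slot": 1, "state": "running"},
    {"id": "task-b", "slot": 2, "state": "running"},
    {"id": "task-c", "slot": 3, "state": "running"},
]

# After a partition: slot 3 exists twice (the old task on the split node
# plus the recreated task elsewhere) -> duplicate myid 3.
partitioned = healthy + [{"id": "task-d", "slot": 3, "state": "running"}]

assert not has_collision(healthy)
assert has_collision(partitioned)
```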
A naive solution I can come up with is to make a node kill its own tasks when it is split from the managers. Alternatively, we can tell living nodes to "drop all the packets from/to the tasks on the split node" during the grace period 😄
@AkihiroSuda Ideally, we want the application to continue working under such failures. Unfortunately, for a single worker node, it is impossible to determine whether there is a split or a temporary lapse in availability. If we kill the tasks when a worker loses connectivity, we risk bringing down the applications if there is a blip in the orchestrator. Slots are likely enough to provide quorum bootstrap, but the application must double-check this work after establishing its own quorum.
Thank you for the response, and three questions:

1. Please let me know if this PR itself is LGTY 😃
2. If we let workers kill the tasks after some duration (say, 3 minutes or 5 minutes), we can avoid bringing down apps when there is a blip in the orchestrator. WDYT?
3. Can you show a concrete example of "double check"?
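The grace-period idea in question 2 can be sketched as follows (illustration only; this is not implemented in SwarmKit, and the names and the 5-minute value are assumptions from the comment above):

```python
# Hypothetical sketch of the proposed worker-side grace period: a worker
# kills its own tasks only after it has been unable to reach any manager
# for longer than GRACE_PERIOD, so a short orchestrator blip does not
# bring the application down.

GRACE_PERIOD = 300.0  # seconds; e.g. the 5 minutes suggested above

def should_kill_tasks(last_manager_contact, now):
    """Return True once the disconnection has outlasted the grace period."""
    return (now - last_manager_contact) > GRACE_PERIOD

# A one-minute blip: keep the tasks running.
assert not should_kill_tasks(last_manager_contact=1000.0, now=1060.0)

# A 400-second split: the worker gives up and kills its tasks.
assert should_kill_tasks(last_manager_contact=1000.0, now=1400.0)
```

The trade-off raised below still applies: a worker cannot distinguish a real split from orchestrator downtime, so any such timer trades application availability for slot consistency.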
I think this is okay. I don't have a suggestion for updates here, but it might be better to make it clear that these are implications of the slot design and the minimum guarantees provided by them, rather than making it look like this was a design decision that can be changed.
This seems like we'd be introducing a ticking time bomb for operators to recover from in the case of a quorum loss or other orchestrator downtime. I don't understand in detail what Kubernetes is doing, but I would be largely disappointed if my application reliability was gated on my orchestrators' liability. The fact is, there are no guarantees that the rogue tasks are killed either way, even with tasks that may be marked as such. @aaronlehmann @aluzzardi PTAL
They need to implement their own quorum algorithm to provide the required guarantees. The set of nodes in the quorum needs to be consistent for the bootstrap period, which slots provide. If you expect this "slot quorum" to be provided throughout the lifetime of the service, there is going to be disappointment.
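One way to read the "double check" above (my own sketch, not an API from this repo): after bootstrapping its membership from slot numbers, the application verifies that each slot is claimed by exactly one peer, using the globally unique task IDs to tell duplicates apart:

```python
# Hypothetical sketch of an application-side "double check": slot numbers
# seed the bootstrap membership, but before trusting the quorum, the app
# confirms that no slot is claimed by two distinct tasks (task IDs are
# unique even when slot numbers collide during a partition).

from collections import Counter

def slots_consistent(peers):
    """peers: list of (task_id, slot) pairs. True iff every slot is claimed once."""
    counts = Counter(slot for _task_id, slot in peers)
    return all(n == 1 for n in counts.values())

# Clean bootstrap: one peer per slot.
assert slots_consistent([("task-a", 1), ("task-b", 2), ("task-c", 3)])

# Partitioned bootstrap: slot 3 claimed by two different tasks -> reject.
assert not slots_consistent([("task-a", 1), ("task-b", 3), ("task-d", 3)])
```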
Force-pushed 8de4bb3 to 4b8ec62
OK, added the words "in the current implementation" in the last commit:

```diff
diff --git a/design/task_model.md b/design/task_model.md
index f2c1967..5509f65 100644
--- a/design/task_model.md
+++ b/design/task_model.md
@@ -169,13 +169,13 @@ satisfied by at least one running task, not the detailed makeup of those slots.
 The updater takes care of making sure that each slot converges to having a
 single running task.
 
-Also, multiple tasks can share the single slot number because of a network
-partition between nodes. If a node is split from manager nodes, the tasks that
-were running on the node will be recreated on another node. However, the tasks
-on the split node can still continue running. So the old tasks and the new ones
-can share identical slot numbers. These tasks may be considered "orphaned" by
-the manager, after some time. Upon recovering the split, these tasks will be
-killed.
+Also, in the current implementation, multiple tasks can share the single slot
+number because of a network partition between nodes. If a node is split from
+manager nodes, the tasks that were running on the node will be recreated on
+another node. However, the tasks on the split node can still continue
+running. So the old tasks and the new ones can share identical slot
+numbers. These tasks may be considered "orphaned" by the manager, after some
+time. Upon recovering the split, these tasks will be killed.
 
 Global tasks do not have slot numbers, but the concept is similar. Each node in
 the system should have a single running task associated with it. If this is not
```

Please let me know if this is OK
@AkihiroSuda After reading back through this change and my commentary, I think I missed something key here. I am not sure that adding "in the current implementation" gets at the issue. Having this paragraph implies that this is a solvable limitation. We have traded off slot consistency for application availability, in the CAP sense: we could have much better slot consistency if we killed tasks that lose manager contact, at the cost of application availability. Basically, adding something to the effect of "we have sacrificed slot consistency for application availability" may be enough to get this into shape. Sorry for re-hashing this so much.
This PR updates `design/task_model.md` to clarify that multiple containers can share the single slot number because of a network partition between nodes. Signed-off-by: Akihiro Suda <[email protected]>
Force-pushed 4b8ec62 to 1a7f5b5
OK, updated the PR:

```diff
diff --git a/design/task_model.md b/design/task_model.md
index 5509f65..106e77e 100644
--- a/design/task_model.md
+++ b/design/task_model.md
@@ -169,8 +169,8 @@ satisfied by at least one running task, not the detailed makeup of those slots.
 The updater takes care of making sure that each slot converges to having a
 single running task.
 
-Also, in the current implementation, multiple tasks can share the single slot
-number because of a network partition between nodes. If a node is split from
+Also, for application availability, multiple tasks can share the single slot
+number when a network partition occurs between nodes. If a node is split from
 manager nodes, the tasks that were running on the node will be recreated on
```
Linking similar issue #1743 to this PR.
LGTM |
@aaronlehmann PTAL? |
LGTM, sorry for missing this one. |
This PR updates `design/task_model.md` to clarify that multiple containers can share the single slot number because of a network partition between nodes.

Note that people may abuse the `{{.Task.Slot}}` template string without knowledge about this behavior. (#1650, moby/moby#28025)

The PR for the "introspection mount" is related to this behavior as well. (moby/moby#26331, #1642)

Below is the repro for the behavior I mentioned in this PR. IMO we don't need to include this repro in `design/task_model.md` itself.

Repro:

Consider that we have 1 manager node (`dm01`) and 2 worker nodes (`dm02` and `dm03`).

Step 1. Start a service with 3 replicas. We use the `{{.Task.Slot}}` template string for injecting the slot numbers into containers. Consider that now the container for slot 1 is running on `dm01`, 2 is on `dm03`, and 3 is on `dm02`.

Step 2. Split `dm02` from `dm01`, but keep `dm02` itself still running.

```shell
dm01$ sudo iptables -I INPUT -s $(docker node inspect -f '{{.Status.Addr}}' dm02) -j DROP
```

Then, a new container for slot 3 will be automatically created and started on another node, e.g. `dm01`. However, the old container for slot 3 is still running on `dm02`, which is split from `dm01`. Now we have two containers running simultaneously with the common slot number. 🐧
Signed-off-by: Akihiro Suda [email protected]