Platform Filter for Service Scheduling by nishanttotla · Pull Request #1981 · moby/swarmkit

nishanttotla · 2017-02-23T21:12:51Z

This PR adds the ability to automatically schedule service tasks based on platform constraints. It's part of a larger change, where platform constraints would be passed as part of the service spec, and used during scheduling to avoid running tasks on nodes that don't support them.

What this PR does

Adds a Platforms field to TaskSpec.Placement. This is a list of all platforms that the container can be scheduled on.
Adds a PlatformFilter struct, that gets activated if TaskSpec.Placement.Platforms is non-empty, and picks a node for a container if node.Description.Platform is in TaskSpec.Placement.Platforms.

Where does the platform information come from?

Some images have fat manifests, that include references to platform-specific manifests (an example of such an image is friism/golang). The Docker daemon already pins service images by digest, and can in addition retrieve platform information from the fat manifest (if it exists) and populate the TaskSpec.Placement.Platforms field. If the field remains empty, then the platform filter is not enabled.

See moby/moby#31348 for broader context.

cc @aaronlehmann @aluzzardi @dongluochen @friism

codecov-io · 2017-02-23T21:21:49Z

Codecov Report

Merging #1981 into master will decrease coverage by 0.15%.
The diff coverage is 78.94%.

@@            Coverage Diff             @@
##           master    #1981      +/-   ##
==========================================
- Coverage   60.03%   59.87%   -0.16%     
==========================================
  Files         119      119              
  Lines       19670    19684      +14     
==========================================
- Hits        11808    11786      -22     
- Misses       6513     6550      +37     
+ Partials     1349     1348       -1

aaronlehmann · 2017-02-23T22:13:28Z

manager/scheduler/filter.go

+			// copy the platform information
+			for _, p := range c.SupportedPlatforms {
+				f.supportedPlatforms = append(f.supportedPlatforms, p)
+			}


I don't see anywhere that f.supportedPlatforms is cleared, so I think this slice will just grow every time a task is scheduled, and the filter won't work as expected.

Also note that you can do append(f.supportedPlatforms, c.SupportedPlatforms...) instead of using a loop.

But wouldn't it be even better to just do f.supportedPlatforms = c.SupportedPlatforms?

@aaronlehmann updated to remove the for loop. With using f.supportedPlatforms = c.SupportedPlatforms, we don't need to reset it anymore, right?

aaronlehmann · 2017-02-23T22:26:07Z

Adds a SupportedPlatforms field to ContainerSpec. This is a list of all platforms that the container can be scheduled on.
Adds a PlatformFilter struct, that gets activated if SupportedPlatforms is non-empty, and picks a node for a container if node.Description.Platform is in SupportedPlatforms.

The general approach seems sound. I don't think SupportedPlatforms should be inside ContainerSpec, though. We generally don't modify any specs inside the daemon. We made an exception for digest pinning, sort of as a hack, because reusing the existing Image field was simpler than the alternatives, and the behavior on updates was well-defined and well-behaved. But since SupportedPlatforms is a new addition, I don't think putting it in the spec buys us anything. I guess we'd have to manually copy this information to tasks, but I think that's worth it.

One thing that needs to be thought through before we move forward with this is how updates are handled. Basically, we need to define when the manager will retrieve the manifest and update this platform information. I think doing this on every update is a non-starter, because often updates are changing something totally unrelated to the image (i.e. --update-parallelism), and don't need to incur this round-trip.

With digest pinning, the logic is relatively simple: if there's no digest, we add one, but if there's already a digest, nothing needs to be done.

I think that for platform support, we need to define a mechanism that avoids unnecessary round trips to the registry when we update an image but keep the image/digest the same. My thought was that SupportedPlatforms could contain a Source field that indicates where the information came from. If the value of Source matches the current Image field in ContainerSpec, that means the platform information is up to date and it's not necessary to download the manifest. If Source does not match Image, then it needs to be updated.

dongluochen · 2017-02-23T22:37:18Z

Isn't SupportPlatforms part of image spec? If image hasn't changed, no need to inspect this info.

dongluochen · 2017-02-23T22:39:22Z

manager/scheduler/filter.go

+	// check if the platform for the node is supported
+	nodePlatform := n.Description.Platform
+	for _, p := range f.supportedPlatforms {
+		if p.Architecture == nodePlatform.Architecture && p.OS == nodePlatform.OS {


I'm not sure about the OS matching. From node, the OS is reported to daemon. The image spec on OS might be just "Linux". I don't know if we can compare them directly.

@dongluochen in the examples I've seen, the common ones include Architecture = "amd64" and OS = "linux" or OS = "windows". This seemed like the common sense comparison, but perhaps @friism can add more insight here.

The node info from docker info contains Operating System and OSType. I think currently we are collecting only one of them in NodeInfo. I think both are useful.

dchen@vm2:~$ docker info | egrep "Operating|OS" Operating System: Ubuntu 14.04.3 LTS OSType: linux

@dongluochen I can't remember why, but I think it was a conscious decision to not collect Operating System info reported by the node. Either way, I don't fully understand what constraints are sufficient to run multi-platform images, so I'm happy to make necessary changes.

@stevvooe need your comments here. What is a sufficient comparison for Platforms?

nishanttotla · 2017-02-23T23:30:46Z

The general approach seems sound. I don't think SupportedPlatforms should be inside ContainerSpec, though. We generally don't modify any specs inside the daemon. We made an exception for digest pinning, sort of as a hack, because reusing the existing Image field was simpler than the alternatives, and the behavior on updates was well-defined and well-behaved. But since SupportedPlatforms is a new addition, I don't think putting it in the spec buys us anything. I guess we'd have to manually copy this information to tasks, but I think that's worth it.

I agree with this reasoning, but didn't know a better way. The reason I put it in ContainerSpec was that it's a container-specific field, assuming that tasks are more generic workloads that don't necessarily have platform information the way Docker specifies it. I don't understand "manually copy this information to tasks", can you clarify?

One thing that needs to be thought through before we move forward with this is how updates are handled. Basically, we need to define when the manager will retrieve the manifest and update this platform information. I think doing this on every update is a non-starter, because often updates are changing something totally unrelated to the image (i.e. --update-parallelism), and don't need to incur this round-trip.

Good point. Can we agree that whenever we're contacting the registry to retrieve digest, we don't mind pulling platform information also?

In cases where the --image is not supplied with service update, we don't need to do anything, because we just assume that the image and all its platform-related requirements stay the same.

The interesting case is when the digest is provided (say by the user). In this case, we do know at the manager level whether the digest of the new service spec is different from the existing one. If it isn't different, we don't need to do anything (this is probably the same case as above). If there is a change, then we want to contact the registry to get platform information. The only difference is that before, the user could update the service by providing a new digest, and skip the registry lookup, whereas now it would entail the registry lookup.

To summarize, I don't see why we cannot avoid registry lookup when the service is updated but the image is not.

aaronlehmann · 2017-02-24T00:06:41Z

I don't understand "manually copy this information to tasks", can you clarify?

I assume we want tasks to contain a list of supported platforms, so the scheduler can easily check against nodes. The TaskSpec (including ContainerSpec) gets automatically copied to tasks associated with the service, but something outside the spec would have to be copied separately by the orchestrator.

In this case, we do know at the manager level whether the digest of the new service spec is different from the existing one. If it isn't different, we don't need to do anything (this is probably the same case as above). If there is a change, then we want to contact the registry to get platform information.

This seems reasonable.

nishanttotla · 2017-02-24T22:38:24Z

@aaronlehmann I've moved the SupportedPlatforms field to the Task object, taking it out of ContainerSpec. I still need to change add code for copying the field in the orchestrator.

In principle (assuming SwarmKit would support other workloads beyond containers in the future), do you think it's reasonable to attach platform restrictions to tasks?

aluzzardi · 2017-02-25T01:03:29Z

@aaronlehmann To have that in the Task rather than the TaskSpec is weird though.

Task contains more like runtime status data while the spec contains configuration and here I think this is configuration

For instance, changing the supported list of platforms should trigger a rolling update IMO

aaronlehmann · 2017-02-25T01:08:51Z

To have that in the Task rather than the TaskSpec is weird though.

Hmm, I disagree. TaskSpec is declarative and SupportedPlatforms is something derived from other information.

For instance, changing the supported list of platforms should trigger a rolling update IMO

Since SupportedPlatforms is derived information, it will never be changed directly. A change in the information that SupportedPlatforms is derived from (such as the image digest) can trigger a rolling update.

aluzzardi · 2017-02-25T01:29:40Z

SupportedPlatforms is not derived by SwarmKit. Either SwarmKit derives something internally and puts it into an attribute, or someone else does some computation and instructs SwarmKit to do stuff using the Spec.

Like - if we do put this into Task rather than TaskSpec, how's the engine going to pass it to us since all they can interact with is the spec?

aaronlehmann · 2017-02-25T01:41:14Z

I guess I'm uncomfortable with the "engine manipulates the spec" model. It means the REST API is not really declarative, even if the gRPC API is. I think these differences will be confusing and become a maintenance burden.

For example, what happens if the client decides to include a SupportedPlatforms field in the spec? How would we prevent the engine from overwriting it? One possibility is to hide this field from the REST API, but again, introducing differences between the APIs really seems like the wrong direction.

I don't have any great suggestions here. I think in my ideal world, the client would be doing these derivations, not the engine. I hoped to do it that way for digest pinning, but a different approach got chosen, and now I fear it's becoming precedent to continue adding similar hacks.

stevvooe · 2017-02-25T01:47:08Z

This doesn't look right. We are coupling ourselves way too tightly to the image format (why are we even assuming that tasks have images or even a platform?). It seems like we should only be supporting a constraint that would dispatch a service to a particular platform. Making a platform determination when full context isn't available will make the system limiting when we try to schedule other kinds of workloads.

Again, I hope we can step cautiously when adding multi-platform support. For this to work smoothly, sometimes it is more about what you don't do, than what you do do. In this case, this approach seems limiting.

aluzzardi · 2017-02-25T01:49:42Z

@stevvooe SupportedPlatforms is a constraint. Ideally it should be in TaskSpec.Placement.Platforms, similar to TaskSpec.Placement.Constraints.

The engine has a convenience of filling out those fields based on the manifest

stevvooe · 2017-02-25T02:40:40Z

@stevvooe SupportedPlatforms is a constraint. Ideally it should be in TaskSpec.Placement.Platforms, similar to TaskSpec.Placement.Constraints.

That sounds more right.

nishanttotla · 2017-02-27T20:30:57Z

Personally, I agree with @aaronlehmann that the TaskSpec shouldn't have this information, especially if it results in coupling too tightly with the image format. But the problem as @aluzzardi pointed out is that the Engine only interacts with the spec, so there's a problem wrt passing the information to the manager.

Given this, isn't TaskSpec.Placement.Platforms still (somewhat) coupled with the image format, given that the information we put in Platforms will look like what the manifest contains?

There are also a couple of important questions to answer, based on moby/moby#31348 (comment):

How much control should we give the user wrt setting platform constraints? If we put platform information in the spec, then the expectation would be that the user should be able to set it.
We are implicitly assuming that cross-platform services are okay. The platform filter as in this PR will not restrict all tasks to run on only a single platform if multiple platforms are supported. Are there images that will be affected by this? In particular, how does this change when we want to have persistent distributed data? cc @friism

nishanttotla · 2017-02-28T19:03:36Z

Changed to TaskSpec.Placement.Platforms. PTAL.

aaronlehmann · 2017-02-28T19:39:24Z

manager/scheduler/filter.go

+	if placement != nil {
+		if len(placement.Platforms) > 0 {
+			// copy the platform information
+			f.supportedPlatforms = placement.Platforms


Let's move this line outside above the if len(placement.Platforms) > 0. It shouldn't make a difference, because the filter should be disabled in this case, but it seems like a good idea to clear data that may be left over from a previous use of the filter.

Ah, you're right. Fixed it.

aaronlehmann · 2017-02-28T19:39:56Z

Design LGTM

aaronlehmann · 2017-02-28T19:47:13Z

manager/scheduler/filter.go

+	if placement != nil {
+		f.supportedPlatforms = placement.Platforms
+		if len(placement.Platforms) > 0 {
+			// copy the platform information


comment needs to be moved too :)

aaronlehmann · 2017-05-03T22:33:47Z

Let's add a test case for a missing architecture.

Otherwise, LGTM

nishanttotla · 2017-05-03T22:56:11Z

@aaronlehmann done, thanks!

The tests seem flaky for some reason, do they need more time to run?

aaronlehmann · 2017-05-03T23:04:23Z

No, it's a problem with your test. watchAssignmentFailure silently consumes successes. You can't make assumptions about the order in which tasks are scheduled. It may be easier to create one task at a time, and watch for success or failure.

nishanttotla · 2017-05-03T23:28:14Z

@aaronlehmann fixed it.

dperny · 2017-05-04T17:26:57Z

manager/scheduler/filter.go

+	supportedPlatforms []*api.Platform
+}
+
+// SetTask returns true when the filter is enabled for a given task.


the name SetTask makes this seem like a setter, with side effects, but the doc comment makes it seem like a comparison, without side effects. but the code body clearly shows side effects. can you rewrite this comment? is the boolean supposed to indicate the success or failure of the operation? if it fails, would it be better to return an error?

This name comes from the interface that PlatformFilter implements:

// Filter checks whether the given task can run on the given node. // A filter may only operate type Filter interface { // SetTask returns true when the filter is enabled for a given task // and assigns the task to the filter. It returns false if the filter // isn't applicable to this task. For instance, a constraints filter // would return `false` if the task doesn't contain any constraints. SetTask(*api.Task) bool // Check returns true if the task assigned by SetTask can be scheduled // into the given node. This function should not be called if SetTask // returned false. Check(*NodeInfo) bool // Explain what a failure of this filter means Explain(nodes int) string }

All filters are implemented this way. I agree that the name and the comment could be confusing. I'll open a separate PR to propose changing that, since all filters are implemented this way and this PR shouldn't touch others.

dperny · 2017-05-04T17:28:35Z

manager/scheduler/filter.go

+	if nodes == 1 {
+		return "unsupported platform on 1 node"
+	}
+	return fmt.Sprintf("unsupported platform on %d nodes", nodes)


would it not be better to have this return an error somewhere?

Again, following the interface design here.

dperny · 2017-05-04T17:29:49Z

Requested changed, but you do not HAVE to make these changes, as long as you can give me a quick explanation as to why it's done the way it's done.

aaronlehmann · 2017-05-04T17:32:44Z

@dperny: These methods satisfy an interface. SetTask is a setter that sets up the filter for this task and returns true/false depending whether the filter should be enabled for that task. Explain returns a string explaining why that filter prevented the task from being scheduled. It isn't useful to return an error because this gets exposed through TaskStatus protobuf message. It's not bubbled up internally.

dperny · 2017-05-04T17:33:24Z

@aaronlehmann thanks, sorry about that then. is the interface documented with that information?

aaronlehmann · 2017-05-04T17:35:22Z

@dperny: https://github.com/docker/swarmkit/blob/56bcb2fcb5a698201efb0491dbb6ad1671d5f6bf/manager/scheduler/filter.go#L11-L27

dperny · 2017-05-04T17:40:57Z

Thanks. LGTM 👍 👍 👍

dongluochen · 2017-05-04T17:58:52Z

It seems we are not very sure of the use cases, should we make this an experimental feature so we can change it when we get user feedbacks?

dongluochen · 2017-05-04T18:06:45Z

manager/scheduler/scheduler_test.go

+		},
+	}
+
+	// task3 - no platform constraints, should run on any node


I think you should add another test with task specifying multiple platforms explicitly. And verify it can be scheduled to nodes with different architecture.

Adding it now.

@dongluochen added a task that can run on multiple platforms, and an extra node.

nishanttotla · 2017-05-04T22:23:10Z

@dongluochen the use case is straightforward (and uncommon IMO). When you have a mixed cluster, you don't want images that are not supported by a platform to run on a node with that platform.

Signed-off-by: Nishant Totla <[email protected]>

dongluochen · 2017-05-04T22:49:02Z

@nishanttotla My concern is mis-uses. For example, users may not realize an image can run on 2 platforms. It accidentally gets scheduled to a random platform.

nishanttotla · 2017-05-04T22:52:50Z

My concern is mis-uses. For example, users may not realize an image can run on 2 platforms. It accidentally gets scheduled to a random platform.

This is what happens right now. Images are scheduled to any node whether or not it can run there. This PR just provides an extra filter to allow for preventing that.

trunet · 2017-07-02T13:13:19Z

please, see #2294

nishanttotla · 2017-07-24T18:37:52Z

@erikbgithub regarding the problem you're facing with 3 raspberries, can you open a new issue providing more details about individual nodes as well as the service you're running?

We can discuss possible solutions there, as well as ways to make the error report better.

nishanttotla added area/orchestration kind/enhancement labels Feb 23, 2017

aaronlehmann reviewed Feb 23, 2017

View reviewed changes

nishanttotla force-pushed the image-platform-proposal branch from 75abfc8 to c6145e1 Compare February 23, 2017 22:33

dongluochen reviewed Feb 23, 2017

View reviewed changes

nishanttotla mentioned this pull request Feb 24, 2017

Multi-architecture support in Swarm mode moby/moby#31348

Closed

nishanttotla force-pushed the image-platform-proposal branch from c6145e1 to 7d6722a Compare February 24, 2017 22:36

nishanttotla changed the title ~~[proposal] Platform Filter for Service Scheduling~~ [proposal][WIP] Platform Filter for Service Scheduling Feb 24, 2017

nishanttotla force-pushed the image-platform-proposal branch from 7d6722a to 1339b2c Compare February 28, 2017 19:02

nishanttotla changed the title ~~[proposal][WIP] Platform Filter for Service Scheduling~~ [proposal] Platform Filter for Service Scheduling Feb 28, 2017

aaronlehmann reviewed Feb 28, 2017

View reviewed changes

aaronlehmann added the status/2-code-review label Feb 28, 2017

nishanttotla force-pushed the image-platform-proposal branch from 1339b2c to aa425f0 Compare February 28, 2017 19:46

aaronlehmann reviewed Feb 28, 2017

View reviewed changes

nishanttotla force-pushed the image-platform-proposal branch from 3d244d0 to 8226e23 Compare May 3, 2017 22:15

nishanttotla force-pushed the image-platform-proposal branch from 8226e23 to 0cb059a Compare May 3, 2017 22:42

nishanttotla force-pushed the image-platform-proposal branch from 0cb059a to 6da8089 Compare May 3, 2017 23:19

dperny requested changes May 4, 2017

View reviewed changes

dperny approved these changes May 4, 2017

View reviewed changes

dongluochen reviewed May 4, 2017

View reviewed changes

nishanttotla added status/4-merge and removed status/2-code-review labels May 4, 2017

Adding PlatformFilter for scheduling

8edbb92

Signed-off-by: Nishant Totla <[email protected]>

nishanttotla force-pushed the image-platform-proposal branch from 6da8089 to 8edbb92 Compare May 4, 2017 22:45

dongluochen approved these changes May 4, 2017

View reviewed changes

nishanttotla merged commit c54f8f7 into moby:master May 4, 2017

nishanttotla deleted the image-platform-proposal branch May 4, 2017 22:57

nishanttotla mentioned this pull request May 10, 2017

Adding Platforms field to TaskSpec Placement moby/moby#33144

Merged

vielmetti mentioned this pull request Oct 9, 2017

"no suitable node available for task" on arm64 due to non-normalized platform name #2401

Closed

Conversation

nishanttotla commented Feb 23, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does

Where does the platform information come from?

Uh oh!

codecov-io commented Feb 23, 2017 • edited by codecov bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aaronlehmann commented Feb 23, 2017

Uh oh!

dongluochen commented Feb 23, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nishanttotla commented Feb 23, 2017

Uh oh!

aaronlehmann commented Feb 24, 2017

Uh oh!

nishanttotla commented Feb 24, 2017

Uh oh!

aluzzardi commented Feb 25, 2017

Uh oh!

aaronlehmann commented Feb 25, 2017

Uh oh!

aluzzardi commented Feb 25, 2017

Uh oh!

aaronlehmann commented Feb 25, 2017

Uh oh!

stevvooe commented Feb 25, 2017

Uh oh!

aluzzardi commented Feb 25, 2017

Uh oh!

stevvooe commented Feb 25, 2017

Uh oh!

nishanttotla commented Feb 27, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nishanttotla commented Feb 28, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aaronlehmann commented Feb 28, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aaronlehmann commented May 3, 2017

Uh oh!

nishanttotla commented May 3, 2017

Uh oh!

aaronlehmann commented May 3, 2017

Uh oh!

nishanttotla commented May 3, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nishanttotla May 4, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

nishanttotla commented Feb 23, 2017 •

edited

Loading

codecov-io commented Feb 23, 2017 •

edited by codecov bot

Loading

nishanttotla commented Feb 27, 2017 •

edited

Loading

nishanttotla May 4, 2017 •

edited

Loading