Skip to content

Platform Filter for Service Scheduling#1981

Merged
nishanttotla merged 2 commits intomoby:masterfrom
nishanttotla:image-platform-proposal
May 4, 2017
Merged

Platform Filter for Service Scheduling#1981
nishanttotla merged 2 commits intomoby:masterfrom
nishanttotla:image-platform-proposal

Conversation

@nishanttotla
Copy link
Contributor

@nishanttotla nishanttotla commented Feb 23, 2017

This PR adds the ability to automatically schedule service tasks based on platform constraints. It's part of a larger change, where platform constraints would be passed as part of the service spec, and used during scheduling to avoid running tasks on nodes that don't support them.

What this PR does

  • Adds a Platforms field to TaskSpec.Placement. This is a list of all platforms that the container can be scheduled on.
  • Adds a PlatformFilter struct, that gets activated if TaskSpec.Placement.Platforms is non-empty, and picks a node for a container if node.Description.Platform is in TaskSpec.Placement.Platforms.

Where does the platform information come from?

Some images have fat manifests, that include references to platform-specific manifests (an example of such an image is friism/golang). The Docker daemon already pins service images by digest, and can in addition retrieve platform information from the fat manifest (if it exists) and populate the TaskSpec.Placement.Platforms field. If the field remains empty, then the platform filter is not enabled.

See moby/moby#31348 for broader context.

cc @aaronlehmann @aluzzardi @dongluochen @friism

@codecov-io
Copy link

codecov-io commented Feb 23, 2017

Codecov Report

Merging #1981 into master will decrease coverage by 0.15%.
The diff coverage is 78.94%.

@@            Coverage Diff             @@
##           master    #1981      +/-   ##
==========================================
- Coverage   60.03%   59.87%   -0.16%     
==========================================
  Files         119      119              
  Lines       19670    19684      +14     
==========================================
- Hits        11808    11786      -22     
- Misses       6513     6550      +37     
+ Partials     1349     1348       -1

// copy the platform information
for _, p := range c.SupportedPlatforms {
f.supportedPlatforms = append(f.supportedPlatforms, p)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anywhere that f.supportedPlatforms is cleared, so I think this slice will just grow every time a task is scheduled, and the filter won't work as expected.

Also note that you can do append(f.supportedPlatforms, c.SupportedPlatforms...) instead of using a loop.

But wouldn't it be even better to just do f.supportedPlatforms = c.SupportedPlatforms?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaronlehmann updated to remove the for loop. With using f.supportedPlatforms = c.SupportedPlatforms, we don't need to reset it anymore, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct

@aaronlehmann
Copy link
Collaborator

Adds a SupportedPlatforms field to ContainerSpec. This is a list of all platforms that the container can be scheduled on.
Adds a PlatformFilter struct, that gets activated if SupportedPlatforms is non-empty, and picks a node for a container if node.Description.Platform is in SupportedPlatforms.

The general approach seems sound. I don't think SupportedPlatforms should be inside ContainerSpec, though. We generally don't modify any specs inside the daemon. We made an exception for digest pinning, sort of as a hack, because reusing the existing Image field was simpler than the alternatives, and the behavior on updates was well-defined and well-behaved. But since SupportedPlatforms is a new addition, I don't think putting it in the spec buys us anything. I guess we'd have to manually copy this information to tasks, but I think that's worth it.

One thing that needs to be thought through before we move forward with this is how updates are handled. Basically, we need to define when the manager will retrieve the manifest and update this platform information. I think doing this on every update is a non-starter, because often updates are changing something totally unrelated to the image (i.e. --update-parallelism), and don't need to incur this round-trip.

With digest pinning, the logic is relatively simple: if there's no digest, we add one, but if there's already a digest, nothing needs to be done.

I think that for platform support, we need to define a mechanism that avoids unnecessary round trips to the registry when we update an image but keep the image/digest the same. My thought was that SupportedPlatforms could contain a Source field that indicates where the information came from. If the value of Source matches the current Image field in ContainerSpec, that means the platform information is up to date and it's not necessary to download the manifest. If Source does not match Image, then it needs to be updated.

@nishanttotla nishanttotla force-pushed the image-platform-proposal branch from 75abfc8 to c6145e1 Compare February 23, 2017 22:33
@dongluochen
Copy link
Contributor

Isn't SupportPlatforms part of image spec? If image hasn't changed, no need to inspect this info.

// check if the platform for the node is supported
nodePlatform := n.Description.Platform
for _, p := range f.supportedPlatforms {
if p.Architecture == nodePlatform.Architecture && p.OS == nodePlatform.OS {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the OS matching. From node, the OS is reported to daemon. The image spec on OS might be just "Linux". I don't know if we can compare them directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongluochen in the examples I've seen, the common ones include Architecture = "amd64" and OS = "linux" or OS = "windows". This seemed like the common sense comparison, but perhaps @friism can add more insight here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The node info from docker info contains Operating System and OSType. I think currently we are collecting only one of them in NodeInfo. I think both are useful.

dchen@vm2:~$ docker info | egrep "Operating|OS"
Operating System: Ubuntu 14.04.3 LTS
OSType: linux

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongluochen I can't remember why, but I think it was a conscious decision to not collect Operating System info reported by the node. Either way, I don't fully understand what constraints are sufficient to run multi-platform images, so I'm happy to make necessary changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevvooe need your comments here. What is a sufficient comparison for Platforms?

@nishanttotla
Copy link
Contributor Author

The general approach seems sound. I don't think SupportedPlatforms should be inside ContainerSpec, though. We generally don't modify any specs inside the daemon. We made an exception for digest pinning, sort of as a hack, because reusing the existing Image field was simpler than the alternatives, and the behavior on updates was well-defined and well-behaved. But since SupportedPlatforms is a new addition, I don't think putting it in the spec buys us anything. I guess we'd have to manually copy this information to tasks, but I think that's worth it.

I agree with this reasoning, but didn't know a better way. The reason I put it in ContainerSpec was that it's a container-specific field, assuming that tasks are more generic workloads that don't necessarily have platform information the way Docker specifies it. I don't understand "manually copy this information to tasks", can you clarify?

One thing that needs to be thought through before we move forward with this is how updates are handled. Basically, we need to define when the manager will retrieve the manifest and update this platform information. I think doing this on every update is a non-starter, because often updates are changing something totally unrelated to the image (i.e. --update-parallelism), and don't need to incur this round-trip.

Good point. Can we agree that whenever we're contacting the registry to retrieve digest, we don't mind pulling platform information also?

In cases where the --image is not supplied with service update, we don't need to do anything, because we just assume that the image and all its platform-related requirements stay the same.

The interesting case is when the digest is provided (say by the user). In this case, we do know at the manager level whether the digest of the new service spec is different from the existing one. If it isn't different, we don't need to do anything (this is probably the same case as above). If there is a change, then we want to contact the registry to get platform information. The only difference is that before, the user could update the service by providing a new digest, and skip the registry lookup, whereas now it would entail the registry lookup.

To summarize, I don't see why we cannot avoid registry lookup when the service is updated but the image is not.

@aaronlehmann
Copy link
Collaborator

I don't understand "manually copy this information to tasks", can you clarify?

I assume we want tasks to contain a list of supported platforms, so the scheduler can easily check against nodes. The TaskSpec (including ContainerSpec) gets automatically copied to tasks associated with the service, but something outside the spec would have to be copied separately by the orchestrator.

In this case, we do know at the manager level whether the digest of the new service spec is different from the existing one. If it isn't different, we don't need to do anything (this is probably the same case as above). If there is a change, then we want to contact the registry to get platform information.

This seems reasonable.

@nishanttotla
Copy link
Contributor Author

@aaronlehmann I've moved the SupportedPlatforms field to the Task object, taking it out of ContainerSpec. I still need to change add code for copying the field in the orchestrator.

In principle (assuming SwarmKit would support other workloads beyond containers in the future), do you think it's reasonable to attach platform restrictions to tasks?

@nishanttotla nishanttotla changed the title [proposal] Platform Filter for Service Scheduling [proposal][WIP] Platform Filter for Service Scheduling Feb 24, 2017
@aluzzardi
Copy link
Member

@aaronlehmann To have that in the Task rather than the TaskSpec is weird though.

Task contains more like runtime status data while the spec contains configuration and here I think this is configuration

For instance, changing the supported list of platforms should trigger a rolling update IMO

@aaronlehmann
Copy link
Collaborator

To have that in the Task rather than the TaskSpec is weird though.

Hmm, I disagree. TaskSpec is declarative and SupportedPlatforms is something derived from other information.

For instance, changing the supported list of platforms should trigger a rolling update IMO

Since SupportedPlatforms is derived information, it will never be changed directly. A change in the information that SupportedPlatforms is derived from (such as the image digest) can trigger a rolling update.

@aluzzardi
Copy link
Member

SupportedPlatforms is not derived by SwarmKit. Either SwarmKit derives something internally and puts it into an attribute, or someone else does some computation and instructs SwarmKit to do stuff using the Spec.

Like - if we do put this into Task rather than TaskSpec, how's the engine going to pass it to us since all they can interact with is the spec?

@aaronlehmann
Copy link
Collaborator

I guess I'm uncomfortable with the "engine manipulates the spec" model. It means the REST API is not really declarative, even if the gRPC API is. I think these differences will be confusing and become a maintenance burden.

For example, what happens if the client decides to include a SupportedPlatforms field in the spec? How would we prevent the engine from overwriting it? One possibility is to hide this field from the REST API, but again, introducing differences between the APIs really seems like the wrong direction.

I don't have any great suggestions here. I think in my ideal world, the client would be doing these derivations, not the engine. I hoped to do it that way for digest pinning, but a different approach got chosen, and now I fear it's becoming precedent to continue adding similar hacks.

@stevvooe
Copy link
Contributor

This doesn't look right. We are coupling ourselves way too tightly to the image format (why are we even assuming that tasks have images or even a platform?). It seems like we should only be supporting a constraint that would dispatch a service to a particular platform. Making a platform determination when full context isn't available will make the system limiting when we try to schedule other kinds of workloads.

Again, I hope we can step cautiously when adding multi-platform support. For this to work smoothly, sometimes it is more about what you don't do, than what you do do. In this case, this approach seems limiting.

@aluzzardi
Copy link
Member

@stevvooe SupportedPlatforms is a constraint. Ideally it should be in TaskSpec.Placement.Platforms, similar to TaskSpec.Placement.Constraints.

The engine has a convenience of filling out those fields based on the manifest

@stevvooe
Copy link
Contributor

@stevvooe SupportedPlatforms is a constraint. Ideally it should be in TaskSpec.Placement.Platforms, similar to TaskSpec.Placement.Constraints.

That sounds more right.

@nishanttotla
Copy link
Contributor Author

nishanttotla commented Feb 27, 2017

Personally, I agree with @aaronlehmann that the TaskSpec shouldn't have this information, especially if it results in coupling too tightly with the image format. But the problem as @aluzzardi pointed out is that the Engine only interacts with the spec, so there's a problem wrt passing the information to the manager.

Given this, isn't TaskSpec.Placement.Platforms still (somewhat) coupled with the image format, given that the information we put in Platforms will look like what the manifest contains?

There are also a couple of important questions to answer, based on moby/moby#31348 (comment):

  • How much control should we give the user wrt setting platform constraints? If we put platform information in the spec, then the expectation would be that the user should be able to set it.
  • We are implicitly assuming that cross-platform services are okay. The platform filter as in this PR will not restrict all tasks to run on only a single platform if multiple platforms are supported. Are there images that will be affected by this? In particular, how does this change when we want to have persistent distributed data? cc @friism

@nishanttotla nishanttotla force-pushed the image-platform-proposal branch from 7d6722a to 1339b2c Compare February 28, 2017 19:02
@nishanttotla nishanttotla changed the title [proposal][WIP] Platform Filter for Service Scheduling [proposal] Platform Filter for Service Scheduling Feb 28, 2017
@nishanttotla
Copy link
Contributor Author

Changed to TaskSpec.Placement.Platforms. PTAL.

if placement != nil {
if len(placement.Platforms) > 0 {
// copy the platform information
f.supportedPlatforms = placement.Platforms
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this line outside above the if len(placement.Platforms) > 0. It shouldn't make a difference, because the filter should be disabled in this case, but it seems like a good idea to clear data that may be left over from a previous use of the filter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're right. Fixed it.

@aaronlehmann
Copy link
Collaborator

Design LGTM

if placement != nil {
f.supportedPlatforms = placement.Platforms
if len(placement.Platforms) > 0 {
// copy the platform information
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment needs to be moved too :)

@nishanttotla nishanttotla force-pushed the image-platform-proposal branch from 3d244d0 to 8226e23 Compare May 3, 2017 22:15
@aaronlehmann
Copy link
Collaborator

Let's add a test case for a missing architecture.

Otherwise, LGTM

@nishanttotla nishanttotla force-pushed the image-platform-proposal branch from 8226e23 to 0cb059a Compare May 3, 2017 22:42
@nishanttotla
Copy link
Contributor Author

@aaronlehmann done, thanks!

The tests seem flaky for some reason, do they need more time to run?

@aaronlehmann
Copy link
Collaborator

No, it's a problem with your test. watchAssignmentFailure silently consumes successes. You can't make assumptions about the order in which tasks are scheduled. It may be easier to create one task at a time, and watch for success or failure.

@nishanttotla nishanttotla force-pushed the image-platform-proposal branch from 0cb059a to 6da8089 Compare May 3, 2017 23:19
@nishanttotla
Copy link
Contributor Author

@aaronlehmann fixed it.

supportedPlatforms []*api.Platform
}

// SetTask returns true when the filter is enabled for a given task.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the name SetTask makes this seem like a setter, with side effects, but the doc comment makes it seem like a comparison, without side effects. but the code body clearly shows side effects. can you rewrite this comment? is the boolean supposed to indicate the success or failure of the operation? if it fails, would it be better to return an error?

Copy link
Contributor Author

@nishanttotla nishanttotla May 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name comes from the interface that PlatformFilter implements:

// Filter checks whether the given task can run on the given node.
// A filter may only operate
type Filter interface {
	// SetTask returns true when the filter is enabled for a given task
	// and assigns the task to the filter. It returns false if the filter
	// isn't applicable to this task.  For instance, a constraints filter
	// would return `false` if the task doesn't contain any constraints.
	SetTask(*api.Task) bool

	// Check returns true if the task assigned by SetTask can be scheduled
	// into the given node. This function should not be called if SetTask
	// returned false.
	Check(*NodeInfo) bool

	// Explain what a failure of this filter means
	Explain(nodes int) string
}

All filters are implemented this way. I agree that the name and the comment could be confusing. I'll open a separate PR to propose changing that, since all filters are implemented this way and this PR shouldn't touch others.

if nodes == 1 {
return "unsupported platform on 1 node"
}
return fmt.Sprintf("unsupported platform on %d nodes", nodes)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it not be better to have this return an error somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, following the interface design here.

@dperny
Copy link
Collaborator

dperny commented May 4, 2017

Requested changed, but you do not HAVE to make these changes, as long as you can give me a quick explanation as to why it's done the way it's done.

@aaronlehmann
Copy link
Collaborator

@dperny: These methods satisfy an interface. SetTask is a setter that sets up the filter for this task and returns true/false depending whether the filter should be enabled for that task. Explain returns a string explaining why that filter prevented the task from being scheduled. It isn't useful to return an error because this gets exposed through TaskStatus protobuf message. It's not bubbled up internally.

@dperny
Copy link
Collaborator

dperny commented May 4, 2017

@aaronlehmann thanks, sorry about that then. is the interface documented with that information?

@dperny
Copy link
Collaborator

dperny commented May 4, 2017

Thanks. LGTM 👍 👍 👍

@dongluochen
Copy link
Contributor

It seems we are not very sure of the use cases, should we make this an experimental feature so we can change it when we get user feedbacks?

},
}

// task3 - no platform constraints, should run on any node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should add another test with task specifying multiple platforms explicitly. And verify it can be scheduled to nodes with different architecture.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding it now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongluochen added a task that can run on multiple platforms, and an extra node.

@nishanttotla
Copy link
Contributor Author

@dongluochen the use case is straightforward (and uncommon IMO). When you have a mixed cluster, you don't want images that are not supported by a platform to run on a node with that platform.

@nishanttotla nishanttotla force-pushed the image-platform-proposal branch from 6da8089 to 8edbb92 Compare May 4, 2017 22:45
@dongluochen
Copy link
Contributor

@nishanttotla My concern is mis-uses. For example, users may not realize an image can run on 2 platforms. It accidentally gets scheduled to a random platform.

@nishanttotla
Copy link
Contributor Author

My concern is mis-uses. For example, users may not realize an image can run on 2 platforms. It accidentally gets scheduled to a random platform.

This is what happens right now. Images are scheduled to any node whether or not it can run there. This PR just provides an extra filter to allow for preventing that.

@nishanttotla nishanttotla merged commit c54f8f7 into moby:master May 4, 2017
@nishanttotla nishanttotla deleted the image-platform-proposal branch May 4, 2017 22:57
@trunet
Copy link

trunet commented Jul 2, 2017

please, see #2294

@nishanttotla
Copy link
Contributor Author

@erikbgithub regarding the problem you're facing with 3 raspberries, can you open a new issue providing more details about individual nodes as well as the service you're running?

We can discuss possible solutions there, as well as ways to make the error report better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants