Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

systemd: support cgroup parent with specified slice #336

Merged
merged 1 commit into from
Dec 11, 2015

Conversation

hqhq
Copy link
Contributor

@hqhq hqhq commented Oct 16, 2015

Pick up #119
Fixes: moby/moby#16681

Signed-off-by: Qiang Huang [email protected]

@mrunalp
Copy link
Contributor

mrunalp commented Oct 16, 2015

@hqhq Thanks for reviving this! :) I think we can review your PR.

@hqhq
Copy link
Contributor Author

hqhq commented Oct 26, 2015

ping @LK4D4 @crosbymichael @mrunalp @vishh What do you think about this PR.

A lot of users want --cgroup-parent behavior consistent between cgroupfs and systemd, but AFAIK that is not possible. Systemd organizes hierarchy by slice, we can't create a new scope unit under an existed scope unit by systemd api, even if we can do it manually and fool systemd-cgls, I think that's against systemd philosophy.

If we can't consist them, and use --cgroup-parent to assign parent cgroup on cgroupfs, while assign parent slice on systemd seems confusing, I'm thinking about we separate this usage, allow --cgroup-parent to be allowed only on cgroupfs system, and add a new option like --systemd-slice for systemd system. Does that make sense?

Maybe we can reopen #141 to finish the discussion.

And I'm not a systemd expert, please correct me if I am wrong.

@vishh
Copy link
Contributor

vishh commented Oct 26, 2015

A lot of users want --cgroup-parent behavior consistent between cgroupfs and systemd, but AFAIK that is not possible. Systemd organizes hierarchy by slice, we can't create a new scope unit under an existed scope unit by systemd api, even if we can do it manually and fool systemd-cgls, I think that's against systemd philosophy.

Systemd allows creation of hierarchical slices. Why can't we pass the cgroups path to a slice as --cgroup-parent and create scopes(containers) under that slice?
One of the primary use cases of this feature is to provide differentiated quality of service using multiple cgroup hierarchies, which maps well to systemd slices.
WDYT @hqhq ?

Since runc is attempting to implement the OCI Spec, we should probably move this discussion to OCI spec. FYI: The Spec currently supports taking in an absolute path to cgroup. If runc were to implement that, --cgroup-parent can be possibly be replaced with a --cgroups option.

@crosbymichael
Copy link
Member

would be nicer to drop systemd support since 50%+ of the code is ours setting things up that systemd does not. Its so bad. If we want the container in the systemd scope for a unit then we can just use the fs implementation and set cgroup-parent to the scope

@vishh
Copy link
Contributor

vishh commented Oct 26, 2015

If the end-goal is to track a container using a unit file, we need to
re-use the scope that has been created by systemd. In such a scenario, we
will have to avoid setting up cgroups that systemd does not handle by
default.

On Mon, Oct 26, 2015 at 10:38 AM, Michael Crosby [email protected]
wrote:

would be nicer to drop systemd support since 50%+ of the code is ours
setting things up that systemd does not. Its so bad. If we want the
container in the systemd scope for a unit then we can just use the fs
implementation and set cgroup-parent to the scope


Reply to this email directly or view it on GitHub
#336 (comment).

@crosbymichael
Copy link
Member

@vishh #325 (comment)

@hqhq
Copy link
Contributor Author

hqhq commented Oct 29, 2015

@crosbymichael I understand we can use systemd unit files to manage runc as a systemd service. But I think that's not good enough for users. And we still have to support systemd for Docker right?

Nowadays almost all distros are using systemd, giving up systemd support would be a very big deal, I think we don't have choice but enhance systemd support.

ping @philips since go-systemd is from CoreOS repo, what do you think about problems we are facing with systemd? Such like this --cgroup-parent problem?

@vishh
Copy link
Contributor

vishh commented Oct 29, 2015

IIUC, users want to create systemd services which are docker containers.
What @crosbymichael proposed should help achieve that right?

On Wed, Oct 28, 2015 at 11:29 PM, Qiang Huang [email protected]
wrote:

@crosbymichael https://github.com/crosbymichael I understand we can use
systemd unit files to manage runc as a systemd service. But I think that's
not good enough for users. And we still have to support systemd for Docker
right?

Nowadays almost all distros are using systemd, giving up systemd support
would be a very big deal, I think we don't have choice but enhance systemd
support.

ping @philips https://github.com/philips since go-systemd is from
CoreOS repo, what do you think about problems we are facing with systemd?
Such like this --cgroup-parent problem?


Reply to this email directly or view it on GitHub
#336 (comment).

@hqhq
Copy link
Contributor Author

hqhq commented Oct 30, 2015

@vishh I don't think so, in my understanding, users want to create a Docker container by docker run ..., and then see this container as a systemd service from systemd-cgls, and they can change properties by systemctl set-property .... I don't think what @crosbymichael proposed would help achieve that. With @crosbymichael 's propose, you need to create a unit file xxx.service for every docker run command, and create containers by systemctl start xxx.service instead of docker run xxx from bash.

@crosbymichael Correct me if I'm wrong.

@anusha-ragunathan
Copy link
Contributor

Currently when cgroups are managed by systemd, each container runs in a systemd scope. AFAICT, scopes are not hierarchical, so we cannot nest a container's scope in another (like raw cgroup fs). However, systemd slices are hierarchical. This can be used to group containers and control resource limits on them.

To group containers, users should be able to create a slice definition under /usr/lib/systemd/system, say "limits.slice" and run containers in it using --cgroup-parent option:
docker run -it --cgroup-parent=limits.slice busybox sh

Grouping of containers can be observed under limits.slice using systemd-cgls.

Implementing a change that honors the cgroup-parent as the slice is straightforward. If we are okay with the following side-effects of the change, then we can proceed to implement it:

  • container's scope can be moved away from system.slice. Does that affect accounting in any way?
  • cgroup-parent implementation will be slightly different between cgroupfs (parent is container's cgroup) and systemd cgroup (parent is a systemd slice)

ping @crosbymichael @LK4D4 @hqhq @vishh

@@ -151,8 +151,8 @@ func (m *Manager) Apply(pid int) error {
properties []systemdDbus.Property
)

if c.Slice != "" {
slice = c.Slice
if c.Parent != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When invoked via docker, c.Parent will never be NULL. The default parent cgroup is "docker".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But Docker can not be the only user, if we enable systemd cgroup for runc, runc itself can set c.Parent to anything.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docker sets Cgroup Parent to be "docker" in daemon/execdriver/native/template/default_template_linux.go. This should be changed to "system.slice" in case of systemd cgroups.

@mrunalp
Copy link
Contributor

mrunalp commented Dec 1, 2015

@anusha-ragunathan That is an accurate summary. What this PR is doing is fixing the current behavior which is absolutely wrong. I don't see why we shouldn't allow this PR. It fixes systemd behavior to be correct while not really affecting those who want to continue using fs implementation.

@mrunalp
Copy link
Contributor

mrunalp commented Dec 1, 2015

@hqhq This needs rebase btw.

@derekwaynecarr
Copy link
Contributor

@mrunalp @anusha-ragunathan +1 to ensuring that --cgroup-parent can accept a slice. current behavior is broken otherwise on systemd

@hqhq hqhq force-pushed the hq_parent_cgroup_systemd branch from 3c4b761 to 7695a0d Compare December 3, 2015 07:16
@hqhq
Copy link
Contributor Author

hqhq commented Dec 3, 2015

@mrunalp Rebased.

// Parent slice to use for systemd TODO: remove in favor or parent
Slice string `json:"slice"`
// ScopePrefix decribes prefix for the scope name
ScopePrefix string `json:"scope_prefix"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Docker, ScopePrefix should always be "docker".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that needs to be done on Docker side and of course we will do that.

@vishh
Copy link
Contributor

vishh commented Dec 3, 2015

@mrunalp @hqhq: I see that parent-cgroup needs to work with systemd. The issue with this PR is that the --cgroup-parent flag will be overloaded.
How about having a wrapper around docker for systemd compatibility? That wrapper can infer the scope for a systemd service, construct the cgroup fs path for that scope and then pass it to docker using --cgroup-parent?
I'd surprised if cgroups is the only compatibility issue between docker and systemd.

The fact that runc now has an in-built support for systemd dilutes my argument though :(

@crosbymichael
Copy link
Member

@vishh i thought runc did not support systemd cgroups or is that some other feature?

@mrunalp
Copy link
Contributor

mrunalp commented Dec 3, 2015

@vishh The problem this is trying to address is in docker that you can see below. Instead of renaming the scope, it should use a different parent slice as specified.

[root@localhost ~]# docker run -it --rm  --cgroup-parent hello busybox sh
/ # cat /proc/self/cgroup 
10:devices:/system.slice/hello-3d850b788c91cf63ba0f4ecc070dbdf326eec04bc46f2c63479be0eff36fee84.scope
9:cpuset:/system.slice/hello-3d850b788c91cf63ba0f4ecc070dbdf326eec04bc46f2c63479be0eff36fee84.scope
8:perf_event:/
7:hugetlb:/system.slice/hello-3d850b788c91cf63ba0f4ecc070dbdf326eec04bc46f2c63479be0eff36fee84.scope
6:freezer:/system.slice/hello-3d850b788c91cf63ba0f4ecc070dbdf326eec04bc46f2c63479be0eff36fee84.scope
5:memory:/system.slice/hello-3d850b788c91cf63ba0f4ecc070dbdf326eec04bc46f2c63479be0eff36fee84.scope
4:blkio:/system.slice/hello-3d850b788c91cf63ba0f4ecc070dbdf326eec04bc46f2c63479be0eff36fee84.scope
3:net_cls,net_prio:/system.slice/hello-3d850b788c91cf63ba0f4ecc070dbdf326eec04bc46f2c63479be0eff36fee84.scope
2:cpu,cpuacct:/system.slice/hello-3d850b788c91cf63ba0f4ecc070dbdf326eec04bc46f2c63479be0eff36fee84.scope
1:name=systemd:/system.slice/hello-3d850b788c91cf63ba0f4ecc070dbdf326eec04bc46f2c63479be0eff36fee84.scope
/ # 

@vishh
Copy link
Contributor

vishh commented Dec 3, 2015

@crosbymichael: You are correct.

@crosbymichael
Copy link
Member

The problem is no one really knows what to expect with systemd and hierarchies because it does not seem to have that concept. @anusha-ragunathan has been doing alot of research on how systemd handles cgroups and its like two totally different worlds with scopes and slices. It does not seem to appear that sytemd supports hierarchies like people want, alot of the parent child relation ship is encoded into the name but @anusha-ragunathan can explain a little better.

@mrunalp
Copy link
Contributor

mrunalp commented Dec 3, 2015

@crosbymichael systemd supports hierarchy of slices and at the lowest levels there are scopes/services. By allowing slices as values that one could pass to cgroup-parent we have parity with the fs implementation. The current behavior of renaming the scope is plain wrong.

@anusha-ragunathan
Copy link
Contributor

systemd (208 in my case. default for RHEL 7.1) allows hierarchy of slices, but no hierarchy of scopes.

  • Refer to StartTransientUnit() dbus API that's used to create a scope. Experiments show that setting the "Slice" property is allowed, however setting the "ControlGroup" is disallowed.
  • Another data point is the systemd-run utility which provides an option for processes to be run in a specific scope. In this experiment, we run process p1 in its own scope and process p2 in p1's scope.

systemd-run --scope sleep 180
Running as unit run-3894.scope.

systemd-run --scope --slice=run-3894.scope sleep 180
Running as unit run-3939.scope.

However, note that systemd-cgls doesnt nest the scopes. It simply creates a slice in p1's scopename and run p2 in it, while p1 is run under system.slice.
├─run.slice
│ └─run-3894.scope.slice -> p2 run under p1 slice name
│ └─run-3939.scope ---> p2 scope
│ └─3939 /bin/sleep 180
└─system.slice
├─run-3894.scope ---> p1 scope
│ └─3894 /bin/sleep 180

If systemd cgroups requires a way to group containers, then slices is the way to go. It's not equivalent to raw cgroupfs tree hierarchy which can allow deep nesting.

If the --cgroup-parent option should not be overloaded to accomodate both cgroupfs and systemd cgroups, then having a separate option (such as --systemd-slice) might be the correct way to proceed.

@mrunalp
Copy link
Contributor

mrunalp commented Dec 3, 2015

@anusha-ragunathan Yes, that's what I said in #336 (comment) regarding hierarchy in systemd.

In systemd a slice is essentially a parent for either sub-slices or scopes/services at the lowest level. Hence, IMO, --cgroup-parent should be allowed to take systemd slices as arguments.

@mrunalp
Copy link
Contributor

mrunalp commented Dec 3, 2015

@crosbymichael Agree with you.
@vishh This PR doesn't modify the fs implementation of cgroups at all. It only touches the systemd cgroups API.
I am arguing for those who choose to pass the flag to make docker daemon use the systemd cgroups API, not the fs API.

@hqhq
Copy link
Contributor Author

hqhq commented Dec 4, 2015

I'll try to summarize:

  • This PR is trying to fix --cgroup-parent usage on systemd-cgroup, which I think is totally right. Though runc is not using systemd-cgroup, Docker can use for now.
  • We can't make --cgroup-parent have the same semantic for both fs-cgroup and systemd-cgroup, but it doesn't matter, see 3.
  • @LK4D4 already changed Docker to use fs-cgroup by default, if people really want to use systemd-cgroup, they need to specifically set it, for these users, they know what they are doing and how to play with systemd, we can check on Docker side, for systemd-cgroup, --cgroup-parent must be named as xxx.slice.
  • For future plans, if we can simulate all systemd behavior with fs-cgroup, we can remove most of the systemd-cgroup code in runc, while keep Docker side, so people can still choose to use systemd way, but they don't care how runc implement it. Before that day comes, we'll need this PR for now.

@hqhq
Copy link
Contributor Author

hqhq commented Dec 4, 2015

@anusha-ragunathan I proposed a new option like --systemd-slice at the beginning of this PR #336 (comment) and I'm still open on that, but according my summarization above, it seems acceptable to use --cgroup-parent for both fs-cgroup and systemd-cgroup, WDYT?

@anusha-ragunathan
Copy link
Contributor

@hqhq : I'm perfectly fine with using --cgroup-parent to take a slice name for systemd cgroup. We can clearly document this difference in behavior.

@derekwaynecarr
Copy link
Contributor

I think its reasonable that if I do the following on a systemd environment:

$ systemctl status foo
● foo.service - Foo that does something important
   Loaded: loaded (/etc/systemd/system/foo.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2015-12-03 14:07:38 EST; 1 day 3h ago
 Main PID: 15864 (foo)
   CGroup: /important.slice/foo.service

For user's on those systems, they will think --cgroup-parent maps to important.slice.

Anything else is just confusing to the end-user.

So +1 @anusha-ragunathan

@mrunalp
Copy link
Contributor

mrunalp commented Dec 7, 2015

@crosbymichael @LK4D4 @vishh WDYT?

@LK4D4
Copy link
Contributor

LK4D4 commented Dec 7, 2015

@mrunalp Ok for me
However ScopePrefix sounds too "dockerish". But maybe it's ok.

@vishh
Copy link
Contributor

vishh commented Dec 7, 2015

@mrunalp: Which approach are you referring to?

@mrunalp
Copy link
Contributor

mrunalp commented Dec 8, 2015

@vishh I am talking about allowing --cgroup-parent to take slices as arguments

@mrunalp
Copy link
Contributor

mrunalp commented Dec 8, 2015

@LK4D4 We need not even expose the scope prefix as a cmdline/API option. For docker it could just be set to "docker".

@LK4D4
Copy link
Contributor

LK4D4 commented Dec 8, 2015

@mrunalp Yeah, I meant that we doing that option only for docker.

@hqhq
Copy link
Contributor Author

hqhq commented Dec 8, 2015

@LK4D4 Yes, we can do that when we merge this PR to Docker, it can be hard coded to be "docker" as @mrunalp said.

@mrunalp
Copy link
Contributor

mrunalp commented Dec 9, 2015

LGTM

@vishh
Copy link
Contributor

vishh commented Dec 9, 2015

LGTM.

On Wed, Dec 9, 2015 at 12:12 PM, Mrunal Patel [email protected]
wrote:

LGTM


Reply to this email directly or view it on GitHub
#336 (comment).

@mrunalp
Copy link
Contributor

mrunalp commented Dec 11, 2015

@crosbymichael @LK4D4 ping

@LK4D4
Copy link
Contributor

LK4D4 commented Dec 11, 2015

LGTM

LK4D4 added a commit that referenced this pull request Dec 11, 2015
systemd: support cgroup parent with specified slice
@LK4D4 LK4D4 merged commit cb04f03 into opencontainers:master Dec 11, 2015
@anusha-ragunathan
Copy link
Contributor

Note that there are two remaining tasks to call this feature complete:

  • Docker sets Cgroup Parent to be "docker" in daemon/execdriver/native/template/default_template_linux.go. This should be changed to "system.slice" in case of systemd cgroups.
  • For Docker, ScopePrefix should always be "docker".

@hqhq : Will you be taking care of this in a follow-up change? Thanks.

@derekwaynecarr
Copy link
Contributor

Whatever PR does the follow-up can a link be provided to those that subscribed this PR?

@vishh
Copy link
Contributor

vishh commented Dec 11, 2015

Docker sets Cgroup Parent to be "docker"

isn't this gonna be configurable henceforth at the daemon level?

On Fri, Dec 11, 2015 at 11:57 AM, Derek Carr [email protected]
wrote:

Whatever PR does the follow-up can a link be provided to those that
subscribed this PR?


Reply to this email directly or view it on GitHub
#336 (comment).

@mrunalp
Copy link
Contributor

mrunalp commented Dec 11, 2015

@anusha-ragunathan I am working on integrating this with docker. Should have a PR out today.

@mrunalp
Copy link
Contributor

mrunalp commented Dec 11, 2015

@vishh That is correct but the defaults need to be modified to be "system.slice" when started with --exec-opt native.cgroupdriver=systemd

@mrunalp
Copy link
Contributor

mrunalp commented Dec 11, 2015

Here is the docker PR moby/moby#18612

@hqhq hqhq deleted the hq_parent_cgroup_systemd branch December 14, 2015 01:33
stefanberger pushed a commit to stefanberger/runc that referenced this pull request Sep 8, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants