-
-
Notifications
You must be signed in to change notification settings - Fork 4.2k
systemd-oomd #15206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
systemd-oomd #15206
Conversation
dschatzberg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good - see comments inline
keszybz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't look at the actual business logic at all... Some initial comments.
2e62d95 to
2185da1
Compare
poettering
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
heya! sorry for the late review. still working though a lot of stuff queued from when i was in parental leave.
it's not a complete review, but a start.
I generally like a lot what I am seeing here, good work. the stuff i found is generally minor.
|
|
||
| r = safe_atou64(value, &v); | ||
| if (r < 0) | ||
| return r; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as extra paranoia step we could fail here with an error if v is CGROUP_LIMIT_MAX here, to avoid ambiguity? dunno?
|
oh, i see now that some of those helpers already did get merged? ignore that part of my review then |
|
please rebase! in particular as those helpers apparently got merged, after all |
|
Hmm I thought I had already rebase-ed past the prior PR merges but I'll do that again when I make some updates. Thanks for the first round review |
anitazha
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working through the comments and should have another update soon but just addressing some comments
| <citerefentry><refentrytitle>systemd.directives</refentrytitle><manvolnum>7</manvolnum></citerefentry>, | ||
| <citerefentry><refentrytitle>systemd-oomd.service</refentrytitle><manvolnum>8</manvolnum></citerefentry> | ||
| </para> | ||
| </refsect1> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a simplification for this first version of the code since that's how the FB Oomd is typically set up now. We intend to support scopes and services too in a later iteration (though I don't think the code will be different at all).
hixiomh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull requested
5a2215d to
fa4b4d0
Compare
ba7859b to
e406de7
Compare
70d6f60 to
a382d1c
Compare
|
🤦 CI is greener now |
|
I think we have crossed the point where subsequent versions of this PR are less useful than merging it and doing follow-up PRs. I would propose the following plan:
Two reasons for this:
I scanned the patches again, and it looks all nice. Unless @poettering has any last-minute requests, let's go ahead with the above. |
systemd-oomd can be enabled when in developer mode (-Dmode=developer)
|
I pushed one commit to disable systemd-oomd by default and allow it to be enabled only in developer mode as suggested. I think we can add a NEWS entry in another PR. |
|
Yay! |
|
Woo! |
| @org.freedesktop.DBus.Property.EmitsChangedSignal("false") | ||
| readonly s ManagedOOMMemoryPressure = '...'; | ||
| @org.freedesktop.DBus.Property.EmitsChangedSignal("false") | ||
| readonly s ManagedOOMMemoryPressureLimitPercent = '...'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, so in the other cases where we expose scaled values (i.e. percent) we encode things on the wire in uint64_t values that are normalized to UINT32_MAX as 100%. i.e. 150% is mapped to be 6442450944 on the wire, 100% is 4294967296 and 50% is 2147483648. I think we should follow the same logic here.
See the MemoryMinScale/MemoryLowScale/TaskMaxScale properties for exampe.
hence, please rework this to be uint64_t, also use "Scale" as suffix in the name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I was looking to switching this to "Scale" but it doesn't seem applicable in this case. For the existing properties that are suffixed with "Scale", the percentages are translated to a final value, i.e. number of bytes or number of tasks. But MemoryPressureLimitPercent= is operated on solely as a percentage. PSI averages are shown as a ratio in percent so there is no translation to a whole number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, it's not really percent though, is it? it has two digits after the dot, i.e. its in units of 1/10,000th, displayed with a dot in the middle, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed the PSI average is displayed as a decimal but it represents a percent of time, per https://www.kernel.org/doc/html/latest/accounting/psi.html.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am not getting what you are trying to say. the dbus interface as merged now accepts percent ranges 0%…100% as integer, i.e. 1% granularity. The kernel otoh exports 0.00%…100.00% as fixed point decimal integer with 0.01% granularity, correct? Thus the D-Bus API loses two decimals of accuracy, no?
I still think we should expose this as 64bit value, with binary fixed point normalized to 2^32. It's more finegrained than the kernel accuracy, but accurate enough to not lose any accuracy compared to what the kernel exposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see what you mean now. Keep the finer level of granularity. Sure I can do that.
Suggested post-merge in systemd#15206
Post-merge suggestion from systemd#15206
Another suggestion from systemd#15206
Suggested post-merge in systemd#15206
Post-merge suggestion from systemd#15206
Another suggestion from systemd#15206
Currently, if an event source callback returns an error, we'll disable the event source and continue. This adds a per-event source flag that if turned on goes further: the event loop is also exited, propagating the error code. This is inspired by some patterns repeatedly seen in systemd#15206. The idea is that event sources that server the "primary" function of a program are marked like this, so that if they fail the failure is instantly propagated and terminates the program.
Suggested post-merge in systemd#15206
Post-merge suggestion from systemd#15206
Another suggestion from systemd#15206
Requested in systemd#15206 (comment), preserve the full granularity for memory pressure limits (permyriad) instead of capping out at percent.
| <para>The system must be running systemd with a full unified cgroup hierarchy for the expected cgroups-v2 features. | ||
| Furthermore, resource accounting must be turned on for all units monitored by <command>systemd-oomd</command>. | ||
| The easiest way to turn on resource accounting is by ensuring the values for <varname>DefaultCPUAccounting</varname>, | ||
| <varname>DefaultIOAccounting</varname>, <varname>DefaultMemoryAccounting</varname>, and | ||
| <varname>DefaultTasksAccounting</varname> are set to <constant>true</constant> in | ||
| <citerefentry><refentrytitle>systemd-system.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>.</para> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this accurate? Looking at the code, only the memory controller seems to be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue opened: #19331
Requested in systemd/systemd#15206 (comment), preserve the full granularity for memory pressure limits (permyriad) instead of capping out at percent.
systemd-oomd follows the current oomd model more closely than what was discussed at DevConf.CZ: oomd will periodically ask pid1 over varlink for a list of cgroups to monitor and use that to decide what/when to kill. The candidates for killing are the children of the monitored cgroups: either the recursive cgroups of a cgroup with memory.oom.group = 1, or the leaf cgroups. oomd will do the actual killing and set xattrs appropriately so the system manager can use that to determine if processes were killed by oomd. However, this limits enabling oomd to slices owned by pid1. But it does allow oomd to kill delegated cgroups.