v2: set memory.oom.group => OOMPolicy in systemd #41

Merged: AkihiroSuda merged 1 commit into opencontainers:main from tych0:memory-oom-group-support on Jun 24, 2026

Conversation

tych0 commented Sep 9, 2025

Member

We are interested in using memory.oom.group, but need it to be set via systemd because of [1], so let's set it there. There are a few caveats, in no particular order:

A. systemd does not allow OOMPolicy to be set on units that have already
started, so we must do this in Apply() instead of Set().
B. As the comment suggests, OOMPolicy has three states (continue, stop,
kill), where kill maps to memory.oom.group=1, and continue maps to =0.
However, the bit about runc update doesn't quite make sense: the
values will only ever be expressed in terms of memory.oom.group, so we
only need to map the continue and kill values, which have direct
mappings.

Note that runc update here doesn't make sense anyway: because of (A),
we cannot update these values. Perhaps we should reject these updates
since systemd will? (Or maybe we try to update and just error out, in
the event that systemd eventually allows this? The kernel allows
updating it, the reason the systemd semantics have diverged is unclear.)
C. systemd only gained support for setting OOMPolicy on scopes in
versions >= 253; versions before this will fail.

So, let's add a bit allowing the setup of OOMPolicy to Apply(), and ignore it in Set() -> genV2ResourcesProperties() -> unifiedResToSystemdProps().

[1]: This arguably is more important than the debug-level warning would suggest: if someone does the equivalent of a systemctl daemon-reload, systemd will reset our manually-via-cgroupfs set value to 0, because we did not explicitly set it in the service / scope definition, meaning that individual tasks will not actually oom the whole cgroup when they oom.
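
To make the mapping in (B) concrete, here is a minimal sketch, not the code from this PR. The helper name `oomPolicyFromUnified` is hypothetical; it only assumes that the unified resources arrive as a `map[string]string` and that `1` maps to `kill` and `0` maps to `continue`, as described above. Per (A), the resulting value would only be usable when the unit is created.

```go
package main

import (
	"fmt"
	"strings"
)

// oomPolicyFromUnified is a hypothetical helper (not the PR's actual code).
// It translates a cgroup v2 memory.oom.group value into the systemd
// OOMPolicy string described in the PR text. ok is false when the key is
// absent, in which case no OOMPolicy property should be set at all.
func oomPolicyFromUnified(unified map[string]string) (policy string, ok bool, err error) {
	v, found := unified["memory.oom.group"] // reading from a nil map is safe
	if !found {
		return "", false, nil
	}
	switch strings.TrimSpace(v) {
	case "1":
		return "kill", true, nil
	case "0":
		return "continue", true, nil
	default:
		// "stop" has no memory.oom.group equivalent, so it never appears here.
		return "", false, fmt.Errorf("invalid memory.oom.group value %q (want 0 or 1)", v)
	}
}

func main() {
	inputs := []map[string]string{
		{"memory.oom.group": "1"},
		{"memory.oom.group": "0"},
		nil,
	}
	for _, u := range inputs {
		policy, ok, err := oomPolicyFromUnified(u)
		fmt.Println(policy, ok, err)
	}
}
```

Returning an explicit `ok` keeps "not configured" separate from "configured as 0", which matters because an unset value should leave systemd's default alone.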

@tych0 tych0 force-pushed the memory-oom-group-support branch 2 times, most recently from d0609a7 to e2f6711 Compare September 9, 2025 17:23
Comment thread systemd/v2.go
// values for OOMPolicy (continue/stop).
fallthrough
// This was set before the unit started, so no need to
// warn about it here.

I wonder if the warning should be louder (maybe just more explicitly stating systemd might override this on a daemon-reload) for other uses that hit it since systemd will stomp on it as you highlight, or is that only for a subset of things that systemd has a knob for? I realize that's a bit out of scope for this PR, but sort of a tricky thing for folks to debug down to.

Member Author

For memory.oom.group specifically, I think we can (have to?) assume that it was set up in Apply() via this patch (modulo bugs). But re: your comment, I wonder if we should make the logging in the default case at least warning level, or maybe explicitly generate an error?

Comment thread systemd/v2.go
// to 0 in runc update, as there are two other possible
// values for OOMPolicy (continue/stop).
fallthrough
// This was set before the unit started, so no need to

Dumb question, do we know this to actually be true?

i.e. we've made it so we also process memory.oom.group in Apply(), which iiuc gets its config for this from when you set up a new manager, but for Set() we're sort of hoping the user already set it up initially and just avoid warning on it. The downside here is that if someone doesn't have it configured for Apply(), and then adds it later with Set(), systemd won't be in the loop and a daemon-reload will cause systemd to overwrite it with whatever OOMPolicy is set for the unit?

I guess I don't have a better answer for these sort of "must be done on unit creation" type settings unless we can warn iff we know the systemd prop isn't aligned already.

Member Author

> but for Set() we're sort of hoping the user already set it up initially and just avoid warning on it

Oh, I suppose what we should do is query the existing value, and warn/error if they mismatch? because you're right: if someone does a runc update with a new value for this, we cannot actually set it.
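
A rough sketch of that check, under stated assumptions: `checkOOMGroupUpdate` is a hypothetical helper, and `currentPolicy` is assumed to be the unit's existing OOMPolicy string, however it is obtained (for example from the unit's properties). Only the `kill` and `continue` mappings from the PR description are used; `stop` is reported as not comparable rather than guessed at.

```go
package main

import (
	"errors"
	"fmt"
)

// errOOMGroupMismatch is a hypothetical sentinel error for this sketch.
var errOOMGroupMismatch = errors.New("memory.oom.group cannot be changed after the unit has started")

// checkOOMGroupUpdate compares the memory.oom.group value requested in an
// update against the value implied by the unit's current OOMPolicy.
// It returns nil only when the two already agree.
func checkOOMGroupUpdate(currentPolicy, requested string) error {
	var implied string
	switch currentPolicy {
	case "kill":
		implied = "1"
	case "continue":
		implied = "0"
	default:
		// e.g. "stop": no direct memory.oom.group mapping is described.
		return fmt.Errorf("OOMPolicy %q has no direct memory.oom.group mapping", currentPolicy)
	}
	if requested != implied {
		return fmt.Errorf("%w: unit has OOMPolicy=%s (memory.oom.group=%s), requested %s",
			errOOMGroupMismatch, currentPolicy, implied, requested)
	}
	return nil
}

func main() {
	fmt.Println(checkOOMGroupUpdate("kill", "1"))     // values agree
	fmt.Println(checkOOMGroupUpdate("continue", "1")) // values disagree
}
```

Whether a mismatch should be a warning or a hard error is the open question in this thread; the sketch returns an error so the caller can decide.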

halaney commented Sep 10, 2025

super nit:

> [1]: This arguably is more important than the debug-level warning would
> suggest: if someone does the equivalent of a `systemctl daemon-reload`,
> systemd will reset our manually-via-cgroupfs set value to 0, because we did
> not explicitly set it in the service / scope definition, meaning

left me on the edge of my seat waiting for that conclusion!

@tych0 tych0 force-pushed the memory-oom-group-support branch from e2f6711 to 0080fa4 Compare September 10, 2025 18:10

tych0 commented Sep 10, 2025

Member Author

> left me on the edge of my seat waiting for that conclusion!

ha, derp. I updated, thanks.

Comment thread systemd/systemd_test.go Outdated
}

unitName := getUnitName(config)
conn, err := systemdDbus.NewSystemdConnectionContext(context.Background())

Contributor

nit: you can probably use t.Context() here.

Contributor

Yes you can, it was added in go1.24 and this is the minimal version we require in go.mod

Comment thread systemd/systemd_test.go Outdated
}
defer conn.Close()

properties, err := conn.GetUnitPropertiesContext(context.Background(), unitName)

Contributor

ditto

Comment thread systemd/systemd_test.go

@kolyshkin kolyshkin left a comment

Contributor

Overall this makes sense; left some nits. I also wish we had a place to document that.

Comment thread systemd/v2.go Outdated

properties = append(properties, c.SystemdProps...)

if c.Resources != nil && c.Resources.Unified != nil {

Contributor

  1. Would be nice to have a comment saying that systemd does not allow OOMPolicy to be set on units that have already started, so we do this here in Apply() instead of Set().
  2. You don't have to check for c.Resources.Unified != nil, as querying a nil map is fine.
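
To illustrate point 2, a small sketch with a stand-in `resources` type (hypothetical, not the library's real struct): in Go, reading from a nil map returns the zero value and `false`, so only the pointer needs a nil check.

```go
package main

import "fmt"

// resources is a stand-in type for this sketch only.
type resources struct {
	Unified map[string]string
}

// oomGroup looks up memory.oom.group. It guards against a nil pointer,
// but needs no guard for a nil Unified map: lookups on a nil map are safe.
func oomGroup(r *resources) (string, bool) {
	if r == nil {
		return "", false
	}
	v, ok := r.Unified["memory.oom.group"]
	return v, ok
}

func main() {
	fmt.Println(oomGroup(nil))
	fmt.Println(oomGroup(&resources{})) // Unified is nil here
	fmt.Println(oomGroup(&resources{Unified: map[string]string{"memory.oom.group": "1"}}))
}
```

Only writes to a nil map panic, so the extra check matters on the write path but not for lookups like this one.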

@kolyshkin

Contributor

Needs a rebase

We are interested in using memory.oom.group, but need it to be set via
systemd because of [1], so let's set it. There are a few caveats, in no
particular order:

A. systemd does not allow OOMPolicy to be set on units that have already
   started, so we must do this in Apply() instead of Set().
B. As the comment suggests, OOMPolicy has three states (continue, stop,
   kill), where kill maps to memory.oom.group=1, and continue maps to =0.
   However, the bit about `runc update` doesn't quite make sense: the
   values will only ever be expressed in terms of memory.oom.group, so we
   only need to map the continue and kill values, which have direct
   mappings.

   Note that `runc update` here doesn't make sense anyway: because of (A),
   we cannot update these values. Perhaps we should reject these updates
   since systemd will? (Or maybe we try to update and just error out, in
   the event that systemd eventually allows this? The kernel allows
   updating it, the reason the systemd semantics have diverged is unclear.)
C. systemd only gained support for setting OOMPolicy on scopes in versions
   >= 253; versions before this will fail.

So, let's add a bit allowing the setup of OOMPolicy to Apply(), and ignore
it in Set() -> genV2ResourcesProperties() -> unifiedResToSystemdProps().

[1]: This arguably is more important than the debug-level warning would
suggest: if someone does the equivalent of a `systemctl daemon-reload`,
systemd will reset our manually-via-cgroupfs set value to 0, because we did
not explicitly set it in the service / scope definition, meaning that
individual tasks will not actually oom the whole cgroup when they oom.

Co-authored-by: Ethan Adams <eadams@netflix.com>
Signed-off-by: Tycho Andersen <tandersen@netflix.com>
[halaney: Address review comments and fix unit test]
Signed-off-by: Andrew Halaney <ahalaney@netflix.com>
@halaney halaney force-pushed the memory-oom-group-support branch from 0080fa4 to 2e81f83 Compare June 23, 2026 22:15
@halaney halaney requested a review from a team as a code owner June 23, 2026 22:15

halaney commented Jun 23, 2026

We're kind of playing hot potato with this PR, but I've updated it and addressed all comments I believe :)

@kolyshkin can you approve the github workflows or whatever so we can get this thru CI?

@AkihiroSuda AkihiroSuda merged commit da6d3c9 into opencontainers:main Jun 24, 2026
14 checks passed

halaney commented Jun 24, 2026

@kolyshkin shall I cut a PR for this in runc? I see we have some tagging in cgroups here, not sure what the process is to get this integrated in runc proper

@kolyshkin

Contributor

> @kolyshkin shall I cut a PR for this in runc? I see we have some tagging in cgroups here, not sure what the process is to get this integrated in runc proper

Is there anything that can be done from the runc side (except for oc/cgroups version bump and the changelog entry)? If not, I will do it once I'm back from vacation (in 10 days or so).

halaney commented Jun 24, 2026

> Is there anything that can be done from the runc side (except for oc/cgroups version bump and the changelog entry)? If not, I will do it once I'm back from vacation (in 10 days or so).

Nothing to do outside of bumping the package, if you wanna do it that's fine by me, wasn't sure if there was process around tagging / etc before bumping! I'll bother you post vacation :)

@kolyshkin

Contributor

> > Is there anything that can be done from the runc side (except for oc/cgroups version bump and the changelog entry)? If not, I will do it once I'm back from vacation (in 10 days or so).
>
> Nothing to do outside of bumping the package, if you wanna do it that's fine by me, wasn't sure if there was process around tagging / etc before bumping! I'll bother you post vacation :)

Yes, we'll need a tag first of course, I just wanted to wait for #62 and #64 first, for less noise / less frequent releases.

4 participants