feat: Kubernetes support on AppArmor-enabled host nodes #1643

@alexclewontin

Problem Statement

On Canonical Kubernetes clusters running on Ubuntu hosts (and possibly other Kubernetes distributions on other AppArmor-enabled hosts), OpenShell sandbox pods can be granted CAP_SYS_ADMIN and CAP_NET_ADMIN and still fail during supervisor startup, because the runtime/default AppArmor profile blocks the mount operations that ip netns add performs.

Observed on the local Canonical Kubernetes cluster:

  • Kubernetes: v1.32.13
  • OS image: Ubuntu 24.04.4 LTS
  • Kernel: 6.17.0-29-generic
  • Runtime: containerd://1.6.39

A normal sandbox without a localhost AppArmor profile reached CrashLoopBackOff with:

Network namespace creation failed and proxy mode requires isolation.
Ensure CAP_NET_ADMIN and CAP_SYS_ADMIN are available and iproute2 is installed.
Error: /usr/sbin/ip netns add sandbox-66ed3353 failed: mount --make-shared /run/netns failed: Permission denied

A minimal pod with the same relevant capabilities and no localhost AppArmor profile reproduced the same kernel denial:

+ mkdir -p /run/netns
+ ip netns add aa-default
mount --make-shared /run/netns failed: Permission denied

A straightforward fix proves the basic direction: load a localhost AppArmor profile on each node and apply that profile to sandbox pods. As-is, that approach is not safe for all Kubernetes users, because it unconditionally requires both the Localhost/openshell-supervisor profile and a privileged loader. Non-AppArmor, SELinux-first, or restricted clusters could fail even though they could otherwise run OpenShell without this AppArmor-specific workaround.

Proposed Design

Add conditional AppArmor support to the Kubernetes compute driver and Helm chart. The design splits into two pieces:

  1. The Kubernetes driver decides whether sandbox pods should request a localhost AppArmor profile.
  2. A node-local loader DaemonSet installs that profile onto nodes and advertises readiness through a node label.

Runtime behavior

When AppArmor is effectively enabled for a sandbox, the driver should inject the following:

securityContext:
  appArmorProfile:
    type: Localhost
    localhostProfile: openshell-supervisor
nodeSelector:
  openshell.ai/apparmor-supervisor: loaded

The node selector matters because Kubernetes requires Localhost profiles to already be loaded on the node where the pod lands. Kubernetes docs also note that the scheduler is not aware of loaded AppArmor profiles and recommend labeling nodes for profile availability.

This behavior should be controlled by configuration in the Kubernetes driver:

  • auto (default): use the OpenShell AppArmor profile only when at least one schedulable node is known to have successfully loaded it; otherwise create the existing non-AppArmor sandbox pod spec.
  • required: require a ready AppArmor node and fail sandbox creation with a clear precondition error if none exists.
  • disabled: never request a localhost AppArmor profile.

This split gives the desired behavior across cluster types: clusters that need it can pick up AppArmor automatically, while clusters without working AppArmor support continue to use the current non-AppArmor pod spec unless the operator explicitly asks for fail-closed behavior.
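
As a rough illustration of the three driver modes, the decision could be expressed as follows. This is a Python sketch written for brevity, not the driver's actual code. Node, apparmor_pod_patch, and AppArmorPreconditionError are hypothetical names. Only the label, the profile name, and the mode names come from this proposal.

```python
from dataclasses import dataclass, field
from typing import Optional

# Label and profile names are taken from the proposal.
READY_LABEL_KEY = "openshell.ai/apparmor-supervisor"
READY_LABEL_VALUE = "loaded"
PROFILE_NAME = "openshell-supervisor"


class AppArmorPreconditionError(Exception):
    """Hypothetical error for required mode when no ready node exists."""


@dataclass
class Node:
    """Hypothetical, minimal view of a Kubernetes node."""
    name: str
    labels: dict = field(default_factory=dict)
    schedulable: bool = True


def apparmor_pod_patch(mode: str, nodes: list) -> Optional[dict]:
    """Return the pod-spec fields to inject, or None to keep the existing spec."""
    if mode not in ("auto", "required", "disabled"):
        raise ValueError(f"unknown AppArmor mode: {mode!r}")
    if mode == "disabled":
        return None

    # A node counts as ready only if it is schedulable and carries the label.
    ready = any(
        n.schedulable and n.labels.get(READY_LABEL_KEY) == READY_LABEL_VALUE
        for n in nodes
    )
    if not ready:
        if mode == "required":
            raise AppArmorPreconditionError(
                f"no schedulable node is labeled {READY_LABEL_KEY}={READY_LABEL_VALUE}"
            )
        return None  # auto: fall back to the non-AppArmor pod spec

    return {
        "securityContext": {
            "appArmorProfile": {
                "type": "Localhost",
                "localhostProfile": PROFILE_NAME,
            }
        },
        "nodeSelector": {READY_LABEL_KEY: READY_LABEL_VALUE},
    }
```

In this sketch, auto and required differ only in what happens when no ready node is found: auto silently returns the existing spec, while required raises.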

Loader behavior

Add an AppArmor loader DaemonSet. It should use a dedicated shell-capable image rather than the distroless gateway image, and it should run under its own ServiceAccount and node-labeling RBAC. The loader DaemonSet should:

  1. Run only where cluster policy allows privileged host access.
  2. Load /etc/apparmor.d/openshell-supervisor on the host.
  3. Verify /sys/kernel/security/apparmor/profiles contains openshell-supervisor after apparmor_parser succeeds.
  4. Label the node openshell.ai/apparmor-supervisor=loaded only after both parser success and profile visibility verification.
  5. Remove or clear that label on unsupported/failure paths.
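
The verify-then-label logic in steps 3 to 5 can be sketched as below. This is an illustrative Python sketch, not the loader's real implementation (the proposal calls for a shell-capable image). profile_is_loaded and node_label_patch are hypothetical helpers. The sketch assumes each line of /sys/kernel/security/apparmor/profiles has the form "name (mode)", and that the label is written with a merge patch, where a null value removes the key.

```python
PROFILE_NAME = "openshell-supervisor"
LABEL_KEY = "openshell.ai/apparmor-supervisor"


def profile_is_loaded(profiles_text: str, name: str = PROFILE_NAME) -> bool:
    """Look for an exact profile name in the kernel's loaded-profile list.

    profiles_text is the content of /sys/kernel/security/apparmor/profiles.
    """
    for line in profiles_text.splitlines():
        # Split "name (mode)" at the last " (" so the name is compared exactly.
        loaded_name, sep, _mode = line.strip().rpartition(" (")
        if sep and loaded_name == name:
            return True
    return False


def node_label_patch(parser_ok: bool, profiles_text: str) -> dict:
    """Build a merge patch for the Node object.

    The label is set only when apparmor_parser succeeded AND the profile is
    visible; in every other case the value is None (JSON null), which clears
    the label.
    """
    value = "loaded" if parser_ok and profile_is_loaded(profiles_text) else None
    return {"metadata": {"labels": {LABEL_KEY: value}}}
```

Matching the full name rather than a substring avoids treating a similarly named profile, such as a leftover test profile, as proof that the expected one is loaded.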

The loader DaemonSet has its own mode:

  • auto (default): initially, this behaves the same as required. A future implementation could use Node Feature Discovery (NFD) to detect which nodes need the profile installed.
  • required: try to install the profile on all nodes, regardless of NFD deployment.
  • disabled: do not render the loader and do not attempt to install the profile.

The loader mode controls whether OpenShell tries to install the profile. The driver mode controls whether sandbox pods request the profile. The "just works" defaults should be loader.mode=auto and driver.appArmorMode=auto: clusters with AppArmor-enabled nodes get the profile when loading succeeds, while clusters where the loader cannot run fall back to the existing non-AppArmor pod spec. Setting loader.mode=disabled and driver.appArmorMode=required is guaranteed to fail.

Minimum AppArmor profile found during testing

Starting from a broad permissive profile for the supervisor, the profile can be stripped down for the current default sandbox path.

The profile below was verified end-to-end with a temporary gateway build that injected appArmorProfile.type=Localhost and localhostProfile=openshell-research into sandbox pods. With that build, this command succeeded and printed connected:

openshell sandbox create --name repro-aa-caps-setid --from base --no-auto-providers --no-tty -- /bin/sh -lc 'echo connected'

The verified profile:

#include <tunables/global>

profile openshell-supervisor flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>

  network,
  file,

  capability sys_admin,
  capability net_admin,
  capability sys_ptrace,
  capability syslog,
  capability setuid,
  capability setgid,

  mount options=(rw, bind) -> /run/netns/**,
  mount options=(rw, rbind) /run/netns/ -> /run/netns/,
  mount options=(rw, rbind) /run/netns -> /run/netns,
  mount options=(rw, rshared) -> /run/netns,
  mount options=(rw, rshared) -> /run/netns/,
  umount /run/netns/**,
}

Rules from the broader starting profile that were not required for the verified default path:

  • capability dac_override
  • capability dac_read_search
  • capability chown
  • deny mount -> /proc/**
  • deny mount -> /sys/**
  • deny mount -> /dev/**
  • mount options=(rw, rslave) -> /
  • umount /sys/
  • mount fstype=sysfs -> /sys/
  • signal (send, receive) peer=@{profile_name}
  • ptrace (trace, tracedby) peer=@{profile_name}

Important caveats:

  • network, is required for the actual sandbox, not just for the ip netns reproducer. Without it, the supervisor could create the namespace but failed to connect back to the gateway.
  • setuid and setgid are required even when they are not explicitly in the Kubernetes capabilities.add list, because they are part of the default Linux capability set unless the pod drops them. Without them, the sandbox failed after namespace setup with Invalid argument (os error 22).
  • dac_read_search may be needed when Kubernetes user namespaces are enabled because the driver intentionally adds DAC_READ_SEARCH for cross-UID /proc/<pid>/fd inspection in that mode. The final implementation should either include that capability in the profile unconditionally or render the profile according to the configured sandbox capability set.
  • This narrowing investigation only covered creating the sandbox. It did not cover running workloads in it, file sharing, or GPU or other hardware passthrough.
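
The second option in the dac_read_search caveat, rendering the profile from the configured capability set, might look roughly like this. It is a Python sketch under stated assumptions: render_profile is a hypothetical helper, capability names are assumed to arrive in Kubernetes style (for example DAC_READ_SEARCH, without a CAP_ prefix), and the fixed rules are copied from the verified profile above.

```python
# Capabilities and mount rules from the verified minimal profile.
BASE_CAPABILITIES = (
    "sys_admin", "net_admin", "sys_ptrace", "syslog", "setuid", "setgid",
)
MOUNT_RULES = (
    "mount options=(rw, bind) -> /run/netns/**",
    "mount options=(rw, rbind) /run/netns/ -> /run/netns/",
    "mount options=(rw, rbind) /run/netns -> /run/netns",
    "mount options=(rw, rshared) -> /run/netns",
    "mount options=(rw, rshared) -> /run/netns/",
    "umount /run/netns/**",
)


def render_profile(extra_capabilities=(), name="openshell-supervisor") -> str:
    """Render the profile text, adding any extra sandbox capabilities once."""
    caps = list(BASE_CAPABILITIES)
    for cap in extra_capabilities:
        cap = cap.lower()  # AppArmor rules use lowercase capability names
        if cap not in caps:
            caps.append(cap)

    lines = [
        "#include <tunables/global>",
        "",
        f"profile {name} flags=(attach_disconnected,mediate_deleted) {{",
        "  #include <abstractions/base>",
        "",
        "  network,",
        "  file,",
        "",
    ]
    lines += [f"  capability {cap}," for cap in caps]
    lines.append("")
    lines += [f"  {rule}," for rule in MOUNT_RULES]
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Calling render_profile(["DAC_READ_SEARCH"]) would add a capability dac_read_search rule for the user-namespace mode, while render_profile() reproduces the verified default profile.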

Alternatives Considered

Always apply Localhost/openshell-supervisor

Rejected. It fixes Ubuntu nodes with the profile loaded, but it breaks nodes where the profile is not loaded. Kubernetes rejects pods that request a missing localhost profile.

Disable AppArmor support by default and document an opt-in

Rejected as the default because it keeps clusters with AppArmor-enabled nodes broken until the operator discovers the workaround. It remains useful as an explicit disabled mode for restricted clusters or operators who do not want any privileged profile loader.

Node Feature Discovery integration

In theory, Node Feature Discovery (NFD) could be used to avoid running the privileged loader container on nodes where it is detectably unnecessary.

However, NFD by itself would only detect whether the node has AppArmor enabled, not whether our specific profile has been loaded.

Additionally, Node Feature Discovery is only compelling here if it reduces where the privileged loader pod is scheduled. Runtime NFD checks inside a loader pod do not materially reduce the security surface: the pod has already been scheduled with host access, and the loader's own local checks can already avoid mutating the host when AppArmor is inactive or unavailable. If using NFD requires adding CRD/discovery/NodeFeature read permissions to the loader ServiceAccount, it actually increases the RBAC surface of the loader pod.

Therefore, any NFD integration should meet these constraints:

  1. NFD must not be required for correctness. The loader must still work without NFD by probing each node where it runs. NFD is primarily useful for clusters that already run it as trusted node inventory.
  2. NFD should be consumed through node labels only. NFD should not expand the loader pod's runtime RBAC beyond nodes get,patch, which is the RBAC profile it already has.
  3. NFD should be used for scheduler-level loader placement, not just runtime skip logic.

Thus, the flow would look something like:

  1. If NFD CRDs are absent, Helm renders the loader DaemonSet without the openshell.ai/apparmor-configured=true node selector. The loader schedules broadly and self-detects AppArmor on each node.
  2. If NFD CRDs are present, Helm renders the OpenShell NodeFeatureRule and renders the loader DaemonSet with a node selector or node affinity for openshell.ai/apparmor-configured=true.
  3. If NFD later labels one or more nodes openshell.ai/apparmor-configured=true, the loader schedules on those nodes and performs the normal AppArmor self-check/load/verify path.
  4. If NFD never labels any node true because AppArmor is unavailable, the loader remains unscheduled. In auto driver mode, sandbox creation falls back to the non-AppArmor pod spec because no node receives openshell.ai/apparmor-supervisor=loaded. In required driver mode, sandbox creation fails with a clear diagnostic.
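
The placement logic in the flow above can be modeled in a few lines. This Python sketch is only a model of the intended behavior, not Helm templating or scheduler code. loader_node_selector and loader_nodes are hypothetical names, the node names are made up, and taints, tolerations, and the loader mode are ignored.

```python
NFD_LABEL_KEY = "openshell.ai/apparmor-configured"


def loader_node_selector(nfd_crds_present: bool) -> dict:
    """Selector rendered on the loader DaemonSet (steps 1 and 2)."""
    return {NFD_LABEL_KEY: "true"} if nfd_crds_present else {}


def loader_nodes(nodes: dict, selector: dict) -> list:
    """Names of the nodes the loader would be scheduled on (steps 3 and 4)."""
    return sorted(
        name
        for name, labels in nodes.items()
        if all(labels.get(key) == value for key, value in selector.items())
    )


# Hypothetical cluster: one node labeled by NFD, one not.
nodes = {"ubuntu-1": {NFD_LABEL_KEY: "true"}, "rhel-1": {}}
print(loader_nodes(nodes, loader_node_selector(False)))  # ['rhel-1', 'ubuntu-1']
print(loader_nodes(nodes, loader_node_selector(True)))   # ['ubuntu-1']
```

With NFD present and no labeled nodes, loader_nodes returns an empty list, which corresponds to step 4: no node ever receives openshell.ai/apparmor-supervisor=loaded.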

Limitation: breaks helm install/upgrade --wait

In the above flow, if NFD is enabled but no nodes support AppArmor, the loader DaemonSet will remain Pending forever. This will cause helm upgrade/install --wait to fail. This limitation by itself is likely enough to block the use of NFD integration today. Potential future Helm development may enable this behavior: helm/helm#12800

Limitation: NFD AppArmor detection

As of v0.18.3, NFD does not provide any built-in AppArmor detection. That is not necessarily a blocker, because NFD does allow custom labels through NodeFeatureRules. Those rules must work with the data NFD already exposes, and for our purposes they only need to avoid false negatives: the loader performs the final AppArmor detection, so a rule can safely include some extra nodes as long as it does not exclude nodes that really support AppArmor.

The most relevant built-in signals exposed today are kernel config values:

  • kernel.config.SECURITY_APPARMOR=y
  • kernel.config.DEFAULT_SECURITY_APPARMOR=y
  • kernel.config.LSM=landlock,lockdown,yama,integrity,apparmor

These values are useful, but they still describe kernel configuration rather than the active boot state. The LSMs active for the current boot are exposed by /sys/kernel/security/lsm, and boot parameters can override the configured default LSM list. Still, nodes with kernel.config.SECURITY_APPARMOR=n categorically cannot enable AppArmor, so limiting the loader to nodes with kernel.config.SECURITY_APPARMOR=y is a safe coarse filter.
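
The gap between "configured" and "active" can be made concrete with a small sketch. This is illustrative Python under stated assumptions: apparmor_configured and apparmor_active are hypothetical helpers, the kernel config is assumed to be available as a plain dict of the values listed above, and /sys/kernel/security/lsm is assumed to hold a comma-separated list of LSM names.

```python
def apparmor_configured(kernel_config: dict) -> bool:
    """Coarse filter: the kernel was built with AppArmor support."""
    return kernel_config.get("SECURITY_APPARMOR") == "y"


def apparmor_active(lsm_text: str) -> bool:
    """Precise check on the content of /sys/kernel/security/lsm."""
    return "apparmor" in [name.strip() for name in lsm_text.split(",")]


# A kernel built with AppArmor but booted with a different LSM list passes
# the coarse filter and fails the precise check.
config = {"SECURITY_APPARMOR": "y"}
print(apparmor_configured(config))                     # True
print(apparmor_active("lockdown,capability,selinux"))  # False
```

The coarse filter can produce false positives like the one shown, which is acceptable here because the loader repeats the precise check on the node before touching the host.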

NFD also provides a SELinux-enabled feature. Today, SELinux and AppArmor are mutually exclusive, but future LSM stacking work means we should avoid making SELinux-specific assumptions central to correctness. The existence of that feature does, however, suggest a possible willingness upstream to accept an AppArmor detection feature.

There is a more accurate alternative: NFD's local feature source can consume labels written by an external detector under /etc/kubernetes/node-feature-discovery/features.d/. An OpenShell detector could read /sys/kernel/security/lsm and /sys/kernel/security/apparmor/profiles and write openshell.ai/apparmor-active=true, which would be more accurate than kernel config matching. In practice, though, that means introducing another node-side detector with host access, which defeats much of the purpose of using NFD to reduce the privileged footprint of the AppArmor loader.

Agent Investigation

  • Loaded the openshell-cli skill for CLI workflows.
  • Reproduced the current failure on the live Canonical Kubernetes cluster with a normal OpenShell sandbox and with a minimal ip netns pod.
  • Wrote a permissive profile.
  • Temporarily built a gateway image that injected Localhost/openshell-research into sandbox pods, imported it into the cluster containerd, and verified that the permissive profile makes sandbox creation work.
  • Iteratively removed AppArmor profile rules and retested sandbox creation to identify the smaller profile above.
  • Installed NFD v0.18.3, confirmed the default labels do not expose AppArmor directly, confirmed raw NodeFeature data contains AppArmor kernel config values, and verified a NodeFeatureRule can create an AppArmor-configuration label.

Checklist

  • I've reviewed existing issues and the architecture docs
  • This is a design proposal, not a "please build this" request
