Skip to content

feat(seccomp): ExtraHandler — user-supplied syscall handlers#20

Open
dzerik wants to merge 1 commit intomultikernel:mainfrom
dzerik:feature/extra-handlers
Open

feat(seccomp): ExtraHandler — user-supplied syscall handlers#20
dzerik wants to merge 1 commit intomultikernel:mainfrom
dzerik:feature/extra-handlers

Conversation

@dzerik
Copy link
Copy Markdown

@dzerik dzerik commented Apr 24, 2026

Summary

Adds a public extension point for downstream crates that need to register their own seccomp-notification handlers alongside sandlock's builtin chroot/cow/procfs/network/port_remap logic.

Motivation. Downstream crates that want to intercept additional syscalls in the same supervisor task as sandlock's builtins have no clean way to do it today — one SECCOMP_FILTER_FLAG_NEW_LISTENER per process means a single listener, so a second supervisor cannot run alongside. The only alternative is forking sandlock or patching notif::supervisor wholesale.

API.

  • New type dispatch::ExtraHandler { syscall_nr, handler }.
  • New entry Sandbox::run_with_extra_handlers(policy, cmd, extras).
  • Existing Sandbox::run() delegates to it with empty extras — zero behaviour change for current callers.

Ordering contract (documented + tested).

  • Builtins register first (chroot path normalization, COW, procfs, …).
  • Extras appended last, in the Vec order.
  • Chain stops at first non-Continue — user handlers cannot subvert builtin confinement.

Docs.

  • docs/extension-handlers.md: design rationale, security boundary, panics policy, non-goals, downstream sketch.
  • crates/sandlock-core/examples/openat_audit.rs: runnable example.

Minor bump 0.6 → 0.7 suggested.

Test plan

  • 4 new unit tests in dispatch::extra_handler_tests (ctor, insertion order, append-after-builtin, empty-extras nop) — passing
  • All 215 unit tests pass
  • Example openat_audit.rs runs against a python3 -c guest

@dzerik dzerik force-pushed the feature/extra-handlers branch from 5f2b730 to 71c5724 Compare April 24, 2026 14:00
@congwang-mk
Copy link
Copy Markdown
Contributor

congwang-mk commented Apr 26, 2026

Thanks for the PR!

Two main issues:

  1. The mechanism (split builtins-first, extras-after; security boundary preserved) is the right shape.
    What's missing is the plumbing through to the BPF filter so that extras can actually intercept the
    syscalls they're written for. Without that, the API ships a footgun: registers handlers that silently
    never fire, with no error to tell the user why.

  2. Tests that actually exercise dispatch. The current tests verify Vec::push, not the security contract.
    At minimum, an integration test that registers an extra returning NotifAction::Errno(libc::EACCES) and
    confirms the child sees EACCES.

@dzerik dzerik force-pushed the feature/extra-handlers branch 2 times, most recently from b334ab3 to 1d0783d Compare April 27, 2026 07:41
Adds a public extension point for downstream crates that need to
register their own seccomp-notification handlers alongside sandlock's
builtin chroot/cow/procfs/network/port_remap logic.

Motivation: downstream crates that want to intercept additional
syscalls in the same supervisor task as sandlock's builtins have no
clean way to do it today — one SECCOMP_FILTER_FLAG_NEW_LISTENER per
process means a single listener, so a second supervisor cannot run
alongside.  The only alternative is forking sandlock or patching
notif::supervisor wholesale.

API:
- New type dispatch::ExtraHandler { syscall_nr, handler }.
- New entry Sandbox::run_with_extra_handlers(policy, cmd, extras).
- Existing Sandbox::run() delegates to it with empty extras — zero
  behaviour change for current callers.

Ordering contract (documented + tested):
- Builtins register first (chroot path normalization, COW, procfs, …).
- Extras appended last, in the Vec order.
- Chain stops at first non-Continue — user handlers cannot subvert
  builtin confinement.

BPF coverage (this is what plumbs extras to the kernel):
- Sandbox::do_spawn collects the syscall numbers from extra_handlers
  and threads them into the child via the new ChildSpawnArgs.extra_syscalls
  field on context::confine_child.
- The child merges them into notif_syscalls(policy) before
  bpf::assemble_filter, with sort + dedup so a syscall registered both
  by a builtin and an extra produces a single JEQ.
- Without this step the kernel would never raise USER_NOTIF for a
  syscall that has no builtin handler — the dispatch table would
  receive nothing and the user handler would silently never fire.

Default-deny bypass guard:
- The cBPF program emits notif JEQs before deny JEQs, so a syscall
  present in both lists hits SECCOMP_RET_USER_NOTIF first.  An extra
  on a DEFAULT_DENY syscall would therefore convert a kernel-deny into
  a user-supervised path, and a Continue from the handler would
  silently bypass deny.
- Sandbox::run_with_extra_handlers now validates extras against the
  policy's deny list at registration time via
  dispatch::validate_extras_against_policy and returns
  SandboxError::Child naming the offending syscall — no silent footgun.

Internals:
- build_dispatch_table now takes Vec<ExtraHandler> and drains it into
  register() calls after builtins.
- notif::supervisor signature extended to accept extras and pass them
  through.  sandbox.rs moves self.extra_handlers via std::mem::take
  on spawn (HandlerFn is Box<dyn Fn> — not Clone).
- confine_child's seven positional parameters packed into
  context::ChildSpawnArgs to keep the call site readable.

Docs:
- docs/extension-handlers.md: design rationale, security boundary,
  panics policy, non-goals, downstream sketch.  Adds §3.0 (BPF-filter
  merge semantics) and §3.0.1 (default-deny bypass guard); corrects
  the NotifAction variant table (ReturnValue, Kill { sig, pgid }).
- crates/sandlock-core/examples/openat_audit.rs: runnable example.

Tests:
- 4 unit tests on dispatch::extra_handler_tests (ctor, insertion
  order, append-after-builtin, empty-extras nop).
- 7 integration tests under tests/integration/test_extra_handlers.rs
  exercising the full kernel path:
  * extra on SYS_uname (not intercepted by any builtin) returning
    Errno(EACCES) reaches the guest;
  * Continue lets the kernel resume the syscall;
  * empty extras vector preserves baseline behaviour;
  * cross-handler ordering: extra on SYS_openat fires after the
    /proc-virtualization builtin returns Continue;
  * registration on SYS_mount (DEFAULT_DENY) is rejected up-front
    with a descriptive error;
  * builtin non-Continue blocks extra: openat on /proc/1/cmdline is
    rejected by the procfs builtin and is never observed by the
    extra (path inspected via process_vm_readv), while a peer
    openat on /etc/hostname is observed — proves the chain stops at
    first non-Continue end-to-end through the kernel;
  * chain of two extras on the same syscall: first returns Continue,
    second returns Errno(EACCES) — both counters increment in lock
    step (insertion order preserved) and the guest sees the EACCES.
- All 215 unit tests pass; the 178-test integration suite passes
  modulo the pre-existing 54-test failure set observed on origin/main
  (kernel/capability environment, unrelated to this change).

Minor bump 0.6 → 0.7 suggested.

Signed-off-by: dzerik <dzerik@gmail.com>
@dzerik dzerik force-pushed the feature/extra-handlers branch from 1d0783d to 431c207 Compare April 27, 2026 07:51
@dzerik
Copy link
Copy Markdown
Author

dzerik commented Apr 27, 2026

You were right on both counts — the missing BPF plumbing was the
load-bearing gap, and the original tests were verifying Vec::push
rather than the security contract. Reworked in 431c207, details below.

1. BPF plumbing

  • Sandbox::do_spawn collects the syscall numbers from the
    Vec<ExtraHandler> before fork and threads them into the child via a
    new context::ChildSpawnArgs.extra_syscalls field.
  • The child merges them into notif_syscalls(policy) before
    bpf::assemble_filter, with sort_unstable + dedup so a syscall
    registered by both a builtin and an extra produces a single JEQ.

While wiring this up I noticed an adjacent footgun: the cBPF program emits
notif JEQs before deny JEQs, so an extra registered on a syscall in
DEFAULT_DENY_SYSCALLS (e.g. mount) would convert a kernel-deny into a
user-supervised path and let a Continue from the handler bypass deny via
SECCOMP_USER_NOTIF_FLAG_CONTINUE. Sandbox::run_with_extra_handlers
now validates extras against the policy's deny list at registration time
and returns SandboxError::Child naming the offending syscall —
surfacing the "no error to tell the user why" gap you flagged.
Documented in §3.0.1.

2. Tests that actually exercise dispatch

The unit-level Vec::push checks remain (cheap regression cover); on top
there are now seven integration tests that drive the full kernel path:

  • extra_handler_intercepts_syscall_outside_builtin_set — the case you
    asked for: SYS_uname (not in any builtin's notif list under default
    policy), handler returns Errno(EACCES), the guest observes EACCES.
  • extra_handler_continue_lets_syscall_proceedContinue becomes
    SECCOMP_USER_NOTIF_FLAG_CONTINUE and the kernel resumes the syscall.
  • extra_handler_runs_after_builtin_returns_continueSYS_openat
    with the always-on /proc-virt builtin returning Continue for
    non-/proc paths, extra observes those openats.
  • builtin_non_continue_blocks_extra — symmetric half: openat on
    /proc/1/cmdline is rejected by the procfs builtin and is never
    observed by the extra, while a peer openat on /etc/hostname is.
    Handler reads the path via process_vm_readv so the assertion is
    structural rather than counter-based.
  • chain_of_extras_runs_in_insertion_order — two extras on SYS_uname,
    first returns Continue, second Errno(EACCES); counters increment
    in lock step (c1 == c2) and the guest sees the EACCES.
  • extra_handler_on_default_deny_syscall_is_rejectedSYS_mount
    registration is refused up-front with a descriptive error.
  • empty_extras_preserves_default_behaviour — backwards compatibility.

Cosmetic

confine_child had grown to seven positional parameters; packed into
ChildSpawnArgs so the call site stays readable.

Deliberately deferred

  • HandlerFn is still Box<dyn Fn ...> + Send + Sync rather than a
    trait Handler { async fn ... }; happy to convert if you'd prefer
    the trait shape, but it's an API-shape change that probably wants
    its own discussion.
  • HandlerPriority::Before (audit handlers seeing pre-builtin
    arguments) remains in ## 4 Non-goals in the doc — orthogonal,
    add when there's a concrete use case.
  • Panic isolation around user handlers stays the responsibility of
    the downstream crate (documented in §3.4 with a catch_unwind
    example) — wrapping every dispatch in catch_unwind by default
    would mask bugs.

Diff stat: 8 files, ~1046 / -9. All 215 unit tests pass; integration
suite passes modulo the same pre-existing 54-test failure set observed
on origin/main (kernel/capability env, unrelated to this change).
Latest head: 431c207.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants