feat(jailer): Linux defense-in-depth: user namespaces, AppArmor, bwrap mounts, seccomp #256

Merged
DorianZheng merged 11 commits into main from refactor/extract-advanced-options
Feb 16, 2026
@DorianZheng
Member

Summary

  • Chrome-style user namespace probe: Detects clone3/CLONE_NEWUSER support with targeted diagnostics when it fails (kernel config, sysctl, AppArmor denials)
  • Auto-generate AppArmor profile: Creates and loads a permissive AppArmor profile for bundled bwrap so it works on Ubuntu/Debian systems with AppArmor restrictions on user namespaces
  • Bwrap rootfs/volume mounts: Bind-mounts the rootfs and user volumes into the bwrap sandbox so the shim process can access them
  • SIGSYS crash capture: Records seccomp SIGSYS in CrashCapture for better diagnostics when syscall filtering blocks something
  • Go runtime syscalls: Adds Go runtime syscalls (futex, clone3, rseq, etc.) to x86_64 VMM seccomp filters so gvproxy runs without SIGSYS
  • Two-phase stacked seccomp: Splits seccomp into a permissive gvproxy phase and a restrictive VMM phase, applied in sequence during shim startup

Test plan

  • Clippy clean on macOS (cargo clippy -p boxlite -p boxlite-python -p boxlite-node -- -D warnings)
  • Node SDK tests pass (cargo test -p boxlite-node — 5/5)
  • Formatting clean (cargo fmt --check)
  • E2E on GCP Linux (AppArmor + user namespace probe + seccomp)

@DorianZheng DorianZheng force-pushed the refactor/extract-advanced-options branch 6 times, most recently from 008c45e to 110ab3b Compare February 16, 2026 07:40
…tics

Port Chrome's CanCreateProcessInNewUserNS() and CheckCloneNewUserErrno()
from sandbox/linux/services/credentials.cc. Dual-probe approach:
1. Raw clone(CLONE_NEWUSER) for kernel-level errno diagnosis
2. bwrap --unshare-user for actual bwrap capability (handles AppArmor
   per-binary profiles where bwrap may work even if our clone fails)

When bwrap fails, build_diagnostic() combines Chrome errno + sysctl
detection to provide targeted fix commands for each scenario:
- AppArmor restrict_unprivileged_userns (Ubuntu 23.10+)
- kernel.unprivileged_userns_clone (Debian/older distros)
- user.max_user_namespaces (RHEL/CentOS)
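The sysctl half of that diagnostic could look roughly like the sketch below. It is not the PR's `build_diagnostic()`: the struct, the function names, the `sysctl -w` fix strings, and the `15000` value are all illustrative assumptions, and the errno half of the probe is left out. Only the three sysctl names come from the commit message.

```rust
/// Snapshot of the three sysctls named in the commit message.
/// `None` means the sysctl does not exist on this kernel.
/// (Hypothetical type, not taken from the PR.)
#[derive(Debug, Default, Clone, Copy)]
pub struct UsernsSysctls {
    pub apparmor_restrict_unprivileged_userns: Option<u32>,
    pub unprivileged_userns_clone: Option<u32>,
    pub max_user_namespaces: Option<u32>,
}

/// Read an integer sysctl such as "kernel.unprivileged_userns_clone"
/// from /proc/sys. Returns `None` if it is missing or not a number.
pub fn read_sysctl(name: &str) -> Option<u32> {
    let path = format!("/proc/sys/{}", name.replace('.', "/"));
    std::fs::read_to_string(path).ok()?.trim().parse().ok()
}

/// Map the sysctl state to one suggested fix command.
/// The command strings are assumptions made for this sketch.
pub fn suggest_fix(s: &UsernsSysctls) -> Option<&'static str> {
    if s.apparmor_restrict_unprivileged_userns == Some(1) {
        // Ubuntu 23.10+: AppArmor restricts unprivileged user namespaces.
        return Some("sudo sysctl -w kernel.apparmor_restrict_unprivileged_userns=0");
    }
    if s.unprivileged_userns_clone == Some(0) {
        // Debian and older distros: unprivileged clone is switched off.
        return Some("sudo sysctl -w kernel.unprivileged_userns_clone=1");
    }
    if s.max_user_namespaces == Some(0) {
        // RHEL/CentOS: the namespace limit is zero. 15000 is an arbitrary
        // illustrative value.
        return Some("sudo sysctl -w user.max_user_namespaces=15000");
    }
    None
}
```

The checks run from most specific to least specific, so a host that has more than one restriction gets the AppArmor hint first.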
When bundled bwrap fails on Ubuntu 23.10+ with
kernel.apparmor_restrict_unprivileged_userns=1, generate an AppArmor
profile at ~/.boxlite/apparmor/boxlite-bwrap and include the
`sudo apparmor_parser -r` command in the diagnostic.

- Add apparmor.rs with generate_bwrap_profile() and write_bwrap_profile()
- Profile mirrors Ubuntu's bwrap-userns-restrict with unique names
  (boxlite_bwrap/boxlite_unpriv_bwrap) to avoid collision
- Caller in bwrap.rs computes apparmor_dir (Minimal Knowledge)
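As a rough illustration of the path and command handling described above, the sketch below builds the profile location and the reload command. The function names are hypothetical and are not the PR's `generate_bwrap_profile()` or `write_bwrap_profile()`; only the `~/.boxlite/apparmor/boxlite-bwrap` location and the `sudo apparmor_parser -r` command come from the commit message. The profile body itself is deliberately not reproduced here.

```rust
use std::path::{Path, PathBuf};

/// Location of the generated profile under the user's home directory,
/// i.e. ~/.boxlite/apparmor/boxlite-bwrap. (Hypothetical helper.)
pub fn bwrap_profile_path(home: &Path) -> PathBuf {
    home.join(".boxlite").join("apparmor").join("boxlite-bwrap")
}

/// Command the diagnostic would ask the user to run so that AppArmor
/// loads (or reloads, via -r) the generated profile.
pub fn load_command(profile: &Path) -> String {
    format!("sudo apparmor_parser -r {}", profile.display())
}
```

Keeping the path computation in the caller, as the commit describes, means the profile generator never needs to know where the home directory is.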
The bwrap sandbox was missing two critical mount categories:
1. ~/.boxlite/rootfs (ro) - VM init rootfs (Alpine bootstrap)
2. User volume host_paths - from BoxOptions.volumes

Without the rootfs mount, libkrun couldn't boot the VM inside bwrap,
causing the shim to exit immediately.
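A minimal sketch of how those two mount categories might be turned into bwrap arguments follows. `VolumeMount` is a hypothetical stand-in for an entry of `BoxOptions.volumes`, and mounting each path at the same location inside the sandbox is an assumption of this sketch. `--ro-bind SRC DEST` and `--bind SRC DEST` are standard bwrap options.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for one entry of `BoxOptions.volumes`.
pub struct VolumeMount {
    pub host_path: PathBuf,
    pub read_only: bool,
}

/// Append `<flag> <src> <dest>`, using the same path on both sides so the
/// sandboxed process sees the directory where it expects it.
fn push_bind(args: &mut Vec<String>, flag: &str, path: &Path) {
    let p = path.display().to_string();
    args.push(flag.to_string());
    args.push(p.clone());
    args.push(p);
}

/// Build the bwrap mount arguments: the rootfs read-only, then each volume.
pub fn bwrap_mount_args(rootfs: &Path, volumes: &[VolumeMount]) -> Vec<String> {
    let mut args = Vec::new();
    push_bind(&mut args, "--ro-bind", rootfs);
    for v in volumes {
        let flag = if v.read_only { "--ro-bind" } else { "--bind" };
        push_bind(&mut args, flag, &v.host_path);
    }
    args
}
```

For a rootfs at `/home/alice/.boxlite/rootfs` and one writable volume at `/data`, this yields `--ro-bind /home/alice/.boxlite/rootfs /home/alice/.boxlite/rootfs --bind /data /data`.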
@DorianZheng DorianZheng force-pushed the refactor/extract-advanced-options branch 2 times, most recently from 9dd5432 to 9699056 Compare February 16, 2026 07:52
Add explicit unsafe block in credentials.rs for unsafe-op-in-unsafe-fn
compliance and fix cargo fmt formatting in bwrap.rs.

Separate Go runtime syscalls from VMM seccomp filter using Linux's
seccomp filter stacking semantics. Previously, the VMM filter contained
~100 syscalls (VMM + Go runtime combined). Now:

- VMM filter: ~66 unique syscalls (original Firecracker + libkrun)
- Gvproxy filter: ~106 unique syscalls (strict superset, includes Go runtime)

Two-phase application in shim:
  1. Apply gvproxy filter with TSYNC (before gvproxy creation)
  2. Create gvproxy → Go threads inherit permissive filter
  3. Stack VMM filter on main thread only (no TSYNC)
  4. krun_start_enter → vCPU threads inherit both from main

Stacked filters evaluate as intersection (most restrictive wins).
Since VMM ⊂ gvproxy, effective filter on main/vCPU = VMM.
Go threads keep only the gvproxy filter (more permissive).

Without gvproxy, VMM filter applied with TSYNC (original behavior).
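The intersection argument can be checked with a toy model that treats each filter as a set of allowed syscall names. This is only a model of the semantics, not the shim's seccomp code, and the two sample allowlists are tiny illustrative subsets chosen for the sketch rather than the real filter contents.

```rust
use std::collections::BTreeSet;

/// A filter modeled as the set of syscall names it allows.
pub type Allowlist = BTreeSet<&'static str>;

/// Toy VMM allowlist (illustrative subset only).
pub fn sample_vmm() -> Allowlist {
    ["ioctl", "read", "write"].iter().copied().collect()
}

/// Toy gvproxy allowlist: everything in the VMM list plus a few of the
/// Go runtime syscalls named in the PR summary.
pub fn sample_gvproxy() -> Allowlist {
    ["ioctl", "read", "write", "futex", "clone3", "rseq"]
        .iter()
        .copied()
        .collect()
}

/// A syscall passes a stack of allowlist filters only if every filter
/// allows it, so the effective allowlist is the intersection.
pub fn stack(outer: &Allowlist, inner: &Allowlist) -> Allowlist {
    outer.intersection(inner).copied().collect()
}
```

With these samples, `stack(&sample_gvproxy(), &sample_vmm())` equals `sample_vmm()`, which matches the claim that the main and vCPU threads end up with the VMM filter, while a thread holding only the gvproxy filter can still call `futex`.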
The Firecracker-derived VMM filter (45 syscalls) was fundamentally
inadequate for libkrun on modern glibc (2.38+):

1. Missing modern glibc equivalents: glibc rewrites open→openat,
   stat→newfstatat, etc. The filter had legacy names but not modern ones.

2. Missing libkrun runtime syscalls: libkrun needs mprotect, bind,
   listen, clone3, pread64, etc. for VM setup and operation.

3. Missing thread init syscalls: vCPU threads created after seccomp
   need set_tid_address, rseq, arch_prctl for pthread initialization.

4. Insufficient ioctl coverage: Firecracker used 8 specific KVM ioctls;
   libkrun requires 28+ for VM creation and vCPU management.

Adds 47 entries to VMM filter (86 unique syscalls, up from 45).
Verified: vmm ⊂ gvproxy (all VMM syscalls are in gvproxy superset).

This fixes a pre-existing SIGSYS crash (exit code 159) on Linux when
seccomp is enabled — the shim was killed on the first openat() call.
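Exit code 159 follows the usual shell convention of 128 plus the signal number, and SIGSYS is signal 31 on Linux x86_64 and aarch64. A small sketch of that decoding is below; the function names are hypothetical and this is not the PR's `CrashCapture` code.

```rust
/// SIGSYS on Linux x86_64 and aarch64. Some other architectures use a
/// different number, so this constant is an assumption of the sketch.
pub const SIGSYS: i32 = 31;

/// Shells report "killed by signal N" as exit code 128 + N.
/// Returns the signal number if the code falls in that range.
pub fn signal_from_exit_code(code: i32) -> Option<i32> {
    if (129..=255).contains(&code) {
        Some(code - 128)
    } else {
        None
    }
}

/// True if the exit code looks like a seccomp kill (SIGSYS).
pub fn is_seccomp_kill(code: i32) -> bool {
    signal_from_exit_code(code) == Some(SIGSYS)
}
```

Here `signal_from_exit_code(159)` gives `Some(31)`, so `is_seccomp_kill(159)` is true, while 137 (SIGKILL) is not classified as a seccomp kill.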
Remove unused imports (SecurityOptions, FilesystemLayout) from
linux/mod.rs and add explicit unsafe block in credentials.rs for
Rust 2024 edition compliance (unsafe-op-in-unsafe-fn).
…SYNC

The VMM filter was expanded to 106 syscalls covering both libkrun and Go
runtime needs. The gvproxy filter (107 syscalls) was a strict superset
differing only by the `seccomp` syscall needed for two-phase stacking.

With a single TSYNC application after gvproxy creation, the `seccomp`
syscall is no longer needed and the gvproxy filter becomes redundant.

- Remove gvproxy section from seccomp JSON (~650 lines)
- Remove SeccompRole::Gvproxy variant and apply_gvproxy_filter()
- Simplify apply_vmm_filter() to always use TSYNC (remove tsync param)
- Remove two-phase stacking logic from shim main.rs
…helper

The previous commit removed `linux::apply_isolation()` but missed
updating the `PlatformIsolation` trait impl that called it. Inline
the logic directly: call `apply_vmm_filter()` when seccomp is enabled.

The `layout: &FilesystemLayout` parameter was unused in all three
platform implementations (Linux, macOS, Unsupported). No external
code called this trait method. Clean up the dead parameter.
@DorianZheng DorianZheng force-pushed the refactor/extract-advanced-options branch from 9699056 to 60fe6f8 Compare February 16, 2026 07:55
@DorianZheng DorianZheng changed the base branch from main to feat/jailer-bwrap-improvements February 16, 2026 07:58
Base automatically changed from feat/jailer-bwrap-improvements to main February 16, 2026 08:00

Save pre-modification Firecracker-derived filters as *.original.json
for reference. Add TODO noting the current VMM filter is intentionally
broad (all arg restrictions removed) and should be tightened once
libkrun's actual syscall arg patterns are profiled.
@DorianZheng DorianZheng merged commit fc1fc78 into main Feb 16, 2026
14 checks passed
@DorianZheng DorianZheng deleted the refactor/extract-advanced-options branch February 16, 2026 08:08