Weight Enclave for SL5-style GPU Inference
muinference implements a minimal Weight Enclave for secure AI model inference, following the SL5 (Security Level 5) security model. The host system is treated as untrusted; only the enclave handles model weights.
Prerequisites:
- Docker Desktop (Windows, macOS, or Linux)
- Python 3.10+ with pip (for local testing)
Test the enclave server directly without building the full Buildroot VM:
```bash
# Clone and enter directory
git clone https://github.com/luiscosio/muInference.git
cd muInference

# Create Python virtual environment
python -m venv venv
source venv/bin/activate    # Linux/macOS
# or: venv\Scripts\activate   # Windows

# Install dependencies
pip install pycryptodome torch transformers accelerate

# Terminal 1: Start enclave server
python external/muinference/rootfs_overlay/opt/enclave_server.py

# Terminal 2: Connect with host proxy
python scripts/host_proxy_and_attest.py --port 9000

# Type prompts and get responses!
```

Build everything using Docker (includes the Buildroot VM):
```bash
# Clone repository
git clone https://github.com/luiscosio/muInference.git
cd muInference

# Build the Buildroot enclave VM (first build: 30-60 min, cached afterward)
docker compose -f docker-compose.build.yml run buildroot

# Run attack surface analysis
docker compose -f docker-compose.build.yml run tools attack

# Run security scan
docker compose -f docker-compose.build.yml run tools security
```

An automated end-to-end test starts the server, connects, and runs inference:
```bash
# With the venv activated
python test/e2e_test.py
```

This project provides:
- muEnclave: A Buildroot-based minimal Linux VM that runs GPU inference with a radically reduced attack surface
- Host Proxy: Bandwidth-limited proxy with attestation for communicating with the enclave
- Baseline Stack: Realistic Docker-based inference for security comparison
- Security Tools: Scripts for comparing attack surface and running security tests
```
muinference/
├── docker/
│   ├── Dockerfile.buildroot           # Buildroot build environment
│   ├── Dockerfile.baseline-realistic  # Realistic baseline for comparison
│   ├── Dockerfile.tools               # Analysis tools (scc, trivy)
│   ├── analyze.sh                     # Analysis entrypoint
│   └── analyze_attack_surface.sh      # Full attack surface analysis
├── docker-compose.build.yml           # Build orchestration
├── external/
│   └── muinference/                   # Buildroot external tree
│       ├── board/x86_64/muinference/
│       │   ├── linux.config           # Minimal kernel config
│       │   └── post_build.sh
│       ├── rootfs_overlay/
│       │   ├── etc/muinference.conf
│       │   └── opt/enclave_server.py  # Main enclave server
│       └── muinference_defconfig
├── scripts/
│   ├── build_buildroot_enclave.sh     # Build the VM
│   ├── host_proxy_and_attest.py       # Host-side proxy
│   ├── exfil_timing_test.py           # Bandwidth analysis
│   └── fuzz_enclave_rpc.py            # Protocol fuzzer
├── baseline_stack/                    # Docker baseline for comparison
│   ├── Dockerfile
│   ├── docker-compose.yml
│   └── server.py
└── test/
    └── e2e_test.py                    # End-to-end test
```
The enclave uses a simple length-prefixed JSON protocol:
```
1. Enclave -> Host: {"measurement": "<sha256>"}            # Attestation
2. Host -> Enclave: <AES-GCM encrypted token>              # Unlock
3. Enclave -> Host: {"status": "ready"}                    # Ready
4. Host -> Enclave: {"prompt": "...", "max_new_tokens": N} # Request
5. Enclave -> Host: {"completion": "...", "elapsed_ms": N} # Response
```
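The length-prefixed framing can be sketched as follows. The 4-byte big-endian length prefix and the helper names are illustrative assumptions, not the project's exact wire format:

```python
import json
import socket
import struct

def send_msg(sock: socket.socket, obj: dict) -> None:
    """Serialize obj as JSON and send it with a 4-byte big-endian length prefix."""
    payload = json.dumps(obj).encode("utf-8")
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_msg(sock: socket.socket) -> dict:
    """Read a 4-byte length prefix, then exactly that many payload bytes."""
    header = _recv_exact(sock, 4)
    (length,) = struct.unpack(">I", header)
    return json.loads(_recv_exact(sock, length))

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Loop until n bytes arrive; a short recv() is not a full message."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection mid-message")
        buf += chunk
    return buf
```

Length-prefixing keeps parsing trivial inside the enclave: there is no streaming JSON parser to fuzz, only a fixed header and one `json.loads` call per message.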
Default Model: unsloth/Llama-3.2-3B-Instruct
- 3.2 billion parameters
- Instruction-tuned for chat/assistant tasks
- ~6GB in BF16 precision
- Optimized by Unsloth for faster inference
The model auto-downloads from Hugging Face on first run. To pre-download:
```bash
pip install huggingface_hub
huggingface-cli download unsloth/Llama-3.2-3B-Instruct
```

For offline deployment, download to a local directory:

```bash
huggingface-cli download unsloth/Llama-3.2-3B-Instruct --local-dir ./models/Llama-3.2-3B-Instruct
export MODEL_PATH=./models/Llama-3.2-3B-Instruct
```

Environment variables:
- MODEL_PATH: Path to model weights (default: download from Hugging Face)
- ENCLAVE_PORT: Listen port (default: 9000)
```bash
python scripts/host_proxy_and_attest.py \
  --host 127.0.0.1 \
  --port 9000 \
  --rate-kb 5 \
  --expected-measurement <hash>
```

muinference uses Docker's native caching:
- Docker images: Built once, reused automatically
- Buildroot cache: Stored in the `muinference-buildroot-cache` Docker volume
- Model weights: Cached by Hugging Face in `~/.cache/huggingface`
To force a full rebuild:

```bash
docker compose -f docker-compose.build.yml build --no-cache buildroot
```

To clear the Buildroot cache:

```bash
docker volume rm muinference-buildroot-cache
```

```bash
# Full attack surface analysis (recommended)
docker compose -f docker-compose.build.yml run tools attack

# Custom code LOC only
docker compose -f docker-compose.build.yml run tools loc

# Vulnerability scan
docker compose -f docker-compose.build.yml run tools security

# All analyses
docker compose -f docker-compose.build.yml run tools all
```

Make sure Docker Desktop is running:
```bash
docker --version
docker compose version
```

Pre-download the model:

```bash
pip install huggingface_hub
huggingface-cli download unsloth/Llama-3.2-3B-Instruct
```

```bash
# Remove Docker images and volumes
docker compose -f docker-compose.build.yml down --rmi all --volumes

# Rebuild from scratch
docker compose -f docker-compose.build.yml build --no-cache
docker compose -f docker-compose.build.yml run buildroot
```

┌─────────────────────────────────────────────────────────────────────────────┐
│ UNTRUSTED HOST │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Host Proxy │ │
│ │ • Attestation verification (SHA256 measurement) │ │
│ │ • Bandwidth limiting (5 KB/s egress) │ │
│ │ • AES-GCM encrypted unlock tokens │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ TCP :9000 │ (rate-limited) │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ WEIGHT ENCLAVE (VM) │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Buildroot Linux (~50MB rootfs) │ │ │
│ │ │ • Minimal kernel (Linux 6.6.22) │ │ │
│ │ │ • BusyBox userspace (~400K LOC) │ │ │
│ │ │ • No SSH, no package manager, single service │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ┌─────────────────────────────▼─────────────────────────────────┐ │ │
│ │ │ Enclave Server │ │ │
│ │ │ • Receives attestation challenge │ │ │
│ │ │ • Decrypts unlock token (AES-GCM) │ │ │
│ │ │ • Loads model weights (in-enclave only) │ │ │
│ │ │ • Runs inference (PyTorch/Transformers) │ │ │
│ │ └─────────────────────────────┬─────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ┌─────────────────────────────▼─────────────────────────────────┐ │ │
│ │ │ Model Weights (Protected) │ │ │
│ │ │ • Llama-3.2-3B-Instruct (~6GB BF16) │ │ │
│ │ │ • Never leave enclave boundary │ │ │
│ │ │ • Exfiltration: ~14 days at 5 KB/s │ │ │
│ │ └───────────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ┌─────────────────────────────▼─────────────────────────────────┐ │ │
│ │ │ GPU (VFIO Passthrough) │ │ │
│ │ │ • Direct hardware access inside enclave │ │ │
│ │ │ • Isolated from host via IOMMU │ │ │
│ │ └───────────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
INFERENCE FLOW
==============
User ──► Host Proxy ──► [Attestation Check] ──► Enclave Server
│ │
│ ┌──────────────────────────────┘
│ │
│ ▼
│ [Decrypt Unlock Token]
│ │
│ ▼
│ [Load Model Weights]
│ │
│ ▼
└──► [Rate-Limited Response] ◄── [GPU Inference]
(5 KB/s max)
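The 5 KB/s egress cap in the flow above can be sketched as a token bucket; the class name and pacing granularity here are assumptions about the proxy's internals, not its actual implementation:

```python
import time

class TokenBucket:
    """Pace egress to rate_bytes_per_s, allowing a burst of up to one second's quota."""

    def __init__(self, rate_bytes_per_s: float):
        self.rate = rate_bytes_per_s
        self.capacity = rate_bytes_per_s  # burst allowance: one second of traffic
        self.tokens = self.capacity       # bucket starts full
        self.last = time.monotonic()

    def throttle(self, nbytes: int) -> None:
        """Block until nbytes may be sent without exceeding the rate."""
        while True:
            now = time.monotonic()
            # Refill tokens in proportion to elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            # Sleep just long enough for the deficit to refill.
            time.sleep((nbytes - self.tokens) / self.rate)
```

With `TokenBucket(5 * 1024)`, a 6 GB weight file takes roughly two weeks to push through, which is what makes bulk exfiltration impractical while leaving small completions unaffected.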
muinference implements key SL5 Weight Enclave controls:
| Control | Implementation |
|---|---|
| Minimal OS | Buildroot with busybox, ~50MB rootfs (~500K LOC) |
| Allow-by-exception | Single port 9000, no SSH/management |
| Attestation | SHA256 measurement hash of server + kernel |
| Bandwidth limiting | 5 KB/s default egress rate |
| Weight protection | AES-GCM encrypted unlock, in-enclave only |
| Host isolation | KVM/QEMU VM boundary |
| GPU isolation | VFIO passthrough (optional) |
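The attestation measurement in the table above can be sketched as a single SHA-256 over the attested artifacts; the helper name and the example file paths are illustrative assumptions, not the enclave's exact scheme:

```python
import hashlib
from pathlib import Path

def compute_measurement(paths: list[str]) -> str:
    """Hash the listed artifacts, in order, into one SHA-256 measurement."""
    h = hashlib.sha256()
    for p in paths:
        h.update(Path(p).read_bytes())
    return h.hexdigest()

# Hypothetical artifact list; the real enclave measures its server and kernel.
# measurement = compute_measurement(["/opt/enclave_server.py", "/boot/bzImage"])
```

The host proxy compares this hex digest against `--expected-measurement`, so any change to the measured server or kernel yields a mismatch and the unlock token is never sent.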
- Minimal OS: Buildroot-based system with only essential packages (~500K LOC vs ~100M)
- Allow-by-Exception: Single network service, no SSH, no package manager at runtime
- Attestation: Host verifies enclave measurement before unlocking weights
- Bandwidth Limiting: Output rate-limited to make weight exfiltration impractical
- GPU Passthrough: NVIDIA GPU runs inside the enclave via VFIO (optional)
The key security benefit of muinference is the radical reduction in attack surface compared to typical inference deployments.
| Component | muEnclave | Typical Inference Server |
|---|---|---|
| Base OS | Buildroot + BusyBox | Ubuntu 22.04 Server |
| OS Packages | ~10-20 | ~500-800 |
| OS Binaries | ~100-200 | ~2,000-3,000 |
| OS LOC (est.) | ~500K | ~50-100M |
| Python Runtime | Minimal (inference only) | Full CPython |
| pip packages | 0 at runtime | 50-100+ |
| GPU Stack | VFIO passthrough | Full CUDA runtime (~500MB) |
| Network Services | 1 port (9000) | SSH, API, monitoring, etc. |
| Package Managers | None at runtime | apt + pip available |
| Image/Rootfs Size | ~50MB | ~10-20GB |
| Attack Surface Ratio | 1x | ~100-500x larger |
| Component | Lines of Code | Source |
|---|---|---|
| Linux kernel | ~30M | OpenHub |
| Ubuntu userspace | ~50-100M | Aggregate |
| BusyBox | ~400K | OpenHub |
| PyTorch | ~3M | GitHub |
| vLLM | ~100K | GitHub |
| Transformers | ~500K | GitHub |
At 5 KB/s bandwidth limit:
| Model | Size | Exfil Time |
|---|---|---|
| Llama-3.2-3B-Instruct (BF16) | 6 GB | ~14 days |
| LLaMA-70B (FP16) | 140 GB | ~324 days |
| GPT-3 175B (FP16) | 350 GB | ~2.2 years |
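The exfiltration times above follow directly from model size divided by the egress rate; a quick sanity check (using decimal GB and KB, as in the table):

```python
def exfil_days(size_gb: float, rate_kb_s: float = 5.0) -> float:
    """Days required to move size_gb gigabytes at rate_kb_s kilobytes per second."""
    seconds = size_gb * 1e9 / (rate_kb_s * 1e3)
    return seconds / 86_400  # seconds per day

for model, gb in [("Llama-3.2-3B (BF16)", 6),
                  ("LLaMA-70B (FP16)", 140),
                  ("GPT-3 175B (FP16)", 350)]:
    print(f"{model}: {exfil_days(gb):.0f} days")
```

Even halving the rate limit or doubling model size keeps the arithmetic linear, so operators can tune `--rate-kb` against their own threat window.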
The current implementation uses QEMU/KVM for VM isolation. The planned production architecture will use seL4 as the formally verified microkernel:
┌─────────────────────────────────────────────────────────────────┐
│ seL4 Microkernel (Host) │
│ Formally verified isolation anchor │
│ • Mathematical proof of isolation properties │
│ • ~10K LOC kernel (vs ~30M for Linux) │
│ • Eliminates entire classes of kernel vulnerabilities │
├─────────────────────────────────────────────────────────────────┤
│ CAmkES │
│ Component architecture framework │
├─────────────────────────────────────────────────────────────────┤
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Buildroot VM │ │ File Server │ │ Serial Server │ │
│ │ (Guest Linux) │ │ │ │ │ │
│ │ │ │ │ │ │ │
│ │ Candle │ │ │ │ │ │
│ │ │ │ │ │ │ │
│ └────────┬────────┘ └─────────────────┘ └─────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ GPU (VFIO) │ Why Linux VM is needed: │
│ │ │ • NVIDIA drivers require Linux │
│ └─────────────────┘ • seL4 cannot directly drive GPUs │
└─────────────────────────────────────────────────────────────────┘
- Formal Verification: Mathematical proof that the kernel correctly enforces isolation
- Minimal TCB: ~10,000 LOC vs ~30 million for Linux kernel
- No Undefined Behavior: Verified absence of buffer overflows, null pointer dereferences, etc.
- Proven Isolation: Components cannot interfere with each other except through defined interfaces
- Set up seL4 + CAmkES build environment
- Configure CAmkES VM component to host Buildroot Linux guest
- Implement file server for kernel/rootfs images
- Pass GPU through to guest VM via VFIO
- Integrate attestation with seL4's secure boot chain
See docs/ for detailed CAmkES VM setup instructions.
Apache 2.0