This guide shows you how to deploy GPU-accelerated LLM inference on your Mac using Metal (Apple's GPU framework) with Kubernetes orchestration.
Performance: Get Ollama-level speeds (roughly 40-120 tok/s, depending on model) with Kubernetes-native deployment! 🚀
- macOS with Apple Silicon (M1/M2/M3/M4) or Intel Mac with Metal 2+
- Recommended: M-series Mac with 16GB+ RAM
- Minikube with Docker driver

brew install minikube
minikube start --driver=docker

- llama.cpp with Metal support

brew install llama.cpp
- LLMKube CLI

# Install via Homebrew (recommended for macOS)
brew install defilantech/tap/llmkube

- LLMKube Operator

# Option 1: Install from GitHub (recommended)
kubectl apply -k https://github.com/defilantech/LLMKube/config/default

# Wait for the operator to be ready:
kubectl wait --for=condition=ready pod -l control-plane=controller-manager -n llmkube-system --timeout=60s

For development/testing from a branch:

# From the llmkube repository root
make deploy

# Wait for the operator to be ready:
kubectl wait --for=condition=ready pod -l control-plane=controller-manager -n llmkube-system --timeout=60s
- Metal Agent

# Build and install from source (recommended)
# From the llmkube repository:
make install-metal-agent

# Or download a pre-built binary from the releases page:
# https://github.com/defilantech/LLMKube/releases/latest
# Look for: llmkube-metal-agent_*_darwin_*.tar.gz
# 1. Check Metal support
system_profiler SPDisplaysDataType | grep "Metal"
# Should show: Metal Support: Metal 4 (or Metal 3/2)
# 2. Check llama-server is installed
which llama-server
# Should show: /usr/local/bin/llama-server
# 3. Check Metal agent is running
launchctl list | grep llmkube
# Should show: com.llmkube.metal-agent
# 3b. Check Metal agent health endpoint
curl http://localhost:9090/healthz
# Should show: ok
# 4. Check minikube is running
minikube status
# Should show: host: Running, kubelet: Running
# 5. Check kubectl works
kubectl get nodes
# Should show your minikube node

# Build the CLI from the repository (if testing from branch)
make build
# Deploy Llama 3.1 8B with Metal acceleration
./bin/llmkube deploy llama-3.1-8b --accelerator metal
# Output:
# 📚 Using catalog model: Llama 3.1 8B Instruct
# ℹ️ Metal acceleration: Using native llama-server (not containerized)
# 🚀 Deploying LLM inference service
# ...

Note: Use --accelerator metal explicitly to ensure Metal acceleration is used.
# Specify custom GPU layer offloading
./bin/llmkube deploy llama-3.1-8b --accelerator metal --gpu-layers 33
# Or with custom settings
llmkube deploy qwen-2.5-coder-7b \
--accelerator metal \
--gpu-layers 33 \
--memory 8Gi

# Deploy a custom model from a GGUF source
llmkube deploy my-custom-model \
--accelerator metal \
--source https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q8_0.gguf \
--quantization Q8_0 \
--gpu-layers 28

# Watch the deployment
kubectl get inferenceservices -w
# Check Metal agent health
curl http://localhost:9090/healthz # Liveness: "ok"
curl http://localhost:9090/readyz # Readiness: "ready" when processes are healthy
# Check Metal agent metrics
curl -s http://localhost:9090/metrics | grep llmkube_metal_agent
# Check Metal agent logs
tail -f /tmp/llmkube-metal-agent.log
# Monitor GPU usage
sudo powermetrics --samplers gpu_power -i 1000

Once deployed, test your inference service:
# The Metal agent runs llama-server on your Mac (port 8080)
# You can access it directly without Kubernetes
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"What is 2+2?"}],"max_tokens":20}'

Metal services are selectorless (no pods), so kubectl port-forward will not work.
To access the service from within the cluster, use Kubernetes DNS:
# From a pod inside the cluster (note the sanitized name: llama-3-1-8b)
curl http://llama-3-1-8b.default.svc.cluster.local:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"What is 2+2?"}]}'

You can also test from a temporary pod:
kubectl run curl-test --rm -it --image=curlimages/curl -- \
curl -s http://llama-3-1-8b.default.svc.cluster.local:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"What is 2+2?"}]}'

Note: The Kubernetes Service name is sanitized (dots replaced with dashes): llama-3.1-8b → llama-3-1-8b
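If you need to derive the sanitized Service name in a script, a simple character substitution reproduces the example above (this sketch assumes dots are the only characters that need replacing):

```shell
# Derive the sanitized Kubernetes Service name from a model name.
# Assumes dots are the only characters needing replacement, per the note above.
model="llama-3.1-8b"
service=$(echo "$model" | tr '.' '-')
echo "$service"   # → llama-3-1-8b
```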
On M4 Max (32 GPU cores):
| Model | Generation Speed | Prompt Processing | VRAM Usage |
|---|---|---|---|
| Llama 3.2 3B | 80-120 tok/s | 1000+ tok/s | 2-3 GB |
| Llama 3.1 8B | 40-60 tok/s | 500-800 tok/s | 5-8 GB |
| Mistral 7B | 45-65 tok/s | 600-900 tok/s | 5-7 GB |
| Qwen Coder 7B | 40-55 tok/s | 500-750 tok/s | 5-8 GB |
┌─────────────────────────────────────────────────┐
│ Your Mac (macOS) │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ Minikube (Kubernetes in VM) │ │
│ │ ✅ Creates InferenceService CRD │ │
│ │ ✅ Service points to host │ │
│ └──────────────────────────────────────────┘ │
│ ↕ │
│ ┌──────────────────────────────────────────┐ │
│ │ Metal Agent (Native Process) │ │
│ │ ✅ Watches K8s for InferenceService │ │
│ │ ✅ Downloads model from HuggingFace │ │
│ │ ✅ Spawns llama-server with Metal │ │
│ └──────────────────────────────────────────┘ │
│ ↕ │
│ ┌──────────────────────────────────────────┐ │
│ │ llama-server (Metal Accelerated) 🚀 │ │
│ │ ✅ Runs on localhost:8080+ │ │
│ │ ✅ Direct Metal GPU access │ │
│ │ ✅ OpenAI-compatible API │ │
│ └──────────────────────────────────────────┘ │
└─────────────────────────────────────────────────┘
If you see ImagePullBackOff when running make deploy:
# Build the controller image in minikube's Docker environment
eval $(minikube docker-env)
make docker-build
# Delete the failing pod to recreate it
kubectl delete pod -n llmkube-system -l control-plane=controller-manager
# Patch the deployment to use local images
kubectl patch deployment -n llmkube-system llmkube-controller-manager \
-p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","imagePullPolicy":"IfNotPresent"}]}}}}'

# Check logs
cat /tmp/llmkube-metal-agent.log
# Verify llama-server is installed
llama-server --version
# Restart the agent
launchctl unload ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
launchctl load ~/Library/LaunchAgents/com.llmkube.metal-agent.plist

The Metal Agent checks whether a model fits in memory before starting it. If your InferenceService shows InsufficientMemory:
# Check the status
kubectl get isvc -o wide
# See the detailed reason
kubectl get isvc <name> -o jsonpath='{.status.schedulingMessage}'

Fix options:
- Use a smaller quantization (Q4_K_M instead of Q8_0)
- Reduce contextSize in the InferenceService spec
- Increase the budget: restart the agent with --memory-fraction 0.9
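The idea behind the check can be approximated with shell arithmetic. This is a rough sketch only, not the agent's actual logic; the function name is illustrative, and the real agent also accounts for context size and runtime overhead:

```shell
# Rough memory-fit check: does a model of model_bytes fit within
# total_bytes * fraction? (Illustrative sketch, not the agent's real logic.)
fits_in_budget() {
  model_bytes=$1; total_bytes=$2; fraction=$3
  budget=$(awk -v t="$total_bytes" -v f="$fraction" 'BEGIN { printf "%d", t * f }')
  if [ "$model_bytes" -le "$budget" ]; then
    echo "fits"
  else
    echo "InsufficientMemory"
  fi
}

# Example: a 5 GB model on a 16 GB Mac with a 0.8 memory fraction
fits_in_budget 5000000000 17179869184 0.8    # → fits
# Example: a 15 GB model on the same machine
fits_in_budget 15000000000 17179869184 0.8   # → InsufficientMemory
```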
# Check disk space
df -h
# Check model store
ls -lh /tmp/llmkube-models/
# Manually download model
curl -L https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
-o /tmp/llmkube-models/llama-3.1-8b/model.gguf

Note: Service names are sanitized - dots are replaced with dashes!
# Check service exists (note the sanitized name: llama-3-1-8b)
kubectl get svc llama-3-1-8b
# Check endpoints
kubectl get endpoints llama-3-1-8b
# Check if llama-server is running
ps aux | grep llama-server
# Check Metal agent can reach K8s
kubectl get inferenceservices

# Verify Metal is being used
llama-server --version # Should show Metal support
# Check GPU utilization
sudo powermetrics --samplers gpu_power -i 1000
# Verify all layers offloaded to GPU
# Check Metal agent logs for "n-gpu-layers"
grep "n-gpu-layers" /tmp/llmkube-metal-agent.log

# Delete the inference service
llmkube delete llama-3.1-8b
# Stop Metal agent
launchctl unload ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
# Or completely uninstall
make uninstall-metal-agent

- Scale up: Try larger models (Mixtral 8x7B, Llama 70B)
- Production: Deploy multiple replicas for high availability
- Integration: Connect to your applications using OpenAI SDK
- Monitoring: Scrape localhost:9090/metrics with Prometheus for agent health and process metrics
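A minimal Prometheus scrape configuration for the agent's metrics endpoint might look like the fragment below; the job name is illustrative, and the port comes from the agent examples in this guide:

```yaml
scrape_configs:
  - job_name: llmkube-metal-agent     # job name is illustrative
    static_configs:
      - targets: ["localhost:9090"]   # Metal agent metrics port from this guide
```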
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="llama-3.1-8b",
messages=[
{"role": "user", "content": "Explain quantum computing"}
]
)
print(response.choices[0].message.content)

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.1-8b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a haiku about Kubernetes"}
],
"temperature": 0.7,
"max_tokens": 100
}'

- Documentation: https://github.com/defilantech/llmkube#metal-support
- Issues: https://github.com/defilantech/llmkube/issues
- Discussions: https://github.com/defilantech/llmkube/discussions
- Use Q5_K_M quantization - Best balance of quality and speed
- Offload all layers - Set --gpu-layers 99 for maximum Metal usage
- Close other apps - Free up unified memory and GPU resources
- Monitor temperature - Keep your Mac cool for sustained performance
- Use catalog models - Pre-optimized settings for your hardware
- Tune memory fraction - On a dedicated inference Mac, use --memory-fraction 0.9 to allow larger models
Congratulations! 🎉 You're now running Kubernetes-native LLM inference with Metal acceleration on your Mac!