Metal Quickstart - Local LLM Inference on Apple Silicon

This guide shows you how to deploy GPU-accelerated LLM inference on your Mac using Metal (Apple's GPU framework) with Kubernetes orchestration.

Performance: Get Ollama-level speeds (60-120 tok/s) with Kubernetes-native deployment! 🚀

Prerequisites

Hardware

  • macOS with Apple Silicon (M1/M2/M3/M4) or Intel Mac with Metal 2+
  • Recommended: M-series Mac with 16GB+ RAM

Software

  1. Minikube with Docker driver

    brew install minikube
    minikube start --driver=docker
  2. llama.cpp with Metal support

    brew install llama.cpp
  3. LLMKube CLI

    # Install via Homebrew (Recommended for macOS)
    brew install defilantech/tap/llmkube
  4. LLMKube Operator

    # Option 1: Install from GitHub (Recommended)
    kubectl apply -k https://github.com/defilantech/LLMKube/config/default
    
    # Wait for the operator to be ready:
    kubectl wait --for=condition=ready pod -l control-plane=controller-manager -n llmkube-system --timeout=60s

    For development/testing from a branch:

    # From the llmkube repository root
    make deploy
    
    # Wait for the operator to be ready:
    kubectl wait --for=condition=ready pod -l control-plane=controller-manager -n llmkube-system --timeout=60s
  5. Metal Agent

    # Build and install from source (Recommended)
    # From llmkube repository
    make install-metal-agent
    
    # Or download pre-built binary from releases page:
    # https://github.com/defilantech/LLMKube/releases/latest
    # Look for: llmkube-metal-agent_*_darwin_*.tar.gz

Verify Setup

# 1. Check Metal support
system_profiler SPDisplaysDataType | grep "Metal"
# Should show: Metal Support: Metal 4 (or Metal 3/2)

# 2. Check llama-server is installed
which llama-server
# Should show: /opt/homebrew/bin/llama-server (Apple Silicon) or /usr/local/bin/llama-server (Intel)

# 3. Check Metal agent is running
launchctl list | grep llmkube
# Should show: com.llmkube.metal-agent

# 3b. Check Metal agent health endpoint
curl http://localhost:9090/healthz
# Should show: ok

# 4. Check minikube is running
minikube status
# Should show: host: Running, kubelet: Running

# 5. Check kubectl works
kubectl get nodes
# Should show your minikube node

Quick Start

Option 1: Deploy from Catalog (Recommended)

# Build the CLI from the repository (if testing from a branch; skip if installed via Homebrew)
make build

# Deploy Llama 3.1 8B with Metal acceleration
./bin/llmkube deploy llama-3.1-8b --accelerator metal

# Output:
# 📚 Using catalog model: Llama 3.1 8B Instruct
# ℹ️  Metal acceleration: Using native llama-server (not containerized)
# 🚀 Deploying LLM inference service
# ...

Note: Pass --accelerator metal explicitly to make sure the deployment runs with Metal acceleration.
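To confirm the Metal path took effect, check that llama-server is running as a native macOS process; Metal deployments use a selectorless Service, so no inference pod is created (these are the same commands used later in the troubleshooting section):

# llama-server should appear as a native host process, not a container
ps aux | grep llama-server

# The InferenceService is served by the host process, so no inference pods are expected
kubectl get inferenceservices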

Option 2: With Custom GPU Settings

# Specify custom GPU layer offloading
./bin/llmkube deploy llama-3.1-8b --accelerator metal --gpu-layers 33

# Or with custom settings
llmkube deploy qwen-2.5-coder-7b \
  --accelerator metal \
  --gpu-layers 33 \
  --memory 8Gi

Option 3: Custom Model

llmkube deploy my-custom-model \
  --accelerator metal \
  --source https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q8_0.gguf \
  --quantization Q8_0 \
  --gpu-layers 28

Monitor the Deployment

# Watch the deployment
kubectl get inferenceservices -w

# Check Metal agent health
curl http://localhost:9090/healthz   # Liveness: "ok"
curl http://localhost:9090/readyz    # Readiness: "ready" when processes are healthy

# Check Metal agent metrics
curl -s http://localhost:9090/metrics | grep llmkube_metal_agent

# Check Metal agent logs
tail -f /tmp/llmkube-metal-agent.log

# Monitor GPU usage
sudo powermetrics --samplers gpu_power -i 1000

Test the Endpoint

Once deployed, test your inference service:

Option 1: Direct Access (Fastest)

# The Metal agent runs llama-server on your Mac (port 8080)
# You can access it directly without Kubernetes
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is 2+2?"}],"max_tokens":20}'

Option 2: Via Kubernetes Service (In-Cluster)

Metal services are selectorless (no pods), so kubectl port-forward will not work. To access the service from within the cluster, use Kubernetes DNS:

# From a pod inside the cluster (note the sanitized name: llama-3-1-8b)
curl http://llama-3-1-8b.default.svc.cluster.local:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is 2+2?"}]}'

You can also test from a temporary pod:

kubectl run curl-test --rm -it --image=curlimages/curl -- \
  curl -s http://llama-3-1-8b.default.svc.cluster.local:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is 2+2?"}]}'

Note: The Kubernetes Service name is sanitized (dots replaced with dashes): llama-3.1-8b → llama-3-1-8b
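If you are unsure what the sanitized name will be, the dot-to-dash substitution is easy to reproduce in the shell (illustrative only; the operator applies the rule for you), or you can simply list the Services it created:

# llama-3.1-8b becomes llama-3-1-8b
echo "llama-3.1-8b" | tr '.' '-'

# List the Services created by the operator
kubectl get svc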

Performance Expectations

On M4 Max (32 GPU cores):

Model           Generation Speed   Prompt Processing   VRAM Usage
Llama 3.2 3B    80-120 tok/s       1000+ tok/s         2-3 GB
Llama 3.1 8B    40-60 tok/s        500-800 tok/s       5-8 GB
Mistral 7B      45-65 tok/s        600-900 tok/s       5-7 GB
Qwen Coder 7B   40-55 tok/s        500-750 tok/s       5-8 GB
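Your numbers will vary with chip generation, memory bandwidth, and quantization. To benchmark a downloaded GGUF file directly, you can use llama.cpp's llama-bench tool (typically installed alongside llama-server by the Homebrew formula); the model path below assumes the default model store used elsewhere in this guide:

# Prompt processing (pp) and token generation (tg) with all layers offloaded to the GPU
llama-bench -m /tmp/llmkube-models/llama-3.1-8b/model.gguf -ngl 99 -p 512 -n 128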

How It Works

┌─────────────────────────────────────────────────┐
│              Your Mac (macOS)                    │
│                                                  │
│  ┌──────────────────────────────────────────┐   │
│  │   Minikube (Kubernetes in VM)            │   │
│  │   ✅ Creates InferenceService CRD        │   │
│  │   ✅ Service points to host              │   │
│  └──────────────────────────────────────────┘   │
│                     ↕                            │
│  ┌──────────────────────────────────────────┐   │
│  │   Metal Agent (Native Process)           │   │
│  │   ✅ Watches K8s for InferenceService    │   │
│  │   ✅ Downloads model from HuggingFace    │   │
│  │   ✅ Spawns llama-server with Metal      │   │
│  └──────────────────────────────────────────┘   │
│                     ↕                            │
│  ┌──────────────────────────────────────────┐   │
│  │   llama-server (Metal Accelerated) 🚀   │   │
│  │   ✅ Runs on localhost:8080+             │   │
│  │   ✅ Direct Metal GPU access             │   │
│  │   ✅ OpenAI-compatible API               │   │
│  └──────────────────────────────────────────┘   │
└─────────────────────────────────────────────────┘

Troubleshooting

Operator pod ImagePullBackOff (when deploying from a branch)

If you see ImagePullBackOff when running make deploy:

# Build the controller image in minikube's Docker environment
eval $(minikube docker-env)
make docker-build

# Delete the failing pod to recreate it
kubectl delete pod -n llmkube-system -l control-plane=controller-manager

# Patch the deployment to use local images
kubectl patch deployment -n llmkube-system llmkube-controller-manager \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","imagePullPolicy":"IfNotPresent"}]}}}}'

Metal agent not starting

# Check logs
cat /tmp/llmkube-metal-agent.log

# Verify llama-server is installed
llama-server --version

# Restart the agent
launchctl unload ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
launchctl load ~/Library/LaunchAgents/com.llmkube.metal-agent.plist

Model rejected (InsufficientMemory)

The Metal Agent checks whether a model fits in memory before starting it. If your InferenceService shows InsufficientMemory:

# Check the status
kubectl get isvc -o wide

# See the detailed reason
kubectl get isvc <name> -o jsonpath='{.status.schedulingMessage}'

Fix options:

  • Use a smaller quantization (Q4_K_M instead of Q8_0)
  • Reduce contextSize in the InferenceService spec
  • Increase the budget: restart the agent with --memory-fraction 0.9
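The exact accounting is internal to the agent, but a rough sanity check is to compare total unified memory (the agent's budget is a fraction of it) against the on-disk model size; actual runtime usage is higher than the file size because of the KV cache and context buffers. The model path below is illustrative:

# Total unified memory in GiB
sysctl -n hw.memsize | awk '{printf "Total RAM: %.1f GiB\n", $1/1073741824}'

# On-disk size of the rejected model
du -h /tmp/llmkube-models/llama-3.1-8b/model.gguf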

Model download slow/failing

# Check disk space
df -h

# Check model store
ls -lh /tmp/llmkube-models/

# Manually download model
curl -L https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
  -o /tmp/llmkube-models/llama-3.1-8b/model.gguf

Service not accessible

Note: Service names are sanitized - dots are replaced with dashes!

# Check service exists (note the sanitized name: llama-3-1-8b)
kubectl get svc llama-3-1-8b

# Check endpoints
kubectl get endpoints llama-3-1-8b

# Check if llama-server is running
ps aux | grep llama-server

# Check Metal agent can reach K8s
kubectl get inferenceservices

Poor performance

# Verify Metal is being used
llama-server --version  # Should show Metal support

# Check GPU utilization
sudo powermetrics --samplers gpu_power -i 1000

# Verify all layers offloaded to GPU
# Check Metal agent logs for "n-gpu-layers"
grep "n-gpu-layers" /tmp/llmkube-metal-agent.log

Cleanup

# Delete the inference service
llmkube delete llama-3.1-8b

# Stop Metal agent
launchctl unload ~/Library/LaunchAgents/com.llmkube.metal-agent.plist

# Or completely uninstall
make uninstall-metal-agent

Next Steps

  • Scale up: Try larger models (Mixtral 8x7B, Llama 70B)
  • Production: Deploy multiple replicas for high availability
  • Integration: Connect to your applications using OpenAI SDK
  • Monitoring: Scrape localhost:9090/metrics with Prometheus for agent health and process metrics
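For the monitoring item above, a minimal Prometheus scrape job (assuming Prometheus runs on the same Mac as the agent) could look like the following; if your config already has a scrape_configs block, merge the job into it rather than appending:

# Add a scrape job for the Metal agent's metrics endpoint
cat <<'EOF' >> prometheus.yml
scrape_configs:
  - job_name: llmkube-metal-agent
    static_configs:
      - targets: ['localhost:9090']
EOF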

Example Applications

Python Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ]
)

print(response.choices[0].message.content)

cURL Example

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a haiku about Kubernetes"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

Support

Performance Tips

  1. Use Q5_K_M quantization - Best balance of quality and speed
  2. Offload all layers - Set --gpu-layers 99 for maximum Metal usage (see the combined example after this list)
  3. Close other apps - Free up unified memory and GPU resources
  4. Monitor temperature - Keep your Mac cool for sustained performance
  5. Use catalog models - Pre-optimized settings for your hardware
  6. Tune memory fraction - On a dedicated inference Mac, use --memory-fraction 0.9 to allow larger models
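Putting tips 1 and 2 together, a custom deployment pinned to Q5_K_M with full GPU offload might look like this (the service name is illustrative; the model URL is the one from the manual-download example above):

llmkube deploy llama-3.1-8b-q5 \
  --accelerator metal \
  --source https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
  --quantization Q5_K_M \
  --gpu-layers 99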

Congratulations! 🎉 You're now running Kubernetes-native LLM inference with Metal acceleration on your Mac!