Apple Silicon (MLX) port of Karpathy's autoresearch.
Full credit to @karpathy for the core idea: fixed-time autonomous research loops controlled through program.md. This port keeps the same basic rules: one mutable train.py, one metric (val_bpb), a fixed 5-minute training budget, and keep-or-revert via git. It runs natively on Apple Silicon through MLX, so there is no PyTorch or CUDA dependency.
Requirements: Apple Silicon Mac, Python 3.10+, uv.
```sh
# install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# install dependencies
uv sync

# one-time data + tokenizer prep
uv run prepare.py

# run one 5-minute training experiment
uv run train.py
```

Then point Claude Code or another coding agent at `program.md` and let it run the loop.
- `prepare.py` - data prep, tokenizer, dataloader, and evaluation. Treat as fixed.
- `train.py` - model, optimizer, and training loop. This is the file the agent edits.
- `program.md` - the autonomous experiment protocol.
- `results.tsv` - logged experiment history.
The loop is the same as upstream: edit train.py, run a fixed-budget experiment, read val_bpb, keep the change if it wins, revert if it loses, and repeat.
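The keep-or-revert decision is simple enough to sketch. The helper below is illustrative only (the function names and the subprocess-based git calls are assumptions, not code from this repo); the rule is just that lower `val_bpb` wins:

```python
import subprocess

def keep_or_revert(best_bpb: float, new_bpb: float) -> str:
    """Lower val_bpb wins: keep the edit, otherwise revert it."""
    return "keep" if new_bpb < best_bpb else "revert"

def apply_decision(decision: str, message: str) -> None:
    # Sketch of the git side of the loop: commit a win, discard a loss.
    if decision == "keep":
        subprocess.run(["git", "commit", "-am", message], check=True)
    else:
        subprocess.run(["git", "checkout", "--", "train.py"], check=True)
```

For example, `keep_or_revert(2.667000, 2.588904)` returns `"keep"`; a tie or a regression reverts.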
The public results.tsv captures the initial hardware-local walk from the default baseline down to 1.807902:
| Commit | val_bpb | Status | Description |
|---|---|---|---|
| 383abb4 | 2.667000 | keep | baseline (AdamW, default config) |
| 909dd59 | 2.588904 | keep | halve total batch size to 2^16 |
| 4161af3 | 2.533728 | keep | increase matrix LR to 0.04 |
| 5efc7aa | 1.807902 | keep | reduce depth from 8 to 4 |
That result already shows the core Apple Silicon pattern: with a fixed 5-minute wall clock, smaller faster-training models can beat larger ones simply by fitting more optimizer steps into the budget.
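A toy calculation makes the step-budget arithmetic concrete (the per-step timings below are invented for illustration, not measured on any of these machines):

```python
BUDGET_S = 5 * 60  # fixed 5-minute training budget

def steps_in_budget(step_time_s: float) -> int:
    """How many optimizer steps fit in the fixed wall-clock budget."""
    return int(BUDGET_S // step_time_s)

# Hypothetical timings: if halving depth roughly halves the step time,
# the shallower model sees about twice as many optimizer steps.
deep_steps = steps_in_budget(0.90)     # e.g. an 8-layer model
shallow_steps = steps_in_budget(0.45)  # e.g. a 4-layer model
```

With these assumed numbers the 4-layer model gets roughly 2x the optimizer steps, which is why depth reduction can win under a fixed wall clock.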
Longer overnight runs on the working MLX port pushed val_bpb considerably lower. The long Mac Mini test is included here because it converged on a meaningfully different winning recipe than the Max-class machines.
| Machine | Current best | Starting point | Repeated wins |
|---|---|---|---|
| M4 Max #1 | 1.294526 | 1.596971 | AdamW-only, low matrix LR, 3x MLP, no logit cap, moderate weight decay |
| M4 Max #2 | 1.330509 | 1.807902 | leaner batch, long anneal, SiLU, lower regularization, no logit cap |
| Mac Mini (long run) | 1.353329 | 1.922472 | Muon, sharper attention, smaller MLP, lower scalar LR |
The Mac Mini result matters because it did not simply rediscover the same recipe. On smaller Apple Silicon hardware, the strongest changes leaned toward more aggressive step-efficiency wins. Later transfer tests showed some of those Mac Mini findings did not carry cleanly onto the Max baseline, which is exactly the kind of hardware-specific behavior this loop is useful for uncovering.
- MLX instead of PyTorch/CUDA. Native Apple Silicon training with unified memory.
- AdamW-only public path. This public `train.py` keeps the default path simple. The long Mac Mini run above explored a Muon variant in the working port, but that branch is not exposed as a public default here.
- Smaller eval token budget. Reduced for faster iteration on Apple Silicon while keeping the same `evaluate_bpb` interface in `prepare.py`.
- Roughly 6-7 minutes per experiment. Expect 5 minutes of training plus compile and eval overhead.
- MFU reporting is a placeholder. There is no Apple Silicon equivalent of the H100 FLOPs reference used upstream.
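For reference, bits-per-byte is a standard conversion from mean token-level loss. This is the generic formula only, not the actual `evaluate_bpb` implementation from `prepare.py`:

```python
import math

def bits_per_byte(mean_nll_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean per-token NLL (in nats) to bits per byte of raw text."""
    total_bits = mean_nll_nats * num_tokens / math.log(2)  # nats -> bits
    return total_bits / num_bytes
```

As a sanity check: a mean loss of ln(2) nats per token on text with one byte per token is exactly 1.0 bpb. Because the denominator is raw bytes, the metric is comparable across tokenizers.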
- Andrej Karpathy - autoresearch and nanochat
- scasella/nanochat-mlx - MLX GPT and optimizer reference
- awni/picochat - MLX training patterns
- Apple MLX team
MIT. See LICENSE.