Apple Silicon (MLX) port of Karpathy's autoresearch.
Full credit to @karpathy for the core idea: fixed-time autonomous research loops controlled through program.md. This port keeps the same basic rules: one mutable train.py, one metric (val_bpb), a fixed 5-minute training budget, and keep-or-revert via git. It runs natively on Apple Silicon through MLX, so there is no PyTorch or CUDA dependency.
Requirements: Apple Silicon Mac, Python 3.10+, uv.
```sh
# install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# install dependencies
uv sync

# one-time data + tokenizer prep
uv run prepare.py

# run one 5-minute training experiment
uv run train.py
```

Then point Claude Code or another coding agent at `program.md` and let it run the loop.
- `prepare.py` - data prep, tokenizer, dataloader, and evaluation. Treat as fixed.
- `train.py` - model, optimizer, and training loop. This is the file the agent edits.
- `program.md` - the autonomous experiment protocol.
- `results.tsv` - logged experiment history.
The loop is the same as upstream: edit train.py, run a fixed-budget experiment, read val_bpb, keep the change if it wins, revert if it loses, and repeat.
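The keep-or-revert decision is simple enough to sketch. The helper below is illustrative only (the function names and the subprocess-based git calls are assumptions, not code from this repo); the rule is just that lower `val_bpb` wins:

```python
import subprocess

def keep_or_revert(best_bpb: float, new_bpb: float) -> str:
    """Lower val_bpb wins: keep the edit, otherwise revert it."""
    return "keep" if new_bpb < best_bpb else "revert"

def apply_decision(decision: str, message: str) -> None:
    # Sketch of the git side of the loop: commit a win, discard a loss.
    if decision == "keep":
        subprocess.run(["git", "commit", "-am", message], check=True)
    else:
        subprocess.run(["git", "checkout", "--", "train.py"], check=True)
```

For example, `keep_or_revert(2.667000, 2.588904)` returns `"keep"`; a tie or a regression reverts.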
The public results.tsv captures the initial hardware-local walk from the default baseline down to 1.807902:
| Commit | val_bpb | Status | Description |
|---|---|---|---|
| 383abb4 | 2.667000 | keep | baseline (AdamW, default config) |
| 909dd59 | 2.588904 | keep | halve total batch size to 2^16 |
| 4161af3 | 2.533728 | keep | increase matrix LR to 0.04 |
| 5efc7aa | 1.807902 | keep | reduce depth from 8 to 4 |
That result already shows the core Apple Silicon pattern: with a fixed 5-minute wall clock, smaller faster-training models can beat larger ones simply by fitting more optimizer steps into the budget.
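A toy calculation makes the step-budget arithmetic concrete (the per-step timings below are invented for illustration, not measured on any of these machines):

```python
BUDGET_S = 5 * 60  # fixed 5-minute training budget

def steps_in_budget(step_time_s: float) -> int:
    """How many optimizer steps fit in the fixed wall-clock budget."""
    return int(BUDGET_S // step_time_s)

# Hypothetical timings: if halving depth roughly halves the step time,
# the shallower model sees about twice as many optimizer steps.
deep_steps = steps_in_budget(0.90)     # e.g. an 8-layer model
shallow_steps = steps_in_budget(0.45)  # e.g. a 4-layer model
```

With these assumed numbers the 4-layer model gets roughly 2x the optimizer steps, which is why depth reduction can win under a fixed wall clock.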
Longer overnight runs on the working MLX port pushed val_bpb considerably lower. The long Mac Mini test is included here because it converged on a meaningfully different winning recipe than the Max-class machines.
| Machine | Current best | Starting point | Repeated wins |
|---|---|---|---|
| M4 Max #1 | 1.294526 | 1.596971 | AdamW-only, low matrix LR, 3x MLP, no logit cap, moderate weight decay |
| M4 Max #2 | 1.330509 | 1.807902 | leaner batch, long anneal, SiLU, lower regularization, no logit cap |
| Mac Mini (long run) | 1.353329 | 1.922472 | Muon, sharper attention, smaller MLP, lower scalar LR |
The Mac Mini result matters because it did not simply rediscover the same recipe. On smaller Apple Silicon hardware, the strongest changes leaned toward more aggressive step-efficiency wins. Later transfer tests showed some of those Mac Mini findings did not carry cleanly onto the Max baseline, which is exactly the kind of hardware-specific behavior this loop is useful for uncovering.
- MLX instead of PyTorch/CUDA. Native Apple Silicon training with unified memory.
- AdamW-only public path. This public `train.py` keeps the default path simple. The long Mac Mini run above explored a Muon variant in the working port, but that branch is not exposed as a public default here.
- Smaller eval token budget. Reduced for faster iteration on Apple Silicon while keeping the same `evaluate_bpb` interface in `prepare.py`.
- Roughly 6-7 minutes per experiment. Expect 5 minutes of training plus compile and eval overhead.
- MFU reporting is a placeholder. There is no Apple Silicon equivalent of the H100 FLOPs reference used upstream.
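For reference, bits-per-byte is a standard conversion from mean token-level loss. This is the generic formula only, not the actual `evaluate_bpb` implementation from `prepare.py`:

```python
import math

def bits_per_byte(mean_nll_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean per-token NLL (in nats) to bits per byte of raw text."""
    total_bits = mean_nll_nats * num_tokens / math.log(2)  # nats -> bits
    return total_bits / num_bytes
```

As a sanity check: a mean loss of ln(2) nats per token on text with one byte per token is exactly 1.0 bpb. Because the denominator is raw bytes, the metric is comparable across tokenizers.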
- Andrej Karpathy - autoresearch and nanochat
- scasella/nanochat-mlx - MLX GPT and optimizer reference
- awni/picochat - MLX training patterns
- Apple MLX team
MIT. See LICENSE.