Skip to content

trevin-creator/autoresearch-mlx

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

autoresearch-mlx

Apple Silicon (MLX) port of Karpathy's autoresearch.

Full credit to @karpathy for the core idea: fixed-time autonomous research loops controlled through program.md. This port keeps the same basic rules: one mutable train.py, one metric (val_bpb), a fixed 5-minute training budget, and keep-or-revert via git. It runs natively on Apple Silicon through MLX, so there is no PyTorch or CUDA dependency.

Quick start

Requirements: Apple Silicon Mac, Python 3.10+, uv.

# install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# install dependencies
uv sync

# one-time data + tokenizer prep
uv run prepare.py

# run one 5-minute training experiment
uv run train.py

Then point Claude Code or another coding agent at program.md and let it run the loop.

What matters

  • prepare.py - data prep, tokenizer, dataloader, and evaluation. Treat as fixed.
  • train.py - model, optimizer, and training loop. This is the file the agent edits.
  • program.md - the autonomous experiment protocol.
  • results.tsv - logged experiment history.

The loop is the same as upstream: edit train.py, run a fixed-budget experiment, read val_bpb, keep the change if it wins, revert if it loses, and repeat.

Public baseline results

The public results.tsv captures the initial hardware-local walk from the default baseline down to 1.807902:

Commit val_bpb Status Description
383abb4 2.667000 keep baseline (AdamW, default config)
909dd59 2.588904 keep halve total batch size to 2^16
4161af3 2.533728 keep increase matrix LR to 0.04
5efc7aa 1.807902 keep reduce depth from 8 to 4

That result already shows the core Apple Silicon pattern: with a fixed 5-minute wall clock, smaller faster-training models can beat larger ones simply by fitting more optimizer steps into the budget.

Longer Apple Silicon runs

Longer overnight runs on the working MLX port pushed much further. The long Mac Mini test is included here because it found a meaningfully different winner stack from the Max-class machines.

Machine Current best Starting point Repeated wins
M4 Max #1 1.294526 1.596971 AdamW-only, low matrix LR, 3x MLP, no logit cap, moderate weight decay
M4 Max #2 1.330509 1.807902 leaner batch, long anneal, SiLU, lower regularization, no logit cap
Mac Mini (long run) 1.353329 1.922472 Muon, sharper attention, smaller MLP, lower scalar LR

The Mac Mini result matters because it did not just rediscover the same exact recipe. On smaller Apple Silicon hardware, the strongest changes leaned toward more aggressive step-efficiency wins. Later transfer tests showed some of those Mac Mini findings did not carry cleanly onto the Max baseline, which is exactly the kind of hardware-specific behavior this loop is useful for uncovering.

Differences from upstream

  • MLX instead of PyTorch/CUDA. Native Apple Silicon training with unified memory.
  • AdamW-only public path. This public train.py keeps the default path simple. The long Mac Mini run above explored a Muon variant in the working port, but that branch is not exposed as a public default here.
  • Smaller eval token budget. Reduced for faster iteration on Apple Silicon while keeping the same evaluate_bpb interface in prepare.py.
  • Roughly 6-7 minutes per experiment. Expect 5 minutes of training plus compile and eval overhead.
  • MFU reporting is placeholder. There is no Apple Silicon equivalent to the H100 FLOPs reference used upstream.

Acknowledgments

License

MIT. See LICENSE.

About

Apple Silicon (MLX) port of Karpathy's autoresearch — autonomous AI research loops on Mac, no PyTorch required.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages