army is a small GPT-style transformer training stack written from scratch in
C++17. It is meant to keep the whole system understandable: tokenizer, forward
pass, backward pass, optimizer, sampling, gradient checks, and CPU build paths
all live in this repository.
make check
make
./army train
./army train small 500 data/curated_train.txt
./army chat
make check builds a double-precision checker and compares the hand-written
backward pass against central finite differences. make builds the training and
sampling binary for the host platform. Training writes army.bin; ./army chat
loads that checkpoint and opens a prompt-continuation REPL.
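
The comparison that make check performs can be illustrated with a minimal sketch. This is not the checker from the repository: the function name `max_grad_rel_error`, the step size `eps`, and the relative-error formula are assumptions chosen for illustration. It perturbs one parameter at a time in double precision and compares the central-difference slope with a supplied analytic gradient.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper, not taken from the army sources.
// Returns the largest relative error between an analytic gradient and a
// central finite-difference estimate of the same scalar loss.
double max_grad_rel_error(
    const std::function<double(const std::vector<double>&)>& loss,
    std::vector<double> params,                 // copied so we can perturb it
    const std::vector<double>& analytic_grad,
    double eps = 1e-5) {                        // assumed step size
    double worst = 0.0;
    for (std::size_t i = 0; i < params.size(); ++i) {
        const double saved = params[i];
        params[i] = saved + eps;
        const double up = loss(params);
        params[i] = saved - eps;
        const double down = loss(params);
        params[i] = saved;                      // restore before moving on
        // Central difference: (f(x + eps) - f(x - eps)) / (2 * eps).
        const double numeric = (up - down) / (2.0 * eps);
        const double denom = std::max(
            1e-12, std::fabs(numeric) + std::fabs(analytic_grad[i]));
        worst = std::max(worst, std::fabs(numeric - analytic_grad[i]) / denom);
    }
    return worst;
}
```

A check of this kind passes when the returned error stays below a small tolerance for every parameter. Running it in double precision keeps rounding noise well below the error a wrong backward pass would produce.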
On macOS, the default targets use Apple Accelerate for matrix multiplies. The explicit Apple Silicon aliases are also available:
make m1-check
make m1
./army_m1 train
./army_m1 chat
- Decoder-only transformer training on CPU.
- Hand-written forward and backward passes, with no autograd framework.
- Byte-level BPE tokenizer trained from the input corpus.
- AdamW optimizer with warmup plus cosine decay.
- Sampling and prompt continuation from a local checkpoint.
- Host builds for Linux/OpenMP and macOS/Accelerate.
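
As an illustration of the optimizer item above, here is a minimal sketch of a warmup-plus-cosine learning-rate schedule and a textbook AdamW update. The names `lr_at_step` and `adamw_step`, the linear warmup shape, the `min_lr` floor, and the default hyperparameters are assumptions for this example and may differ from what army actually uses.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical schedule: linear warmup to peak_lr, then cosine decay to min_lr.
double lr_at_step(int step, int warmup_steps, int total_steps,
                  double peak_lr, double min_lr) {
    if (warmup_steps > 0 && step < warmup_steps) {
        // Reaches peak_lr on the last warmup step.
        return peak_lr * static_cast<double>(step + 1) / warmup_steps;
    }
    const double kPi = std::acos(-1.0);
    const int decay_steps = std::max(1, total_steps - warmup_steps);
    double progress = static_cast<double>(step - warmup_steps) / decay_steps;
    progress = std::min(1.0, std::max(0.0, progress));
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + std::cos(kPi * progress));
}

// Textbook AdamW with decoupled weight decay; t is the 1-based step count.
// m and v are the running first and second moment buffers, same size as p.
void adamw_step(std::vector<double>& p, const std::vector<double>& g,
                std::vector<double>& m, std::vector<double>& v,
                int t, double lr, double weight_decay,
                double beta1 = 0.9, double beta2 = 0.999, double eps = 1e-8) {
    const double bias1 = 1.0 - std::pow(beta1, t);
    const double bias2 = 1.0 - std::pow(beta2, t);
    for (std::size_t i = 0; i < p.size(); ++i) {
        m[i] = beta1 * m[i] + (1.0 - beta1) * g[i];
        v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] * g[i];
        const double m_hat = m[i] / bias1;      // bias-corrected moments
        const double v_hat = v[i] / bias2;
        // Weight decay is applied to the parameter, not folded into g.
        p[i] -= lr * (m_hat / (std::sqrt(v_hat) + eps) + weight_decay * p[i]);
    }
}
```

A training loop would call `lr_at_step` once per step and pass the result as `lr` to `adamw_step`.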
This is a learning and systems project, not a production language model. The small models can learn local style from a corpus, but they do not have reliable world knowledge.
Each block is:
x = x + attention(rmsnorm(x))
x = x + swiglu(rmsnorm(x))
The model includes:
- RMSNorm pre-normalization.
- Rotary position embeddings applied to queries and keys.
- Grouped-query attention: more query heads than key/value heads.
- SwiGLU feed-forward layers.
- Multi-token prediction heads; generation uses head 0.
- Bias-free linear layers.
- One flat parameter buffer with typed views for model code, optimizer code, and gradient checking.
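
The pieces listed above can be sketched in a few lines each. The following is an illustration only, not code from src/kernels/: the function names, the `eps` value, the RoPE base of 10000, the interleaved pairing of dimensions, and the assumption that consecutive query heads share a key/value head are all choices made for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// RMSNorm: y[i] = x[i] / sqrt(mean(x^2) + eps) * gain[i]. No mean subtraction.
std::vector<float> rmsnorm(const std::vector<float>& x,
                           const std::vector<float>& gain,
                           float eps = 1e-5f) {      // assumed epsilon
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;
    const float inv_rms =
        1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = x[i] * inv_rms * gain[i];
    return y;
}

// SwiGLU gating: silu(gate) * up, elementwise, with silu(z) = z * sigmoid(z).
// In a full layer, gate and up come from two bias-free linear projections.
std::vector<float> swiglu_gate(const std::vector<float>& gate,
                               const std::vector<float>& up) {
    std::vector<float> out(gate.size());
    for (std::size_t i = 0; i < gate.size(); ++i) {
        const float silu = gate[i] / (1.0f + std::exp(-gate[i]));
        out[i] = silu * up[i];
    }
    return out;
}

// RoPE on one head vector (even length): rotate each (2k, 2k+1) pair by an
// angle that grows with position and shrinks with the pair index.
void rope_inplace(std::vector<float>& head, int pos, float base = 10000.0f) {
    const std::size_t d = head.size();
    for (std::size_t i = 0; i + 1 < d; i += 2) {
        const float freq =
            std::pow(base, -static_cast<float>(i) / static_cast<float>(d));
        const float angle = static_cast<float>(pos) * freq;
        const float c = std::cos(angle), s = std::sin(angle);
        const float a = head[i], b = head[i + 1];
        head[i] = a * c - b * s;
        head[i + 1] = a * s + b * c;
    }
}

// Grouped-query attention: several query heads read the same KV head.
// Assumes n_q_heads is a multiple of n_kv_heads.
int kv_head_for_query(int q_head, int n_q_heads, int n_kv_heads) {
    return q_head / (n_q_heads / n_kv_heads);
}
```

Under these assumptions, a model with 4 query heads and 2 key/value heads maps query heads 0 and 1 to key/value head 0, and query heads 2 and 3 to key/value head 1.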
Current presets:
| preset | dim | query heads | kv heads | layers | context | mtp heads | batch | default steps |
|---|---|---|---|---|---|---|---|---|
| small | 128 | 4 | 2 | 4 | 128 | 4 | 32 | 3000 |
| big | 384 | 6 | 2 | 6 | 256 | 4 | 16 | 5000 |
The tokenizer starts from the 256 byte values and learns up to 256 BPE merges from the selected corpus, so the default vocabulary is at most 512 tokens.
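
To make the merge process concrete, here is a minimal sketch of byte-level BPE training. It is not the tokenizer in src/data/: the names `merge_pair` and `train_bpe`, the tie-breaking rule, and the choice to stop once no pair occurs at least twice are assumptions for illustration. Token ids 0 to 255 are the raw bytes, and the merge learned at index m gets id 256 + m.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using Pair = std::pair<int, int>;

// Replace every non-overlapping occurrence of pair p with new_id, left to right.
std::vector<int> merge_pair(const std::vector<int>& ids, Pair p, int new_id) {
    std::vector<int> out;
    for (std::size_t i = 0; i < ids.size();) {
        if (i + 1 < ids.size() && ids[i] == p.first && ids[i + 1] == p.second) {
            out.push_back(new_id);
            i += 2;
        } else {
            out.push_back(ids[i]);
            ++i;
        }
    }
    return out;
}

// Learn up to max_merges merges. ids is rewritten in place with the merged
// token sequence; the returned list holds the merged pairs in learning order.
std::vector<Pair> train_bpe(std::vector<int>& ids, int max_merges) {
    std::vector<Pair> merges;
    for (int m = 0; m < max_merges; ++m) {
        // Count adjacent pairs in the current sequence.
        std::map<Pair, int> counts;
        for (std::size_t i = 0; i + 1 < ids.size(); ++i) {
            ++counts[{ids[i], ids[i + 1]}];
        }
        // Pick the most frequent pair; ties go to the smallest pair because
        // std::map iterates in sorted order and we only replace on '>'.
        Pair best{-1, -1};
        int best_count = 0;
        for (const auto& kv : counts) {
            if (kv.second > best_count) {
                best = kv.first;
                best_count = kv.second;
            }
        }
        if (best_count < 2) break;  // assumed stopping rule: nothing repeats
        ids = merge_pair(ids, best, 256 + m);
        merges.push_back(best);
    }
    return merges;
}
```

With `max_merges` set to 256, the largest id this sketch can assign is 511, which matches the 512-token ceiling described above. A small corpus may run out of repeated pairs earlier and end with a smaller vocabulary.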
./army gradcheck
./army train [small|big] [steps] [corpus]
./army chat
Examples:
./army train small 500 shakespeare.txt
./army train small 500 data/curated_train.txt
./army train big 1000 data/pretrain.txt
The default training command is equivalent to:
./army train small 3000 shakespeare.txt
The default corpus is shakespeare.txt. The data/ directory also includes
small local corpora and scripts for larger experiments.
Build the curated local corpus:
sh data/make_curated.sh
./army train small 500 data/curated_train.txt
Included local source files:
- data/curated_general.txt: compact prose about algorithms, debugging, numerical checks, data cleaning, and systems habits.
- data/textbook_transformer.txt: explanations of the model pieces used here.
- data/nanoeuler_tasks.txt: project-specific instruction examples, commands, troubleshooting notes, review prompts, and continuation seeds.
Larger generated corpora are optional and ignored by git:
sh data/get_gutenberg.sh
sh data/get_web.sh
sh data/get_alpaca.sh
cat data/gutenberg.txt data/web.txt > data/pretrain.txt
data/get_web.sh expects the DuckDB CLI so it can read a FineWeb-Edu parquet
slice without adding a Python dependency.
Linux builds use g++ with OpenMP:
make
make check
macOS builds use clang++ with Accelerate.framework:
make
make check
make m1
make m1-check
make lint runs cpplint over src/ and the root compatibility shim.
src/army.cpp single translation-unit entry for the CPU build
src/app/ CLI modes: train, chat, sampling, gradcheck
src/core/ common types, runtime helpers, RNG
src/data/ byte-level BPE and corpus batching
src/kernels/ attention, RoPE, normalization, linear, loss kernels
src/model/ config, parameter layout, activations, forward/backward
army.cpp compatibility shim that includes src/army.cpp
Makefile host, Apple Silicon, check, lint, and clean targets
data/ local corpora and corpus-generation scripts
shakespeare.txt default tiny training corpus
army is intentionally CPU-first and compact. It is useful for reading,
modifying, and verifying the mechanics of a transformer training loop end to end.
Future work could add a native GPU backend while keeping the current simple CPU
path easy to inspect.