zeuzmakessoftware/army

army

army is a small GPT-style transformer training stack written from scratch in C++17. It is meant to keep the whole system understandable: tokenizer, forward pass, backward pass, optimizer, sampling, gradient checks, and CPU build paths all live in this repository.

Quick start

make check
make

./army train
./army train small 500 data/curated_train.txt
./army chat

make check builds a double-precision checker and compares the hand-written backward pass against central finite differences. make builds the training and sampling binary for the host platform. Training writes army.bin; ./army chat loads that checkpoint and opens a prompt-continuation REPL.
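The idea behind a central finite-difference check can be sketched as follows. This is a minimal illustration, not the checker in this repository: `LossFn`, `max_grad_error`, the `eps` value, and the relative-error formula are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical loss signature for this sketch: parameters in, scalar loss out.
using LossFn = std::function<double(const std::vector<double>&)>;

// Returns the largest relative error between an analytic gradient and the
// central difference (L(p + eps) - L(p - eps)) / (2 * eps), one parameter at a time.
double max_grad_error(const LossFn& loss, std::vector<double> params,
                      const std::vector<double>& analytic_grad,
                      double eps = 1e-5) {
  assert(params.size() == analytic_grad.size());
  double worst = 0.0;
  for (std::size_t i = 0; i < params.size(); ++i) {
    const double saved = params[i];
    params[i] = saved + eps;
    const double up = loss(params);
    params[i] = saved - eps;
    const double down = loss(params);
    params[i] = saved;  // restore before moving to the next parameter
    const double numeric = (up - down) / (2.0 * eps);
    // Guard the denominator so two near-zero gradients do not divide by zero.
    const double denom =
        std::max(1e-12, std::fabs(numeric) + std::fabs(analytic_grad[i]));
    worst = std::max(worst, std::fabs(numeric - analytic_grad[i]) / denom);
  }
  return worst;
}
```

Double precision matters here: the difference `up - down` loses several digits to cancellation, which is presumably why the checker is built separately from the training binary.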

On macOS, the default targets use Apple Accelerate for matrix multiplies. Explicit Apple Silicon aliases are also available:

make m1-check
make m1
./army_m1 train
./army_m1 chat

What is implemented

  • Decoder-only transformer training on CPU.
  • Hand-written forward and backward passes, with no autograd framework.
  • Byte-level BPE tokenizer trained from the input corpus.
  • AdamW optimizer with warmup plus cosine decay.
  • Sampling and prompt continuation from a local checkpoint.
  • Host builds for Linux/OpenMP and macOS/Accelerate.

This is a learning and systems project, not a production language model. The small models can learn local style from a corpus, but they do not have reliable world knowledge.
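For readers new to the optimizer listed above, here is a rough sketch of one AdamW update and a warmup-plus-cosine learning-rate schedule. The names `lr_at`, `AdamWConfig`, and `adamw_step`, the linear warmup shape, and all hyperparameter defaults are assumptions for illustration; the repository's own code may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed schedule: linear warmup to peak_lr, then cosine decay to min_lr.
double lr_at(int step, int total_steps, int warmup_steps,
             double peak_lr, double min_lr) {
  if (step < warmup_steps) return peak_lr * (step + 1) / warmup_steps;
  if (step >= total_steps) return min_lr;
  const double pi = 3.14159265358979323846;
  const double progress =
      double(step - warmup_steps) / double(total_steps - warmup_steps);
  return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + std::cos(pi * progress));
}

// Illustrative defaults, not values taken from the repository.
struct AdamWConfig {
  double beta1 = 0.9;
  double beta2 = 0.999;
  double eps = 1e-8;
  double weight_decay = 0.1;
};

// One AdamW step over a flat buffer. t is the 1-based step count.
// Weight decay is decoupled: it acts on the weight, not on the gradient.
void adamw_step(std::vector<float>& w, const std::vector<float>& g,
                std::vector<float>& m, std::vector<float>& v,
                int t, double lr, const AdamWConfig& cfg) {
  assert(t >= 1);
  assert(w.size() == g.size() && w.size() == m.size() && w.size() == v.size());
  const double bc1 = 1.0 - std::pow(cfg.beta1, t);  // bias corrections
  const double bc2 = 1.0 - std::pow(cfg.beta2, t);
  for (std::size_t i = 0; i < w.size(); ++i) {
    m[i] = float(cfg.beta1 * m[i] + (1.0 - cfg.beta1) * g[i]);
    v[i] = float(cfg.beta2 * v[i] + (1.0 - cfg.beta2) * g[i] * g[i]);
    const double m_hat = m[i] / bc1;
    const double v_hat = v[i] / bc2;
    w[i] -= float(lr * (m_hat / (std::sqrt(v_hat) + cfg.eps) +
                        cfg.weight_decay * w[i]));
  }
}
```

A training loop would call `lr_at(step, ...)` once per step and pass the result to `adamw_step` with `t = step + 1`.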

Architecture

Each block is:

x = x + attention(rmsnorm(x))
x = x + swiglu(rmsnorm(x))
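The two non-attention pieces of that block are small enough to sketch for a single token vector. This is an assumption-laden illustration: `rmsnorm`, `silu`, `swiglu_gate`, and the `eps` default are names and values chosen for the example, and the linear projections around the gate are left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// RMSNorm: y[i] = x[i] / sqrt(mean(x^2) + eps) * gain[i].
// Unlike LayerNorm there is no mean subtraction and no bias term.
std::vector<float> rmsnorm(const std::vector<float>& x,
                           const std::vector<float>& gain,
                           float eps = 1e-5f) {
  assert(!x.empty() && x.size() == gain.size());
  float sum_sq = 0.0f;
  for (float v : x) sum_sq += v * v;
  const float inv_rms = 1.0f / std::sqrt(sum_sq / float(x.size()) + eps);
  std::vector<float> y(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) y[i] = x[i] * inv_rms * gain[i];
  return y;
}

// SiLU (swish) activation: z * sigmoid(z).
float silu(float z) { return z / (1.0f + std::exp(-z)); }

// The gating step of SwiGLU: silu(gate[i]) * up[i]. In a full feed-forward
// layer, gate and up come from two linear projections of the normalized input,
// and a third projection maps the result back to the model dimension.
std::vector<float> swiglu_gate(const std::vector<float>& gate,
                               const std::vector<float>& up) {
  assert(gate.size() == up.size());
  std::vector<float> out(gate.size());
  for (std::size_t i = 0; i < gate.size(); ++i) out[i] = silu(gate[i]) * up[i];
  return out;
}
```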

The model includes:

  • RMSNorm pre-normalization.
  • Rotary position embeddings applied to queries and keys.
  • Grouped-query attention: more query heads than key/value heads.
  • SwiGLU feed-forward layers.
  • Multi-token prediction heads; generation uses head 0.
  • Bias-free linear layers.
  • One flat parameter buffer with typed views for model code, optimizer code, and gradient checking.
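Two of the items above, rotary embeddings and grouped-query attention, come down to a few lines of index arithmetic. The sketch below assumes the adjacent-pair RoPE layout and a base of 10000; the repository may use a different pairing or base, and `rope_inplace` and `kv_head_for` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rotates each adjacent pair (head[i], head[i + 1]) of one query or key head
// by the angle pos * base^(-i / head_dim), for even i.
void rope_inplace(std::vector<float>& head, int pos, float base = 10000.0f) {
  const std::size_t d = head.size();
  assert(d % 2 == 0);
  for (std::size_t i = 0; i + 1 < d; i += 2) {
    const float theta = float(pos) * std::pow(base, -float(i) / float(d));
    const float c = std::cos(theta);
    const float s = std::sin(theta);
    const float a = head[i];
    const float b = head[i + 1];
    head[i] = a * c - b * s;
    head[i + 1] = a * s + b * c;
  }
}

// Grouped-query attention: consecutive query heads share one key/value head.
int kv_head_for(int q_head, int n_q_heads, int n_kv_heads) {
  assert(n_kv_heads > 0 && n_q_heads % n_kv_heads == 0);
  return q_head / (n_q_heads / n_kv_heads);
}
```

With 4 query heads and 2 key/value heads, query heads 0 and 1 read key/value head 0, and query heads 2 and 3 read key/value head 1. Because the rotation is a pure 2D rotation per pair, it preserves vector norms, and position 0 leaves the vector unchanged.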

Current presets:

preset  dim  query heads  kv heads  layers  context  mtp heads  batch  default steps
small   128  4            2         4       128      4          32     3000
big     384  6            2         6       256      4          16     5000

The tokenizer starts from the 256 byte values and learns up to 256 BPE merges from the selected corpus, so the default vocabulary is at most 512 tokens.
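A toy version of byte-level BPE training shows where the 512 ceiling comes from: ids 0 to 255 are raw bytes, and each learned merge adds one id starting at 256. The helper names, the tie-breaking rule, and the rule of stopping when no pair occurs twice are assumptions for this sketch, not details taken from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using Pair = std::pair<int, int>;

// Replaces every non-overlapping occurrence of `pair`, scanning left to right.
std::vector<int> merge_pair(const std::vector<int>& ids, Pair pair, int new_id) {
  std::vector<int> out;
  out.reserve(ids.size());
  for (std::size_t i = 0; i < ids.size();) {
    if (i + 1 < ids.size() && ids[i] == pair.first && ids[i + 1] == pair.second) {
      out.push_back(new_id);
      i += 2;
    } else {
      out.push_back(ids[i]);
      i += 1;
    }
  }
  return out;
}

// Learns up to max_merges merges over `ids` (initially raw byte values) and
// rewrites `ids` in place. Ties go to the smallest pair in std::map order.
std::vector<Pair> train_bpe(std::vector<int>& ids, int max_merges) {
  std::vector<Pair> merges;
  while (static_cast<int>(merges.size()) < max_merges) {
    std::map<Pair, int> counts;
    for (std::size_t i = 0; i + 1 < ids.size(); ++i) {
      ++counts[{ids[i], ids[i + 1]}];
    }
    Pair best{-1, -1};
    int best_count = 1;  // a pair must occur at least twice to be merged
    for (const auto& [pair, n] : counts) {
      if (n > best_count) {
        best = pair;
        best_count = n;
      }
    }
    if (best_count < 2) break;
    const int new_id = 256 + static_cast<int>(merges.size());
    ids = merge_pair(ids, best, new_id);
    merges.push_back(best);
  }
  return merges;
}
```

The vocabulary size after training is `256 + merges.size()`, so a cap of 256 merges gives at most 512 tokens, and a small or repetitive corpus can end with fewer.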

Commands

./army gradcheck
./army train [small|big] [steps] [corpus]
./army chat

Examples:

./army train small 500 shakespeare.txt
./army train small 500 data/curated_train.txt
./army train big 1000 data/pretrain.txt

The default training command is equivalent to:

./army train small 3000 shakespeare.txt

Data

The default corpus is shakespeare.txt. The data/ directory also includes small local corpora and scripts for larger experiments.

Build the curated local corpus:

sh data/make_curated.sh
./army train small 500 data/curated_train.txt

Included local source files:

  • data/curated_general.txt - compact prose about algorithms, debugging, numerical checks, data cleaning, and systems habits.
  • data/textbook_transformer.txt - explanations of the model pieces used here.
  • data/nanoeuler_tasks.txt - project-specific instruction examples, commands, troubleshooting notes, review prompts, and continuation seeds.

Larger generated corpora are optional and ignored by git:

sh data/get_gutenberg.sh
sh data/get_web.sh
sh data/get_alpaca.sh
cat data/gutenberg.txt data/web.txt > data/pretrain.txt

data/get_web.sh expects the DuckDB CLI so it can read a FineWeb-Edu parquet slice without adding a Python dependency.

Build notes

Linux builds use g++ with OpenMP:

make
make check

macOS builds use clang++ with Accelerate.framework:

make
make check
make m1
make m1-check

make lint runs cpplint over src/ and the root compatibility shim.

Project layout

src/army.cpp              single translation-unit entry for the CPU build
src/app/                  CLI modes: train, chat, sampling, gradcheck
src/core/                 common types, runtime helpers, RNG
src/data/                 byte-level BPE and corpus batching
src/kernels/              attention, RoPE, normalization, linear, loss kernels
src/model/                config, parameter layout, activations, forward/backward
army.cpp                  compatibility shim that includes src/army.cpp
Makefile                  host, Apple Silicon, check, lint, and clean targets
data/                     local corpora and corpus-generation scripts
shakespeare.txt           default tiny training corpus

Current scope

army is intentionally CPU-first and compact. It is useful for reading, modifying, and verifying the mechanics of a transformer training loop end to end. Future work could add a native GPU backend while keeping the current simple CPU path easy to inspect.
