xchplot2

GPU plotter for Chia v2 proofs of space (CHIP-48). Produces farmable .plot2 files byte-identical to the pos2-chip CPU reference.

Quick start

# Install — needs CUDA Toolkit 12+, CMake ≥ 3.24, a C++20 compiler,
# and Rust. NVIDIA only.
cargo install --git https://github.com/Jsewill/xchplot2 --branch cuda-only

# Plot — 10 × k=28 files, keys derived internally from your BLS pair.
xchplot2 plot -k 28 -n 10 \
    -f <farmer-pk-hex> \
    -c <pool-contract-xch1-or-txch1> \
    -o /mnt/plots

# Multi-GPU — one worker per GPU, round-robin partition.
# (`--devices all` adds a CPU worker too; `--devices gpu` sticks to GPUs.)
xchplot2 plot ... --devices gpu

See Hardware compatibility for GPU / VRAM / OS requirements, Build for alternative install paths, and Use for every flag. Windows users: that cargo install line works as-is from an x64 Native Tools Command Prompt for VS 2022 — see Windows (experimental) for the prereqs (Windows SDK, LIB setup, LNK1181 troubleshooting).

Hardware compatibility

GPU: NVIDIA, compute capability ≥ 5.0 (Maxwell / GTX 750-class and newer). Builds auto-detect the installed GPU's compute_cap via nvidia-smi; override with $CUDA_ARCHITECTURES for fat or cross-target builds (see Build). Pre-sm_53 cards lack native FP16 ALUs, but cuda_fp16.h falls back to fp32 emulation for the half-precision intrinsics — any kernel paths touching FP16 still work correctly, with the emulation cost. The AES + match kernels at the heart of plotting are integer-only and see no FP16 penalty.
VRAM: ~1.1 GiB minimum at k=28. Cards with < 15 GB free use the streaming pipeline (four sub-tiers — plain ~7.4 GiB, compact ~5.3 GiB, minimal ~3.7 GiB, tiny ~1.1 GiB — auto-picked by free VRAM); 16 GB+ cards use the persistent buffer pool for faster steady-state. All paths produce byte-identical plots. Detailed breakdown in VRAM.

With --devices, each worker picks its own pool-vs-streaming path from its own GPU's free VRAM — heterogeneous rigs (e.g. one 16 GB + one 8 GB card) plot concurrently with each device on its matching path. The <id>:<tier> suffix on --devices (see Per-GPU streaming tier) overrides the auto-pick per GPU, useful when a card is also serving the desktop and needs more headroom than the picker would leave.
PCIe: Gen4 x16 or wider recommended. A physically narrower slot (e.g. Gen4 x4) adds ~240 ms per plot to the final fragment D2H copy; check cat /sys/bus/pci/devices/*/current_link_width under load if throughput looks off.
Host RAM: ≥ 16 GB recommended; batch mode pins ~4 GB of host memory for D2H double-buffering (pool or streaming).
CUDA Toolkit: 12+ required to build (tested on 13.x). The toolkit-vs-arch matrix:
- sm_50 – sm_72 (Maxwell / Pascal / Volta): need CUDA 12.9 (last toolkit with codegen for these arches — 13.x dropped them entirely). build.rs catches the 13.x + old-arch pairing in a preflight and points at the fix path.
- sm_75 – sm_90 (Turing / Ampere / Hopper): 12.x or 13.x both work.
- sm_120 (RTX 50-series Blackwell): need 12.8+; earlier toolkits lack Blackwell codegen.
CPU architecture: x86_64 is the tested path. aarch64 is also supported for NVIDIA ARM platforms — Jetson Orin (sm_87), IGX Orin, and Grace Hopper / GH200 (sm_90, SBSA). build.rs picks sm_87 as the aarch64 fallback arch when nvidia-smi isn't available, and searches the JetPack (targets/aarch64-linux/lib) and SBSA (targets/sbsa-linux/lib) CUDA library layouts. Apple Silicon is not supported (no CUDA on macOS).
OS: Linux (tested on modern glibc distributions) is the supported path. Windows builds are possible via MSVC + CUDA — see Windows (experimental) below. macOS is not supported (no CUDA).
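The toolkit-vs-arch matrix in the list above can be written as a small lookup. The following is a minimal Python sketch for illustration only; it is not taken from build.rs. The function name `toolkit_supports` and the `(major, minor)` tuple convention are hypothetical, and reading "need CUDA 12.9" for sm_50 to sm_72 as "any 12.x toolkit, with 12.9 being the last" is an assumption.

```python
from typing import Optional, Tuple

# Hypothetical helper encoding the toolkit-vs-arch matrix from this README.
# Assumption: sm_50..sm_72 are treated as buildable on any 12.x toolkit.


def toolkit_supports(sm: int, toolkit: Tuple[int, int]) -> Optional[bool]:
    """Return True/False per the matrix, or None for arches it does not list."""
    if toolkit < (12, 0):
        return False  # CUDA 12+ is required to build at all
    major, _minor = toolkit
    if 50 <= sm <= 72:
        # Maxwell / Pascal / Volta: 13.x dropped codegen for these arches.
        return major == 12
    if 75 <= sm <= 90:
        # Turing / Ampere / Hopper: 12.x or 13.x both work.
        return major in (12, 13)
    if sm == 120:
        # RTX 50-series Blackwell needs 12.8 or newer.
        return toolkit >= (12, 8)
    return None  # arch not covered by the matrix
```

For example, `toolkit_supports(61, (13, 0))` returns `False`, which is the pairing the build.rs preflight is described as catching.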

Build

Requires CUDA Toolkit 12.0+ (12.0 is the floor — cudaGetDeviceProperties_v2, the v2 ABI we link, and CUDA C++20 dialect all need 12.0; 12.9 is the newest tested), C++20 host compiler, CMake ≥ 3.26 (3.26+ knows how to drive nvcc 12.5+; lower works for older nvcc), and a Rust toolchain new enough to parse edition2024 (rustc ≥ 1.85, i.e. rustup stable; most distro-packaged Rust is too old).

Verified install matrix

Distro	CUDA source	CMake source	Rust source
Ubuntu 24.04	apt `nvidia-cuda-toolkit` (12.0)	apt `cmake` (3.28)	rustup `stable`
Ubuntu 24.04	NVIDIA apt repo `cuda-toolkit-12-9`	apt `cmake` (3.28)	rustup `stable`
Ubuntu 22.04	NVIDIA apt repo `cuda-toolkit-12-9`	Kitware apt `cmake`	rustup `stable`
Debian 12 (Bookworm)	NVIDIA apt repo `cuda-toolkit-12-9`	Kitware apt `cmake`	rustup `stable`
Fedora 41	NVIDIA dnf repo `cuda-toolkit-12-9`	dnf `cmake` (3.30)	rustup `stable`
Rocky / Alma / RHEL 9	NVIDIA dnf repo `cuda-toolkit-12-9`	dnf `cmake` (3.26)	rustup `stable`
Arch / CachyOS	pacman `cuda` (12.x)	pacman `cmake`	pacman `rust` or rustup

Combinations that don't work on a stock install:

Ubuntu 22.04 + apt CUDA: ships CUDA 11.5 — nvcc too old for the C++20 dialect we use, and the v1-ABI libcudart lacks cudaGetDeviceProperties_v2. Use NVIDIA's apt repo instead.
Debian 12 + apt CUDA + apt CMake: stock CMake 3.25 doesn't know how to drive nvcc 12.5+. Use Kitware's CMake apt repo.
Ubuntu 22.04/24.04 + apt cargo: distro-packaged Rust (1.75) can't parse edition2024 required by the chia-client 0.42 dep tree. Install rustup instead.
WSL: works the same as native. The only WSL-specific bit is the libcuda.so injection at /usr/lib/wsl/lib (driver, not runtime). Install the toolkit + rustup inside the WSL distro.

`cargo install`

# rustup, if not already installed (apt/dnf cargo is too old)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

cargo install --git https://github.com/Jsewill/xchplot2

The CUDA runtime is statically linked into the binary, so users don't need any libcudart.so version pinning at runtime, and there's no class of "wrong libcudart on linker path" install failures regardless of how mixed the user's previous CUDA installs are. The binary is ~1 MB larger (3.8 MB vs 2.8 MB) for that property.

build.rs auto-detects the local GPU's compute capability by querying nvidia-smi --query-gpu=compute_cap and builds for only that architecture. That keeps the binary small and the build fast when the install and the target GPU are the same machine.

If auto-detection fails (no nvidia-smi in PATH, or nvidia-smi can't see a GPU — common when building inside a container or on a headless build host that lacks the CUDA driver), the build falls back to sm_89.

If you need to target a GPU that isn't the one doing the build — or if you want a single "fat build" binary that covers multiple architectures — override with $CUDA_ARCHITECTURES:

# Fat build for Ada (4090) and Blackwell (5090):
CUDA_ARCHITECTURES="89;120" cargo install --git https://github.com/Jsewill/xchplot2

# Single target (e.g. Turing 2080 Ti):
CUDA_ARCHITECTURES=75 cargo install --git https://github.com/Jsewill/xchplot2

Common values: 52 GTX 9-series (Maxwell, needs CUDA 12.9 toolkit), 61 GTX 10-series, 70 Volta, 75 Turing, 80 A100, 86 RTX 30-series, 89 RTX 40-series, 90 H100, 120 RTX 50-series.
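The selection order described in this section (explicit override, then nvidia-smi detection, then the sm_89 fallback) could look roughly like the Python sketch below. This is an illustration, not the actual build.rs logic, which is Rust. The function name `resolve_cuda_arch` is hypothetical, the nvidia-smi output is passed in as a string rather than queried, and the aarch64 fallback to sm_87 is deliberately left out.

```python
from typing import Mapping, Optional

FALLBACK_ARCH = "89"  # the README's stated fallback when detection fails


def resolve_cuda_arch(env: Mapping[str, str], smi_output: Optional[str]) -> str:
    """Pick the arch list: override first, then nvidia-smi, then fallback.

    smi_output stands in for the text printed by
    `nvidia-smi --query-gpu=compute_cap`; pass None if the query failed.
    """
    override = env.get("CUDA_ARCHITECTURES", "").strip()
    if override:
        return override  # e.g. "75", or the fat-build list "89;120"
    if smi_output:
        for line in smi_output.splitlines():
            # Accept lines shaped like "8.9"; skip headers and blank lines.
            major, dot, minor = line.strip().partition(".")
            if dot and major.isdigit() and minor.isdigit():
                return major + minor  # "8.9" becomes "89"
    return FALLBACK_ARCH
```

The first GPU line wins in this sketch; how the real build treats hosts with several different GPUs is not covered here.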

CMake (also builds the parity tests)

cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

pos2-chip is auto-fetched via FetchContent; override with -DPOS2_CHIP_DIR=/abs/path/to/pos2-chip to point at a local checkout.

Outputs:

build/tools/xchplot2/xchplot2
build/tools/parity/{aes,xs,t1,t2,t3}_parity — bit-exact CPU/GPU tests

Container (`podman compose` or `docker compose`)

The CUDA Toolkit + Rust toolchain live inside the image — the host only needs an engine plus nvidia-container-toolkit for GPU pass-through. scripts/install-container-deps.sh installs both, then scripts/build-container.sh probes nvidia-smi for the right CUDA_ARCH and runs compose build:

./scripts/install-container-deps.sh    # one-time: podman + nvidia-container-toolkit + CDI
./scripts/build-container.sh           # auto-pins CUDA 12.9 base on pre-Turing rigs
podman compose run --rm cuda plot -k 28 -n 10 \
    -f <farmer-pk> -c <pool-contract> -o /out

Plot files land in ./plots/ on the host. compose.yaml uses CDI shorthand (devices: - nvidia.com/gpu=all) so the runtime path is podman-first; bare docker run --gpus all still works after install-container-deps.sh --engine docker, but the docker compose run step won't see the GPU.

Windows (experimental)

This branch is CUDA-only, so a Windows build needs nothing beyond the standard NVIDIA toolchain; no SYCL runtime is required. The code has only one POSIX-specific site (Cancel.cpp), and it is already guarded by #if defined(__unix__). This path is untested, so please file an issue with your results.

Prerequisites:

Windows 10 21H2+ or Windows 11, x64
Visual Studio 2022 Community with the "Desktop development with C++" workload. That workload bundles MSVC + the Windows SDK; the SDK is non-optional because it ships kernel32.lib / user32.lib / etc. that link.exe consumes. If you've trimmed the installer to "C++ build tools" only, open Visual Studio Installer → Modify → Individual components and tick the latest Windows 11 SDK before retrying.
CUDA Toolkit 12.0+ — install after Visual Studio so the CUDA installer wires up the MSBuild integration. 12.8+ required for RTX 50-series (Blackwell, sm_120).
Rust using the MSVC toolchain (rustup default stable-x86_64-pc-windows-msvc)
CMake 3.24+ and Git for Windows

Launch the x64 Native Tools Command Prompt for VS 2022 from the Start menu — there are several similarly-named prompts (x86 / x86_64 / 2019 / 2022); the one that matters is the x64 for 2022. That prompt is the one that sets LIB, INCLUDE, and PATH so cl.exe, link.exe, nvcc, and cmake all see each other plus the Windows SDK. A plain cmd / PowerShell / Windows Terminal tab does not do this — running cargo install from one of those produces LNK1181: cannot open input file 'kernel32.lib' at the first link step.

Quick sanity check in the prompt:

where link.exe
echo %LIB%

%LIB% should include a ...\Windows Kits\10\Lib\...\um\x64 entry. If it doesn't, you're in the wrong prompt or the Windows SDK component isn't installed.

Build:

set CUDA_ARCHITECTURES=89
cargo install --git https://github.com/Jsewill/xchplot2 --branch cuda-only

Or for a local checkout you can iterate on:

git clone -b cuda-only https://github.com/Jsewill/xchplot2
cd xchplot2
set CUDA_ARCHITECTURES=89
cargo install --path .

Set CUDA_ARCHITECTURES to match your card (see the list above). PowerShell users: use $env:CUDA_ARCHITECTURES = "89" instead of set. The CMake path (cmake -B build -S . && cmake --build build) also works inside the same Native Tools prompt if you prefer that over cargo install.

Use

Standalone (farmable plots)

xchplot2 plot -k 28 -n 10 \
    -f <farmer-pk> \
    -c <pool-contract-address> \
    -o <output-dir>

Pool variants: -p <pool-pk> or --pool-ph <pool-ph>. Other common flags: -s <strength>, -T testnet, -S <seed> for reproducible runs, -v verbose. Full help: xchplot2 -h.

Grouping plots: `-i <plot-index>` and `-g <meta-group>`

Both are v2 PoS fields and default to 0. <plot-index> (u16) is the within-group identifier; plot -n N uses it as the base and increments per plot (so -i 0 -n 1000 produces plots with plot_index 0..999). <meta-group> (u8) is a challenge-isolation boundary — plots with different meta_group values are guaranteed never to pass the same challenge.

The PoS2 spec defines a grouped-plot file layout (multiple plots interleaved into one container per storage device, for harvester seek amortization), but the on-disk format is not yet defined upstream in pos2-chip / chia-rs. Until those upstream decisions are made, xchplot2 produces one .plot2 file per plot. When the grouped layout lands, the auto-incrementing <plot-index> above is the per-plot within-group identifier it will expect.
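The auto-increment rule for <plot-index> can be illustrated with a short Python sketch. The helper name `plot_indices` is hypothetical, and raising on a range that exceeds u16 is this sketch's own choice; how xchplot2 itself handles that case is not stated in this README.

```python
U16_MAX = 0xFFFF  # plot_index is a u16


def plot_indices(base: int, count: int) -> range:
    """Indices assigned by `plot -i base -n count`: base, base + 1, ...

    Assumption: overflow past u16 is treated as an error here.
    """
    if not 0 <= base <= U16_MAX:
        raise ValueError("plot index must fit in a u16")
    if count < 1:
        raise ValueError("count must be at least 1")
    if base + count - 1 > U16_MAX:
        raise ValueError("index range would exceed u16")
    return range(base, base + count)
```

With `-i 0 -n 1000` this yields the indices 0 through 999, matching the example in the text.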

Multi-GPU: `--devices`

xchplot2 devices prints id, name, VRAM, SM count, and compute capability for every visible CUDA device, plus the host CPU plotter row. Use the printed [N] / [cpu] index with --devices:

$ xchplot2 devices
Visible devices (1 GPU + 1 CPU):
  [0]   NVIDIA GeForce RTX 4090          vram=24076 MB  SMs=128  CC=8.9
  [cpu] Host CPU plotter                 threads=32                       (1-2 orders slower than GPU)

Both plot and batch accept --devices <SPEC> to fan plots out across multiple NVIDIA GPUs — one worker thread per device, each bound via cudaSetDevice and carrying its own buffer pool + writer channel. Plots are partitioned round-robin, so a batch of 10 plots on 2 GPUs sends plots 0/2/4/6/8 to the first GPU and 1/3/5/7/9 to the second.

# Every visible CUDA device — enumerated at runtime. No CPU worker.
xchplot2 plot -k 28 -n 10 -f <farmer-pk> -c <pool-contract> \
    -o /mnt/plots --devices gpu

# Every CUDA device PLUS a CPU worker on the same batch.
xchplot2 plot ... --devices all

# Only these specific device ids (sorted, deduplicated).
xchplot2 plot ... --devices 0,2,3

# Explicit single id (same as omitting the flag on a single-GPU host).
xchplot2 plot ... --devices 0

# CPU only, or specific GPUs + CPU as a list.
xchplot2 plot ... --devices cpu
xchplot2 plot ... --devices 0,1,cpu
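
The round-robin partition described above is simple enough to state as code. A minimal Python sketch, with a hypothetical function name; the real partitioning lives in the C++ batch driver and may differ in detail:

```python
from typing import List


def round_robin(num_plots: int, num_workers: int) -> List[List[int]]:
    """Assign plot i to worker i % num_workers, preserving plot order."""
    if num_workers < 1:
        raise ValueError("need at least one worker")
    slices: List[List[int]] = [[] for _ in range(num_workers)]
    for plot in range(num_plots):
        slices[plot % num_workers].append(plot)
    return slices
```

For 10 plots on 2 workers this gives `[[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]`, the split quoted in the text.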

Per-GPU streaming tier

Any GPU selector in --devices accepts a :tier suffix to pin the streaming tier for that device. Tier ∈ plain|compact|minimal|tiny|auto. Useful when GPUs differ in VRAM, or when one card is also serving the desktop and you want to leave it more headroom:

# All GPUs auto-pick from free VRAM, except GPU 2 which uses tiny.
xchplot2 plot ... --devices gpu,2:tiny

# All GPUs + CPU worker; GPU 2 = tiny.
xchplot2 plot ... --devices all,2:tiny

# All GPUs pinned to tiny, except GPU 2 which uses plain.
xchplot2 plot ... --devices gpu:tiny,2:plain

# All GPUs pinned to tiny, except GPU 2 which auto-picks (the `:auto`
# sentinel explicitly re-enables auto-pick for a single GPU).
xchplot2 plot ... --devices gpu:tiny,2:auto

# All-explicit form (still works).
xchplot2 plot ... --devices 0:tiny,1:minimal,2:plain

Precedence (highest wins):

Per-GPU <id>:<tier> token
gpu:<tier> / all:<tier> shorthand
Global --tier <name> / XCHPLOT2_STREAMING_TIER
Auto-pick from free VRAM

cpu:<tier> is rejected (the CPU worker doesn't use streaming tiers). Duplicate IDs with conflicting tiers (0:tiny,0:plain) and unknown tier names are also rejected at parse time.
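
The precedence and rejection rules above can be sketched as a small resolver. This is an illustrative Python model, not the parser xchplot2 ships. The name `resolve_tiers`, its arguments, and the returned dictionary shape are all hypothetical, and the sketch does not check that a listed GPU id is actually visible.

```python
from typing import Dict, List, Optional

TIERS = {"plain", "compact", "minimal", "tiny", "auto"}


def resolve_tiers(spec: str, gpu_ids: List[int],
                  global_tier: Optional[str] = None) -> Dict[str, Optional[str]]:
    """Map each selected device ("0", "1", ..., "cpu") to its streaming tier."""
    per_gpu: Dict[int, Optional[str]] = {}  # from <id> or <id>:<tier> tokens
    shorthand: Optional[str] = None         # from gpu:<tier> / all:<tier>
    all_gpus = use_cpu = False

    for token in spec.split(","):
        name, _, tier_text = token.strip().partition(":")
        tier = tier_text or None
        if tier is not None and tier not in TIERS:
            raise ValueError(f"unknown tier: {tier}")
        if name == "cpu":
            if tier is not None:
                raise ValueError("cpu:<tier> is rejected")
            use_cpu = True
        elif name in ("gpu", "all"):
            all_gpus = True
            use_cpu = use_cpu or name == "all"
            shorthand = tier or shorthand
        elif name.isdigit():
            gpu = int(name)
            previous = per_gpu.get(gpu)
            if tier and previous and tier != previous:
                raise ValueError(f"conflicting tiers for GPU {gpu}")
            per_gpu[gpu] = tier or previous
        else:
            raise ValueError(f"bad device token: {token!r}")

    selected = sorted(set(gpu_ids) | set(per_gpu)) if all_gpus else sorted(per_gpu)
    result: Dict[str, Optional[str]] = {}
    for gpu in selected:
        # Precedence, highest first: per-GPU token, shorthand, global, auto-pick.
        result[str(gpu)] = per_gpu.get(gpu) or shorthand or global_tier or "auto"
    if use_cpu:
        result["cpu"] = None  # the CPU worker has no streaming tier
    return result
```

On a three-GPU host, `resolve_tiers("gpu:tiny,2:auto", [0, 1, 2])` maps GPUs 0 and 1 to `tiny` and GPU 2 to `auto`, because the explicit per-GPU token outranks the shorthand.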

Omitting --devices runs a single worker on the CUDA default device. This is identical to the pre-multi-GPU behavior, so existing single-GPU invocations are unaffected.

Caveats for v1:

Static round-robin partition. If your GPUs differ in speed the batch finishes only as fast as the slowest worker's slice; use --devices to pick matched cards when that matters.
Each worker gets its own ~4 GB pinned host pool (pool path) or ~6 GB pinned scratch (compact streaming), so host RAM scales linearly. A 4-GPU rig pins ~16-24 GB — size accordingly.
The workers share stderr (line-buffered, atomic per-fprintf) so log lines from different GPUs may interleave.
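
The first two caveats are simple arithmetic, sketched below in Python. Both function names are hypothetical, the per-plot times in the usage note are made-up inputs, and the wall-time model ignores pipeline start-up and any overlap between workers.

```python
from typing import List


def pinned_host_gb(num_gpu_workers: int, per_worker_gb: float) -> float:
    """Pinned host memory grows linearly with the number of GPU workers."""
    return num_gpu_workers * per_worker_gb


def batch_wall_seconds(num_plots: int, secs_per_plot: List[float]) -> float:
    """Wall time of a static round-robin batch: the slowest worker's slice.

    secs_per_plot[w] is worker w's steady-state time per plot.
    """
    workers = len(secs_per_plot)
    worst = 0.0
    for w, secs in enumerate(secs_per_plot):
        # Worker w receives plots w, w + workers, w + 2 * workers, ...
        plots_for_w = len(range(w, num_plots, workers))
        worst = max(worst, plots_for_w * secs)
    return worst
```

`pinned_host_gb(4, 4)` and `pinned_host_gb(4, 6)` reproduce the 16 to 24 GB range quoted for a 4-GPU rig. With two workers at 2 s and 4 s per plot, `batch_wall_seconds(10, [2.0, 4.0])` is 20 s, twice the 10 s that two matched 2 s cards would need.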

Smoke test: scripts/test-multi-gpu.sh exercises argument parsing (works on any host, even single-GPU) and, when 2+ GPUs are visible, runs a live k=22 plot across --devices 0,1.

Lower-level subcommands

xchplot2 test          <k> <plot-id-hex> [strength] ...   # single plot, raw inputs
xchplot2 batch         <manifest.tsv> [-v] [--devices <SPEC>]
xchplot2 parity-check  [--dir PATH]                       # CPU↔GPU regression screen

Environment variables

Variable	Effect
`XCHPLOT2_STREAMING=1`	Force the low-VRAM streaming pipeline even when the pool would fit.
`XCHPLOT2_STREAMING_TIER=plain|compact|minimal|tiny`	Override the streaming-tier auto-pick (plain ~7.4 GiB peak, compact ~5.3 GiB, minimal ~3.7 GiB, tiny ~1.1 GiB; measured at k=28). Equivalent CLI flag: `--tier`. Either form forces the streaming pipeline even on cards big enough to fit the pool, so `--tier tiny` works on a 4090 too.
`POS2GPU_MAX_VRAM_MB=N`	Cap the VRAM query to N MB — exercises the streaming fallback.
`POS2GPU_STREAMING_STATS=1`	Log every streaming-path `cudaMalloc` / `cudaFree`.
`POS2GPU_POOL_DEBUG=1`	Log pool allocation sizes at construction.
`POS2GPU_PHASE_TIMING=1`	Per-phase wall-time breakdown (Xs / sort / T1 / T2 / T3) on stderr.
`CUDA_ARCHITECTURES=XX`	Override the CUDA arch autodetected from `nvidia-smi`. Takes the bare numeric form used in Build (e.g. `89`, or `89;120` for a fat build).
`CUDA_PATH=/path/to/cuda`	Override the CUDA Toolkit root for linking (default: `/opt/cuda`, `/usr/local/cuda`). Useful on JetPack / non-standard installs.
`CUDA_HOME=/path/to/cuda`	Fallback for `CUDA_PATH` — same effect.
`POS2_CHIP_DIR=/path`	Build-time: point at a local pos2-chip checkout instead of FetchContent.
`XCHPLOT2_TEST_GPU_COUNT=N`	Override `scripts/test-multi-gpu.sh`'s auto-detected GPU count (forces run / skip without consulting `nvidia-smi`).

Testing farming on a testnet

v2 (CHIP-48) farming in stock chia-blockchain is presently unfinished upstream: services aren't wired into the farmer group, a message handler's signature doesn't match its decorator, ProofOfSpace.challenge is computed from the wrong input, and the dependency pin on chia_rs excludes the 0.42 release where compute_plot_id_v2 lives. contrib/testnet-farming.patch is a minimal self-contained fix-up that gets a private testnet running end-to-end:

git clone https://github.com/Chia-Network/chia-blockchain
cd chia-blockchain
git checkout 39f8bec88   # 2.7.0 Checkpoint Merge
git apply /path/to/xchplot2/contrib/testnet-farming.patch

The patch's header comment describes each hunk. None of the changes are xchplot2-specific — they're the farmer / harvester / daemon pieces any v2 plot needs for farming, regardless of who produced it.

Architecture

src/gpu/                 CUDA kernels — AES, Xs, T1, T2, T3
src/host/
├── GpuPipeline          Xs → T1 → T2 → T3 device orchestration;
│                          pool + streaming (low-VRAM) variants
├── GpuBufferPool        persistent device + 2× pinned host pool
├── BatchPlotter         producer / consumer batch driver
└── PlotFileWriterParallel  sole TU touching pos2-chip headers
tools/xchplot2/          CLI: plot / test / batch
tools/parity/            CPU↔GPU bit-exactness tests
keygen-rs/               Rust staticlib: plot_id_v2, BLS HD, bech32m

VRAM

PoS2 plots are k=28 by spec. Five code paths (the pool path plus four streaming tiers), dispatched automatically based on available VRAM:

Pool path (~15 GB, 16 GB+ cards). The persistent buffer pool is sized worst-case and reused across plots in batch mode for amortised allocator cost and double-buffered D2H. Targets for steady-state: RTX 4080 / 4090 / 5080 / 5090, A6000, etc.
Plain streaming (~7.4 GiB floor). Allocates per-phase and frees between phases; no pinned-host parks, single-pass T2 match. Used on 10-11 GB cards that can't fit the pool but have headroom above compact. ~400 ms/plot faster than compact.
Compact streaming (~5.3 GiB floor). Park/rehydrate of the large intermediates on pinned host across their idle windows + N=2 T2 match staging (cap/2 ≈ 2280 MB at k=28). T1/T2 sorts are tiled (N=2 and N=4) with merge trees. Targets 6-8 GiB cards.
Minimal streaming (~3.7 GiB floor). Compact's parks plus six layered cuts that bring every phase below the 4 GiB cliff: (1) N=8 T2 match staging (cap/8 ≈ 570 MB at k=28); (2) N=4 T1/T2 sort gather tiling — the merged-key + permuted-meta gather output is D2H'd per tile to pinned host; (3) T3 match section-pair input slicing — d_t2_meta_sorted is parked on pinned host across T3 match, with the section_l + section_r row slices H2D'd per pass to a cap/2 device buffer (xbits + keys stay full-cap for binary-search reads); (4) N=4 T1 match slicing — each section_l pass writes to cap/4 device staging, D2H to pinned host; (5) CUB sub-phase tiling in T1/T2/T3 sort — replaces the four cap-sized uint32/uint64 sort I/O buffers with cap/N per-tile staging + host pinned accumulators, with the multi-way merge done on the CPU; and (6) Xs gen+sort+pack tiling — generate the full (keys, vals) once, then sort in cap/N tiles to host pinned accumulators (carved out of scratch.h_meta), CPU-merge, and pack into d_xs via two strided cudaMemcpy2DAsync H2D copies (no separate device-side pack buffer pair). Measured overall peak at k=28 strength=2 on RTX 4090 (compact → minimal): 5200 → 3640 MB; per-phase peaks: Xs 2570, T1 sort 3640, T2 sort 3640, T3 match 3640, T3 sort 3640. Targets 4 GiB cards (GTX 1050 Ti / 1650, RTX 3050 4GB, MX450) and fits comfortably on 5 GiB+ cards with ~2 GiB headroom. Trade-off: ~6 extra cap-sized PCIe round-trips per plot + ~6 sec/plot of host-CPU merge work — k=28 wall on sm_89: ~31 s/plot vs ~12 s for compact (~2.6×). 4 GiB cards remain an edge case since real 4 GiB hardware reports ~3.5 GiB free post-CUDA-context; please report actual fit.
Tiny streaming (~1.1 GiB floor). Full Phase 1.4 + 1.5 + 1.6 algorithm port, with peak parity with the SYCL Tiny tier. On top of Minimal's six cuts, adds: per-section-pair T1 match tile (Xs data parks on pinned host h_xs_pinned; T1 reads via per-(L,R) section H2D), per-(section_l, match_key_r) bucket-pair sub-section for T1/T2/T3 match (per-pass tile is L section + one R bucket instead of full L+R), streaming-partition T1 sort + streaming-partition T2 sort with global_idx tiebreak + tile-and-merge T3 sort (eliminates the cap-sized d_t1_meta, d_t2_mi, and d_t3 on device by partitioning to per-bucket arenas + per-bucket CUB sort), host-side T2/T3 prepare offsets (binary search on already-sorted h_keys_merged, skipping the cap-sized GPU prepare-keys H2D), d_t3_stage + d_frags_out → host-pinned aliases (T3 match writes via UVA-mapped host pinned; T3 sort lands sorted fragments directly in pinned_dst), and Xs gen+sort per-tile generation via launch_xs_gen_range (eliminates the cap × 2 × u32 full-cap gen output that the non-range path requires). Targets sub-2 GiB NVIDIA cards (Quadro P620 2 GB, GTX 1050 2 GB, older laptop dGPUs). Measured at k=28 strength=2 on RTX 4090: 1064 MB plot peak, equal to SYCL Tiny's measured 1064 MB. Per-phase peaks: Xs 1030, T1 match 1040, T1 sort 1056, T2 match 1040, T2 sort 1064 (floor), T3 match 1024, T3 sort 1047. All phases ≤ 1064 MB. Trade-off: ~17 s/plot extra wall vs minimal (per-bucket sequential gen+sort+pack+merge); k=28 wall on sm_89 is ~50 s/plot. Byte-identical to other tiers at k=22/24/26/28 (validated). There is no smaller tier: a forced tiny on a card below the floor throws.

xchplot2 queries cudaMemGetInfo at pool construction; if the pool doesn't fit, the streaming-tier dispatch picks the largest streaming tier that fits with a 128 MB margin. Force streaming on any card with XCHPLOT2_STREAMING=1. --tier plain|compact|minimal|tiny|auto (or XCHPLOT2_STREAMING_TIER) overrides the auto-pick — useful for testing or to step down from a tight margin (e.g. an 8 GiB card OOMing mid-plot can --tier compact).
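
The dispatch rule in the paragraph above (pool if it fits, otherwise the largest streaming tier that fits with a 128 MB margin) can be modelled in a few lines. This Python sketch is an illustration under stated assumptions: the MiB constants are rounded from the approximate k=28 peaks quoted in this README, the pool size is taken as 15 GiB, and the real picker's exact thresholds are not given here. The name `pick_path` is hypothetical.

```python
# Assumed constants, derived from the approximate figures in this README.
POOL_MIB = 15 * 1024
STREAMING_TIERS = [      # largest first
    ("plain", 7578),     # ~7.4 GiB
    ("compact", 5427),   # ~5.3 GiB
    ("minimal", 3789),   # ~3.7 GiB
    ("tiny", 1126),      # ~1.1 GiB
]
MARGIN_MIB = 128


def pick_path(free_mib: int, force_streaming: bool = False) -> str:
    """Return "pool" or the largest streaming tier that fits with the margin."""
    if not force_streaming and free_mib >= POOL_MIB:
        return "pool"
    for name, peak_mib in STREAMING_TIERS:
        if free_mib >= peak_mib + MARGIN_MIB:
            return name
    # There is no tier below tiny, so this sketch raises, as the text says
    # a forced tiny below the floor throws.
    raise RuntimeError("free VRAM is below the tiny tier's floor")
```

Passing `force_streaming=True` plays the role of XCHPLOT2_STREAMING=1: a card with 24000 MiB free then lands on `plain` instead of `pool`.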

Plot output is bit-identical across all paths — streaming reorganises memory, not algorithms.

Performance

k=28, strength=2, RTX 4090 (sm_89), PCIe Gen4 x16:

Mode	Per plot
pos2-chip CPU baseline	~50 s
`xchplot2 batch` steady-state wall (pool path)	2.15 s
`xchplot2 batch` steady-state wall (streaming path, ≤8 GB cards)	~3.7 s
Producer GPU time, steady-state	1.96 s
Device-kernel floor (single-plot nsys)	1.91 s

Numbers above are single-GPU. With --devices 0,1,... the batch is partitioned round-robin across N worker threads (one per device), so wall-clock throughput is bounded by the slowest device's slice — ≈ linear scaling on matched cards, less if cards differ. Live multi-GPU plots were confirmed end-to-end on NVIDIA.
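
The table's per-plot times translate into throughput with straightforward arithmetic. A short Python sketch with hypothetical function names; the multi-GPU figure assumes the near-linear scaling on matched cards described above and is not a measured result.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def plots_per_day(secs_per_plot: float, matched_gpus: int = 1) -> float:
    """Steady-state plots per day, assuming near-linear scaling on matched cards."""
    return matched_gpus * SECONDS_PER_DAY / secs_per_plot


def speedup(baseline_secs: float, secs_per_plot: float) -> float:
    """How many times faster one mode is than a baseline."""
    return baseline_secs / secs_per_plot
```

At 2.15 s per plot, `plots_per_day(2.15)` is roughly 40,186 plots per day on one GPU, and `speedup(50, 2.15)` is roughly 23 times the ~50 s CPU baseline.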

License

MIT — see LICENSE and NOTICE for third-party attributions. Built collaboratively with Claude.
