GPU plotter for Chia v2 proofs of space (CHIP-48). Produces farmable
.plot2 files byte-identical to the
pos2-chip CPU reference.
# Install — needs CUDA Toolkit 12+, CMake ≥ 3.24, a C++20 compiler,
# and Rust. NVIDIA only.
cargo install --git https://github.com/Jsewill/xchplot2 --branch cuda-only
# Plot — 10 × k=28 files, keys derived internally from your BLS pair.
xchplot2 plot -k 28 -n 10 \
-f <farmer-pk-hex> \
-c <pool-contract-xch1-or-txch1> \
-o /mnt/plots
# Multi-GPU — one worker per GPU, round-robin partition.
# (`--devices all` adds a CPU worker too; `--devices gpu` sticks to GPUs.)
xchplot2 plot ... --devices gpuSee Hardware compatibility for GPU / VRAM /
OS requirements, Build for alternative install paths, and
Use for every flag. Windows users: that cargo install
line works as-is from an x64 Native Tools Command Prompt for VS 2022
— see Windows (experimental) for the
prereqs (Windows SDK, LIB setup, LNK1181 troubleshooting).
-
GPU: NVIDIA, compute capability ≥ 5.0 (Maxwell / GTX 750-class and newer). Builds auto-detect the installed GPU's
compute_capvianvidia-smi; override with$CUDA_ARCHITECTURESfor fat or cross-target builds (see Build). Pre-sm_53 cards lack native FP16 ALUs, butcuda_fp16.hfalls back to fp32 emulation for the half-precision intrinsics — any kernel paths touching FP16 still work correctly, with the emulation cost. The AES + match kernels at the heart of plotting are integer-only and see no FP16 penalty. -
VRAM: ~1.1 GiB minimum at k=28. Cards with < 15 GB free use the streaming pipeline (four sub-tiers — plain ~7.4 GiB, compact ~5.3 GiB, minimal ~3.7 GiB, tiny ~1.1 GiB — auto-picked by free VRAM); 16 GB+ cards use the persistent buffer pool for faster steady-state. All paths produce byte-identical plots. Detailed breakdown in VRAM.
With
--devices, each worker picks its own pool-vs-streaming path from its own GPU's free VRAM — heterogeneous rigs (e.g. one 16 GB + one 8 GB card) plot concurrently with each device on its matching path. The<id>:<tier>suffix on--devices(see Per-GPU streaming tier) overrides the auto-pick per GPU, useful when a card is also serving the desktop and needs more headroom than the picker would leave. -
PCIe: Gen4 x16 or wider recommended. A physically narrower slot (e.g. Gen4 x4) adds ~240 ms per plot to the final fragment D2H copy; check
cat /sys/bus/pci/devices/*/current_link_widthunder load if throughput looks off. -
Host RAM: ≥ 16 GB recommended;
batchmode pins ~4 GB of host memory for D2H double-buffering (pool or streaming). -
CUDA Toolkit: 12+ required to build (tested on 13.x). The toolkit-vs-arch matrix:
sm_50–sm_72(Maxwell / Pascal / Volta): need CUDA 12.9 (last toolkit with codegen for these arches — 13.x dropped them entirely).build.rscatches the 13.x + old-arch pairing in a preflight and points at the fix path.sm_75–sm_90(Turing / Ampere / Hopper): 12.x or 13.x both work.sm_120(RTX 50-series Blackwell): need 12.8+; earlier toolkits lack Blackwell codegen.
-
CPU architecture:
x86_64is the tested path.aarch64is also supported for NVIDIA ARM platforms — Jetson Orin (sm_87), IGX Orin, and Grace Hopper / GH200 (sm_90, SBSA).build.rspickssm_87as the aarch64 fallback arch whennvidia-smiisn't available, and searches the JetPack (targets/aarch64-linux/lib) and SBSA (targets/sbsa-linux/lib) CUDA library layouts. Apple Silicon is not supported (no CUDA on macOS). -
OS: Linux (tested on modern glibc distributions) is the supported path. Windows builds are possible via MSVC + CUDA — see Windows (experimental) below. macOS is not supported (no CUDA).
Requires CUDA Toolkit 12.0+ (12.0 is the floor — cudaGetDeviceProperties_v2,
the v2 ABI we link, and CUDA C++20 dialect all need 12.0; 12.9 is the
newest tested), C++20 host compiler, CMake ≥ 3.26 (3.26+ knows
how to drive nvcc 12.5+; lower works for older nvcc), and a Rust
toolchain new enough to parse edition2024 (rustc ≥ 1.85, i.e.
rustup stable; most distro-packaged Rust is too old).
| Distro | CUDA source | CMake source | Rust source |
|---|---|---|---|
| Ubuntu 24.04 | apt nvidia-cuda-toolkit (12.0) |
apt cmake (3.28) |
rustup stable |
| Ubuntu 24.04 | NVIDIA apt repo cuda-toolkit-12-9 |
apt cmake (3.28) |
rustup stable |
| Ubuntu 22.04 | NVIDIA apt repo cuda-toolkit-12-9 |
Kitware apt cmake |
rustup stable |
| Debian 12 (Bookworm) | NVIDIA apt repo cuda-toolkit-12-9 |
Kitware apt cmake |
rustup stable |
| Fedora 41 | NVIDIA dnf repo cuda-toolkit-12-9 |
dnf cmake (3.30) |
rustup stable |
| Rocky / Alma / RHEL 9 | NVIDIA dnf repo cuda-toolkit-12-9 |
dnf cmake (3.26) |
rustup stable |
| Arch / CachyOS | pacman cuda (12.x) |
pacman cmake |
pacman rust or rustup |
Combinations that don't work on a stock install:
- Ubuntu 22.04 + apt CUDA: ships CUDA 11.5 — nvcc too old for the
C++20 dialect we use, and the v1-ABI
libcudartlackscudaGetDeviceProperties_v2. Use NVIDIA's apt repo instead. - Debian 12 + apt CUDA + apt CMake: stock CMake 3.25 doesn't know how to drive nvcc 12.5+. Use Kitware's CMake apt repo.
- Ubuntu 22.04/24.04 + apt cargo: distro-packaged Rust (1.75) can't
parse
edition2024required by thechia-client0.42 dep tree. Install rustup instead. - WSL: works the same as native — the only WSL-specific bits are
the
libcuda.soinjection at/usr/lib/wsl/lib(driver, not runtime). Install the toolkit + rustup inside the WSL distro.
# rustup, if not already installed (apt/dnf cargo is too old)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
cargo install --git https://github.com/Jsewill/xchplot2The CUDA runtime is statically linked into the binary, so users don't
need any libcudart.so version pinning at runtime, and there's no
class of "wrong libcudart on linker path" install failures regardless
of how mixed the user's previous CUDA installs are. The binary is ~1 MB
larger (3.8 MB vs 2.8 MB) for that property.
build.rs auto-detects the local GPU's compute capability by querying
nvidia-smi --query-gpu=compute_cap and builds for only that
architecture. That keeps the binary small and the build fast when the
install and the target GPU are the same machine.
If auto-detection fails (no nvidia-smi in PATH, or
nvidia-smi can't see a GPU — common when building inside a container
or on a headless build host that lacks the CUDA driver), the build
falls back to sm_89.
If you need to target a GPU that isn't the one doing the build — or if
you want a single "fat build" binary that covers multiple
architectures — override with $CUDA_ARCHITECTURES:
# Fat build for Ada (4090) and Blackwell (5090):
CUDA_ARCHITECTURES="89;120" cargo install --git https://github.com/Jsewill/xchplot2
# Single target (e.g. Turing 2080 Ti):
CUDA_ARCHITECTURES=75 cargo install --git https://github.com/Jsewill/xchplot2Common values: 52 GTX 9-series (Maxwell, needs CUDA 12.9 toolkit),
61 GTX 10-series, 70 Volta, 75 Turing, 80 A100, 86 RTX 30-
series, 89 RTX 40-series, 90 H100, 120 RTX 50-series.
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build -jpos2-chip is auto-fetched via FetchContent; override with
-DPOS2_CHIP_DIR=/abs/path/to/pos2-chip to point at a local checkout.
Outputs:
build/tools/xchplot2/xchplot2build/tools/parity/{aes,xs,t1,t2,t3}_parity— bit-exact CPU/GPU tests
The CUDA Toolkit + Rust toolchain live inside the image — the host
only needs an engine plus nvidia-container-toolkit for GPU
pass-through. scripts/install-container-deps.sh installs both, then
scripts/build-container.sh probes nvidia-smi for the right
CUDA_ARCH and runs compose build:
./scripts/install-container-deps.sh # one-time: podman + nvidia-container-toolkit + CDI
./scripts/build-container.sh # auto-pins CUDA 12.9 base on pre-Turing rigs
podman compose run --rm cuda plot -k 28 -n 10 \
-f <farmer-pk> -c <pool-contract> -o /outPlot files land in ./plots/ on the host. compose.yaml uses CDI
shorthand (devices: - nvidia.com/gpu=all) so the runtime path is
podman-first; bare docker run --gpus all still works after
install-container-deps.sh --engine docker, but the docker compose run step won't see the GPU.
This branch is CUDA-only, so a Windows build needs nothing beyond the
standard NVIDIA toolchain — no SYCL runtime required. Only one POSIX
site in the code (Cancel.cpp) and it's already #if defined(__unix__)
-guarded. This path is untested — please file an issue with your
results.
Prerequisites:
- Windows 10 21H2+ or Windows 11, x64
- Visual Studio 2022 Community
with the "Desktop development with C++" workload. That workload
bundles MSVC + the Windows SDK; the SDK is non-optional because it
ships
kernel32.lib/user32.lib/ etc. thatlink.execonsumes. If you've trimmed the installer to "C++ build tools" only, open Visual Studio Installer → Modify → Individual components and tick the latest Windows 11 SDK before retrying. - CUDA Toolkit 12.0+ —
install after Visual Studio so the CUDA installer wires up the
MSBuild integration. 12.8+ required for RTX 50-series (Blackwell,
sm_120). - Rust using the MSVC
toolchain (
rustup default stable-x86_64-pc-windows-msvc) - CMake 3.24+ and Git for Windows
Launch the x64 Native Tools Command Prompt for VS 2022 from the
Start menu — there are several similarly-named prompts (x86 /
x86_64 / 2019 / 2022); the one that matters is the x64 for 2022.
That prompt is the one that sets LIB, INCLUDE, and PATH so
cl.exe, link.exe, nvcc, and cmake all see each other plus
the Windows SDK. A plain cmd / PowerShell / Windows Terminal tab
does not do this — running cargo install from one of those
produces LNK1181: cannot open input file 'kernel32.lib' at the
first link step.
Quick sanity check in the prompt:
where link.exe
echo %LIB%%LIB% should include a ...\Windows Kits\10\Lib\...\um\x64
entry. If it doesn't, you're in the wrong prompt or the Windows SDK
component isn't installed.
Build:
set CUDA_ARCHITECTURES=89
cargo install --git https://github.com/Jsewill/xchplot2 --branch cuda-onlyOr for a local checkout you can iterate on:
git clone -b cuda-only https://github.com/Jsewill/xchplot2
cd xchplot2
set CUDA_ARCHITECTURES=89
cargo install --path .Set CUDA_ARCHITECTURES to match your card (see the list above).
PowerShell users: use $env:CUDA_ARCHITECTURES = "89" instead of
set. The CMake path (cmake -B build -S . && cmake --build build)
also works inside the same Native Tools prompt if you prefer that over
cargo install.
xchplot2 plot -k 28 -n 10 \
-f <farmer-pk> \
-c <pool-contract-address> \
-o <output-dir>Pool variants: -p <pool-pk> or --pool-ph <pool-ph>. Other common
flags: -s <strength>, -T testnet, -S <seed> for reproducible runs,
-v verbose. Full help: xchplot2 -h.
Both are v2 PoS fields and default to 0.
<plot-index> (u16) is the within-group identifier; plot -n N
uses it as the base and increments per plot (so -i 0 -n 1000
produces plots with plot_index 0..999).
<meta-group> (u8) is a challenge-isolation boundary — plots with
different meta_group values are guaranteed never to pass the same
challenge.
The PoS2 spec defines a grouped-plot file layout (multiple plots
interleaved into one container per storage device, for harvester
seek amortization), but the on-disk format is not yet defined
upstream in pos2-chip / chia-rs. xchplot2 currently produces one
.plot2 file per plot — this is in lieu of those upstream
decisions. When the grouped layout lands, the auto-incrementing
<plot-index> above is the per-plot within-group identifier it
will expect.
xchplot2 devices prints id, name, VRAM, SM count, and compute
capability for every visible CUDA device, plus the host CPU plotter
row. Use the printed [N] / [cpu] index with --devices:
$ xchplot2 devices
Visible devices (1 GPU + 1 CPU):
[0] NVIDIA GeForce RTX 4090 vram=24076 MB SMs=128 CC=8.9
[cpu] Host CPU plotter threads=32 (1-2 orders slower than GPU)
Both plot and batch accept --devices <SPEC> to fan plots out
across multiple NVIDIA GPUs — one worker thread per device, each
bound via cudaSetDevice and carrying its own buffer pool + writer
channel. Plots are partitioned round-robin, so a batch of 10 plots on
2 GPUs sends plots 0/2/4/6/8 to the first GPU and 1/3/5/7/9 to the
second.
# Every visible CUDA device — enumerated at runtime. No CPU worker.
xchplot2 plot --k 28 --num 10 -f <farmer-pk> -c <pool-contract> \
--out /mnt/plots --devices gpu
# Every CUDA device PLUS a CPU worker on the same batch.
xchplot2 plot ... --devices all
# Only these specific device ids (sorted, deduplicated).
xchplot2 plot ... --devices 0,2,3
# Explicit single id (same as omitting the flag on a single-GPU host).
xchplot2 plot ... --devices 0
# CPU only, or specific GPUs + CPU as a list.
xchplot2 plot ... --devices cpu
xchplot2 plot ... --devices 0,1,cpuAny GPU selector in --devices accepts a :tier suffix to pin the
streaming tier for that device. Tier ∈ plain|compact|minimal|tiny|auto.
Useful when GPUs differ in VRAM, or when one card is also serving
the desktop and you want to leave it more headroom:
# All GPUs auto-pick from free VRAM, except GPU 2 which uses tiny.
xchplot2 plot ... --devices gpu,2:tiny
# All GPUs + CPU worker; GPU 2 = tiny.
xchplot2 plot ... --devices all,2:tiny
# All GPUs pinned to tiny, except GPU 2 which uses plain.
xchplot2 plot ... --devices gpu:tiny,2:plain
# All GPUs pinned to tiny, except GPU 2 which auto-picks (the `:auto`
# sentinel explicitly re-enables auto-pick for a single GPU).
xchplot2 plot ... --devices gpu:tiny,2:auto
# All-explicit form (still works).
xchplot2 plot ... --devices 0:tiny,1:minimal,2:plainPrecedence (highest wins):
- Per-GPU
<id>:<tier>token gpu:<tier>/all:<tier>shorthand- Global
--tier <name>/XCHPLOT2_STREAMING_TIER - Auto-pick from free VRAM
cpu:<tier> is rejected (the CPU worker doesn't use streaming
tiers). Duplicate IDs with conflicting tiers (0:tiny,0:plain) and
unknown tier names are also rejected at parse time.
Omitted flag = single device on the CUDA-default device — identical to pre-multi-GPU behavior, zero regression risk.
Caveats for v1:
- Static round-robin partition. If your GPUs differ in speed the
batch finishes only as fast as the slowest worker's slice; use
--devicesto pick matched cards when that matters. - Each worker gets its own ~4 GB pinned host pool (pool path) or ~6 GB pinned scratch (compact streaming), so host RAM scales linearly. A 4-GPU rig pins ~16-24 GB — size accordingly.
- The workers share
stderr(line-buffered, atomic per-fprintf) so log lines from different GPUs may interleave.
Smoke test: scripts/test-multi-gpu.sh exercises argument parsing
(works on any host, even single-GPU) and, when 2+ GPUs are visible,
runs a live k=22 plot across --devices 0,1.
xchplot2 test <k> <plot-id-hex> [strength] ... # single plot, raw inputs
xchplot2 batch <manifest.tsv> [-v] [--devices <SPEC>]
xchplot2 parity-check [--dir PATH] # CPU↔GPU regression screen| Variable | Effect |
|---|---|
XCHPLOT2_STREAMING=1 |
Force the low-VRAM streaming pipeline even when the pool would fit. |
XCHPLOT2_STREAMING_TIER=plain|compact|minimal|tiny |
Override the streaming-tier auto-pick (plain ~7.4 GiB peak, compact ~5.3 GiB, minimal ~3.7 GiB, tiny ~1.1 GiB — k=28 measured). Equivalent CLI flag: --tier. Either form forces the streaming pipeline even on cards big enough to fit the pool, so --tier tiny works on a 4090 too. |
POS2GPU_MAX_VRAM_MB=N |
Cap the VRAM query to N MB — exercises the streaming fallback. |
POS2GPU_STREAMING_STATS=1 |
Log every streaming-path cudaMalloc / cudaFree. |
POS2GPU_POOL_DEBUG=1 |
Log pool allocation sizes at construction. |
POS2GPU_PHASE_TIMING=1 |
Per-phase wall-time breakdown (Xs / sort / T1 / T2 / T3) on stderr. |
CUDA_ARCHITECTURES=sm_XX |
Override the CUDA arch autodetected from nvidia-smi. |
CUDA_PATH=/path/to/cuda |
Override the CUDA Toolkit root for linking (default: /opt/cuda, /usr/local/cuda). Useful on JetPack / non-standard installs. |
CUDA_HOME=/path/to/cuda |
Fallback for CUDA_PATH — same effect. |
POS2_CHIP_DIR=/path |
Build-time: point at a local pos2-chip checkout instead of FetchContent. |
XCHPLOT2_TEST_GPU_COUNT=N |
Override scripts/test-multi-gpu.sh's auto-detected GPU count (forces run / skip without consulting nvidia-smi). |
v2 (CHIP-48) farming in stock chia-blockchain is presently unfinished
upstream — services aren't wired into the farmer group, a message
handler's signature doesn't match its decorator, ProofOfSpace. challenge is computed from the wrong input, and the dependency pin
on chia_rs excludes the 0.42 release where compute_plot_id_v2
lives. contrib/testnet-farming.patch is a minimal self-contained
fix-up that gets a private testnet running end-to-end:
git clone https://github.com/Chia-Network/chia-blockchain
cd chia-blockchain
git checkout 39f8bec88 # 2.7.0 Checkpoint Merge
git apply /path/to/xchplot2/contrib/testnet-farming.patchThe patch's header comment describes each hunk. None of the changes are xchplot2-specific — they're the farmer / harvester / daemon pieces any v2 plot needs for farming, regardless of who produced it.
src/gpu/ CUDA kernels — AES, Xs, T1, T2, T3
src/host/
├── GpuPipeline Xs → T1 → T2 → T3 device orchestration;
│ pool + streaming (low-VRAM) variants
├── GpuBufferPool persistent device + 2× pinned host pool
├── BatchPlotter producer / consumer batch driver
└── PlotFileWriterParallel sole TU touching pos2-chip headers
tools/xchplot2/ CLI: plot / test / batch
tools/parity/ CPU↔GPU bit-exactness tests
keygen-rs/ Rust staticlib: plot_id_v2, BLS HD, bech32m
PoS2 plots are k=28 by spec. Four code paths, dispatched automatically based on available VRAM:
- Pool path (~15 GB, 16 GB+ cards). The persistent buffer pool is
sized worst-case and reused across plots in
batchmode for amortised allocator cost and double-buffered D2H. Targets for steady-state: RTX 4080 / 4090 / 5080 / 5090, A6000, etc. - Plain streaming (~7.4 GiB floor). Allocates per-phase and frees between phases; no pinned-host parks, single-pass T2 match. Used on 10-11 GB cards that can't fit the pool but have headroom above compact. ~400 ms/plot faster than compact.
- Compact streaming (~5.3 GiB floor). Park/rehydrate of the large intermediates on pinned host across their idle windows + N=2 T2 match staging (cap/2 ≈ 2280 MB at k=28). T1/T2 sorts are tiled (N=2 and N=4) with merge trees. Targets 6-8 GiB cards.
- Minimal streaming (~3.7 GiB floor). Compact's parks plus six layered cuts that bring every phase below the 4 GiB cliff: (1) N=8 T2 match staging (cap/8 ≈ 570 MB at k=28); (2) N=4 T1/T2 sort gather tiling — the merged-key + permuted-meta gather output is D2H'd per tile to pinned host; (3) T3 match section-pair input slicing — d_t2_meta_sorted is parked on pinned host across T3 match, with the section_l + section_r row slices H2D'd per pass to a cap/2 device buffer (xbits + keys stay full-cap for binary-search reads); (4) N=4 T1 match slicing — each section_l pass writes to cap/4 device staging, D2H to pinned host; (5) CUB sub-phase tiling in T1/T2/T3 sort — replaces the four cap-sized uint32/uint64 sort I/O buffers with cap/N per-tile staging + host pinned accumulators, with the multi-way merge done on the CPU; and (6) Xs gen+sort+pack tiling — generate the full (keys, vals) once, then sort in cap/N tiles to host pinned accumulators (carved out of scratch.h_meta), CPU-merge, and pack into d_xs via two strided cudaMemcpy2DAsync H2D copies (no separate device-side pack buffer pair). Measured overall peak at k=28 strength=2 on RTX 4090 (compact → minimal): 5200 → 3640 MB; per-phase peaks: Xs 2570, T1 sort 3640, T2 sort 3640, T3 match 3640, T3 sort 3640. Targets 4 GiB cards (GTX 1050 Ti / 1650, RTX 3050 4GB, MX450) and fits comfortably on 5 GiB+ cards with ~2 GiB headroom. Trade-off: ~6 extra cap-sized PCIe round-trips per plot + ~6 sec/plot of host-CPU merge work — k=28 wall on sm_89: ~31 s/plot vs ~12 s for compact (~2.6×). 4 GiB cards remain an edge case since real 4 GiB hardware reports ~3.5 GiB free post-CUDA-context; please report actual fit.
- Tiny streaming (~1.1 GiB floor). Full Phase 1.4 + 1.5 + 1.6
algorithm port, byte-for-byte peak parity with the SYCL Tiny
tier. On top of Minimal's six cuts, adds: per-section-pair T1
match tile (Xs data parks on pinned host h_xs_pinned; T1 reads
via per-(L,R) section H2D), per-(section_l, match_key_r)
bucket-pair sub-section for T1/T2/T3 match (per-pass tile is L
section + one R bucket instead of full L+R), streaming-partition
T1 sort + streaming-partition T2 sort with global_idx tiebreak +
tile-and-merge T3 sort (eliminates the cap-sized d_t1_meta,
d_t2_mi, and d_t3 on device by partitioning to per-bucket arenas
- per-bucket CUB sort), host-side T2/T3 prepare offsets (binary search on already-sorted h_keys_merged, skipping the cap-sized GPU prepare-keys H2D), d_t3_stage + d_frags_out → host-pinned aliases (T3 match writes via UVA-mapped host pinned; T3 sort lands sorted fragments directly in pinned_dst), and Xs gen+sort per-tile generation via launch_xs_gen_range (eliminates the cap × 2 × u32 full-cap gen output that the non-range path requires). Targets sub-2 GiB NVIDIA cards (Quadro P620 2 GB, GTX 1050 2 GB, older laptop dGPUs). Measured at k=28 strength=2 on RTX 4090: 1064 MB plot peak — byte-identical to SYCL Tiny's measured 1064 MB. Per-phase peaks: Xs 1030, T1 match 1040, T1 sort 1056, T2 match 1040, T2 sort 1064 (floor), T3 match 1024, T3 sort 1047. All phases ≤ 1064 MB. Trade-off: ~17 s/plot extra wall vs minimal (per-bucket sequential gen+sort+pack+merge) — k=28 wall on sm_89 ~50 s/plot. Byte-identical to other tiers at k=22/24/26/28 (validated). There is no smaller tier — a forced tiny on a card below the floor throws.
xchplot2 queries cudaMemGetInfo at pool construction; if the
pool doesn't fit, the streaming-tier dispatch picks the largest
streaming tier that fits with a 128 MB margin. Force streaming on
any card with XCHPLOT2_STREAMING=1. --tier plain|compact|minimal|tiny|auto (or XCHPLOT2_STREAMING_TIER)
overrides the auto-pick — useful for testing or to step down from
a tight margin (e.g. an 8 GiB card OOMing mid-plot can
--tier compact).
Plot output is bit-identical across all paths — streaming reorganises memory, not algorithms.
k=28, strength=2, RTX 4090 (sm_89), PCIe Gen4 x16:
| Mode | Per plot |
|---|---|
| pos2-chip CPU baseline | ~50 s |
xchplot2 batch steady-state wall (pool path) |
2.15 s |
xchplot2 batch steady-state wall (streaming path, ≤8 GB cards) |
~3.7 s |
| Producer GPU time, steady-state | 1.96 s |
| Device-kernel floor (single-plot nsys) | 1.91 s |
Numbers above are single-GPU. With --devices 0,1,... the batch is
partitioned round-robin across N worker threads (one per device), so
wall-clock throughput is bounded by the slowest device's slice —
≈ linear scaling on matched cards, less if cards differ. Live
multi-GPU plots were confirmed end-to-end on NVIDIA.
MIT — see LICENSE and NOTICE for third-party attributions. Built collaboratively with Claude.