While GSPN-1 established the algorithmic breakthrough of line-scan propagation, profiling revealed substantial headroom for system-level optimization: the initial implementation achieved only 3–8% of peak memory bandwidth due to per-column kernel launches and strided memory access. GSPN-2 restructures the pipeline into three stages:
- Fused propagation. A single kernel now handles entire scans across chunks, batches, and channels, with 2D thread blocks mapping rows and channel slices (see the kernel sketch after this list).
- On-chip reuse. Tile-local hidden states are staged in shared memory when beneficial, while coalesced reads keep the L1/L2 caches warm for structured access patterns (also visible in the same sketch).
- Proxy feature space. Inputs are projected to a compact channel subspace before propagation and lifted back afterward, shrinking the concurrent block budget without losing expressiveness (see the second sketch below).
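To make the first two stages concrete, here is a minimal sketch of a fused line-scan kernel, not the released GSPN-2 code. It assumes a 3-tap left-to-right recurrence, inputs pre-transposed to a column-contiguous [B, C, W, H] layout so loads coalesce across row threads, and illustrative names (`fused_scan`, `TILE_ROWS`, `CH_TILE` are not from the paper):

```cuda
// Hypothetical fused-propagation kernel: a minimal sketch under the assumed
// 3-tap left-to-right recurrence
//   h[r][c] = x[r][c] + sum over d in {-1,0,1} of w_d[r][c] * h[r+d][c-1]
// Tensors are assumed pre-transposed to [B, C, W, H] so each column is
// contiguous in memory and loads coalesce across the row threads.

#include <cuda_runtime.h>

constexpr int TILE_ROWS = 64;  // threads along the row axis
constexpr int CH_TILE   = 4;   // channel slice handled by one block

__global__ void fused_scan(const float* __restrict__ x,   // [B, C, W, H]
                           const float* __restrict__ w,   // [3, B, C, W, H]
                           float* __restrict__ h,         // [B, C, W, H]
                           int B, int C, int H, int W) {
    int bc  = blockIdx.y * CH_TILE + threadIdx.y;      // fused batch*channel
    int row = blockIdx.x * TILE_ROWS + threadIdx.x;
    bool active = (bc < B * C) && (row < H);

    const size_t plane = (size_t)B * C * W * H;        // stride between taps
    const float* xp = x + (size_t)bc * W * H;
    const float* wp = w + (size_t)bc * W * H;
    float*       hp = h + (size_t)bc * W * H;

    // The previous column's hidden state lives in shared memory so each
    // thread can read its row neighbors without a global-memory round trip.
    // Halo rows at the tile edges are zeroed in this sketch; a production
    // kernel would exchange them with neighboring tiles.
    __shared__ float prev[CH_TILE][TILE_ROWS + 2];
    int t = threadIdx.x + 1;
    prev[threadIdx.y][t] = 0.f;
    if (threadIdx.x == 0) {
        prev[threadIdx.y][0] = 0.f;
        prev[threadIdx.y][TILE_ROWS + 1] = 0.f;
    }
    __syncthreads();

    // One block marches across all W columns: a single launch replaces the
    // W per-column launches of the baseline.
    for (int c = 0; c < W; ++c) {
        float hv = 0.f;
        if (active) {
            size_t idx = (size_t)c * H + row;          // coalesced over rows
            hv = xp[idx]
               + wp[idx + 0 * plane] * prev[threadIdx.y][t - 1]
               + wp[idx + 1 * plane] * prev[threadIdx.y][t]
               + wp[idx + 2 * plane] * prev[threadIdx.y][t + 1];
            hp[idx] = hv;
        }
        __syncthreads();                 // everyone is done reading prev[]
        prev[threadIdx.y][t] = hv;       // publish state for column c + 1
        __syncthreads();
    }
}
```

A single launch, e.g. with `dim3 block(TILE_ROWS, CH_TILE)` and `dim3 grid((H + TILE_ROWS - 1) / TILE_ROWS, (B * C + CH_TILE - 1) / CH_TILE)`, covers the whole scan; the shared-memory buffer is what lets neighboring rows exchange state on-chip rather than through global memory.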
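The proxy feature space reduces to a down-project / scan / up-project sandwich. The sketch below shows the projection step as a plain CUDA kernel, assuming learned matrices for the down- and up-projections (names and shapes here are hypothetical); a production path would likely use a batched GEMM rather than the inner loop:

```cuda
// Hypothetical channel projection for the proxy-space pattern. Called once
// with Cout = Cp << C before the scan (down-projection) and once with
// Cout = C after it (lifting back), so the scan runs on far fewer channels
// and the concurrent block budget shrinks accordingly.

__global__ void project_channels(const float* __restrict__ x,   // [B, Cin, N]
                                 const float* __restrict__ P,   // [Cout, Cin]
                                 float* __restrict__ y,         // [B, Cout, N]
                                 int B, int Cin, int Cout, int N) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;   // pixel index
    int b  = blockIdx.y;                                        // batch
    int co = blockIdx.z;                                        // out channel
    if (i >= (size_t)N) return;

    float acc = 0.f;
    for (int ci = 0; ci < Cin; ++ci)                 // channel counts are
        acc += P[co * Cin + ci]                      // small, so a plain loop
             * x[((size_t)b * Cin + ci) * N + i];    // is adequate for a
    y[((size_t)b * Cout + co) * N + i] = acc;        // sketch; see note above.
}
```

With N = H * W, a launch such as `dim3 block(256)` and `dim3 grid((N + 255) / 256, B, Cout)` keeps loads and stores coalesced over the pixel index.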
Step-by-Step CUDA Optimization
We benchmark a typical configuration (1024×1024 resolution, batch size 16, 8 channels) and quantify the impact of each CUDA kernel optimization. Starting from the GSPN-1 baseline (71.4 ms), we apply the system-level optimizations cumulatively to fully exploit modern GPU architectures; their effects compound as follows: