
GSPN-2: Efficient Parallel Sequence Modeling

¹NVIDIA  ²The University of Hong Kong  ³University of California, San Diego
† Work done while interning at NVIDIA   * Corresponding author

Abstract


Vision transformers underpin nearly every state-of-the-art vision foundation model—text-to-image diffusion networks, vision-language aligners, and detection/segmentation pipelines all depend on dense self-attention. However, attention scales quadratically with pixel count, forcing practical deployments to cap resolutions. Generalized Spatial Propagation Network (GSPN) replaces 2D self-attention with a line-scan approach that reduces complexity from quadratic to approximately linear, while achieving up to 84× speedup for 16K-resolution diffusion inference.

Building upon GSPN-1's strong algorithmic foundation, GSPN-2 introduces a joint algorithm–system redesign to unlock the full potential of the line-scan approach: (a) consolidating all propagation into a single unified CUDA kernel; (b) introducing compact multi-channel propagation that projects features into a proxy space; and (c) refining grid/block configuration with coalesced memory access. These system-level optimizations address the implementation bottlenecks in GSPN-1, enabling GSPN-2 to achieve near-theoretical peak memory bandwidth. On an NVIDIA A100 GPU, runtime for a 1024×1024×8 input decreases from 71.4 ms in GSPN-1 to just 1.8 ms—achieving a 40× speedup while maintaining transformer-level accuracy across ImageNet and text-to-image synthesis.

Figure 1: GSPN-2 achieves transformative performance improvements over GSPN-1, running 30–50× faster across diverse input configurations on modern GPU architectures.

GSPN-2 Highlights


Single-Kernel Propagation

All four directional scans are fused into one CUDA kernel, eliminating thousands of micro-launches and keeping the GPU saturated.

Compact Channel Proxy

A lightweight projection compresses channels before propagation, preventing concurrency saturation and enabling near-constant latency.

Hardware-Aligned Affinities

Channel-shared propagation matrices align with transformer affinity maps, reducing parameters while preserving dense spatial context.
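For readers new to GSPN, the propagation these highlights accelerate is GSPN-1's column-wise linear recurrence. The following uses our own slightly simplified notation, assuming the three-way neighbor connections of spatial propagation networks. A left-to-right scan over an H×W feature map computes, for columns j = 1, ..., W,

    h_j = \Lambda_j \odot x_j + A_j \, h_{j-1}

where x_j, h_j \in \mathbb{R}^{H} are the input and hidden columns, \Lambda_j is an element-wise gate, and A_j is tridiagonal, so each pixel aggregates its three nearest neighbors in the previous column. Sharing one set of A_j across channels yields the channel-shared affinities above, and attention's quadratic pairwise interactions are replaced by W such steps, hence the approximately linear complexity.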

Algorithm–System Co-Design


While GSPN-1 established the algorithmic breakthrough of line-scan propagation, profiling exposed a system-level bottleneck: the initial implementation reached only 3–8% of peak memory bandwidth due to per-column kernel launches and strided memory access. GSPN-2 restructures the pipeline into three stages:

  1. Fused propagation. A single kernel now handles entire scans across chunks, batches, and channels, with 2D thread blocks mapping rows and channel slices (a minimal sketch follows this list).
  2. On-chip reuse. Tile-local hidden states are staged in shared memory when beneficial, while coalesced reads keep L1/L2 caches warm for structured access patterns.
  3. Proxy feature space. Inputs are projected to a compact channel subspace before propagation and lifted back afterward, shrinking the concurrent block budget without losing expressiveness.
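The fused design can be sketched as follows. This is illustrative only: the kernel name, tensor shapes, and weight layout are our assumptions, not the released implementation. One thread block handles one (batch, channel) slice, each thread owns one row, and the loop over columns lives inside the kernel, so the previous column's hidden state is reused from shared memory rather than bouncing through global memory between per-column launches.

#include <cuda_runtime.h>

// Minimal sketch of a fused left-to-right GSPN line scan. Assumes
// contiguous (B, C, H, W) float tensors, a tridiagonal (three-neighbor)
// propagation matrix shared across channels, and H <= 1024 so one block
// spans a full column.
__global__ void gspn_scan_lr(const float* __restrict__ x,    // input   (B, C, H, W)
                             const float* __restrict__ lam,  // gates   (B, C, H, W)
                             const float* __restrict__ w,    // weights (B, 3, H, W), channel-shared
                             float*       __restrict__ h,    // output  (B, C, H, W)
                             int C, int H, int W) {
    extern __shared__ float h_prev[];            // hidden state of column j-1
    const int r = threadIdx.x;                   // row owned by this thread
    const int b = blockIdx.x / C;                // batch index
    const int c = blockIdx.x % C;                // channel index

    const size_t plane = (size_t)H * W;
    const float* xs = x   + ((size_t)b * C + c) * plane;
    const float* ls = lam + ((size_t)b * C + c) * plane;
    const float* ws = w   + (size_t)b * 3 * plane;
    float*       hs = h   + ((size_t)b * C + c) * plane;

    h_prev[r] = 0.0f;                            // zero initial state
    __syncthreads();

    for (int j = 0; j < W; ++j) {                // inner loop over columns
        const size_t idx = (size_t)r * W + j;
        // Three-neighbor (tridiagonal) aggregation from the previous column.
        float acc = ws[idx]             * (r > 0     ? h_prev[r - 1] : 0.0f)
                  + ws[plane + idx]     *              h_prev[r]
                  + ws[2 * plane + idx] * (r < H - 1 ? h_prev[r + 1] : 0.0f);
        float hv = ls[idx] * xs[idx] + acc;      // h_j = lambda * x_j + A_j h_{j-1}
        hs[idx] = hv;
        __syncthreads();                         // all reads of h_prev done
        h_prev[r] = hv;                          // publish for column j+1
        __syncthreads();
    }
}

// Launch with exactly H threads per block and one block per (batch, channel)
// slice. The stride-W global accesses are left uncoalesced here for clarity;
// the tile staging that fixes them is sketched in the next section.
//   gspn_scan_lr<<<B * C, H, H * sizeof(float)>>>(x, lam, w, h, C, H, W);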
Pipeline Optimization: GSPN-1 launches separate kernels for each image column, causing redundant global-memory access and limited temporal data reuse. GSPN-2 fuses all operations into a single kernel with an inner loop over columns, enabling register and cache reuse of intermediate states while minimizing memory traffic.
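For contrast, the per-column scheme amounts to a host-side launch loop. This is a hypothetical sketch of the dispatch pattern described above; scan_one_column stands in for GSPN-1's per-column kernel.

// GSPN-1-style dispatch: one kernel launch per image column. Every launch
// re-reads its operands from global memory and writes its column's hidden
// state back, so nothing persists on chip between columns, and launch
// overhead accumulates W times per scan direction.
__global__ void scan_one_column(const float* x, const float* lam,
                                const float* w, float* h,
                                int column, int C, int H, int W);  // body elided

void gspn1_scan_lr(const float* x, const float* lam, const float* w,
                   float* h, int B, int C, int H, int W) {
    for (int j = 0; j < W; ++j)                  // W separate launches
        scan_one_column<<<B * C, H>>>(x, lam, w, h, j, C, H, W);
}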

Step-by-Step CUDA Optimization

We benchmark a typical configuration (1024×1024 image size, batch size 16, 8 channels) and quantify the impact of each CUDA kernel optimization. Starting from the GSPN-1 baseline (71.4 ms), we systematically apply system-level optimizations to fully leverage modern GPU architectures. Our optimizations compound as follows:

Left: Step-by-step optimization of the GSPN CUDA kernel. Starting from GSPN-1 baseline (71.4 ms), we apply cumulative system optimizations: (1) Single fused kernel eliminates thousands of micro-launches (1.2× speedup → 57.4 ms); (2) Coalesced memory access maximizes bandwidth utilization (23.9× improvement → 2.4 ms); (3) Shared memory cache for hidden states reduces global memory traffic (1.1× → 2.2 ms); (4) 2D thread blocks improve thread organization and data locality (1.1× → 2.1 ms); (5) Compressive channels reduce parameter fetch overhead (1.1× → 1.9 ms). The fully optimized GSPN-2 achieves a cumulative 40.0× speedup (1.8 ms). Right: Memory throughput comparison across typical input configurations on A100 GPU. The system optimizations in GSPN-2 enable near-theoretical peak global-memory bandwidth (91.5%–93.3% efficiency) across diverse batch sizes, spatial resolutions, and channel counts, representing a substantial improvement over the initial GSPN-1 implementation.
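The largest single win above comes from step (2). The issue and the standard remedy can be sketched as follows; the tile shape and kernel name are illustrative assumptions, not GSPN-2's actual code. In a contiguous (B, C, H, W) layout a scan's columns are stride-W apart, so a warp reading one column touches 32 different cache lines; staging square tiles through shared memory with row-major loads restores one-transaction-per-warp access.

#include <cuda_runtime.h>

__global__ void load_tile_coalesced(const float* __restrict__ x,
                                    float* __restrict__ out,
                                    int H, int W) {
    // One 32x32 tile per block; threadIdx.x walks the contiguous W axis,
    // so each warp loads one row of the tile in a single wide transaction.
    __shared__ float tile[32][33];               // 33 floats per row sidesteps bank conflicts
    int row = blockIdx.y * 32 + threadIdx.y;
    int col = blockIdx.x * 32 + threadIdx.x;
    if (row < H && col < W)
        tile[threadIdx.y][threadIdx.x] = x[(size_t)row * W + col];   // coalesced load
    __syncthreads();
    // A column-oriented consumer (the line scan) would now walk tile[:][j]
    // for j = 0..31 out of shared memory, where the stride-W penalty of
    // global memory does not apply, before advancing to the next 32-column
    // tile. Scan arithmetic omitted; see the previous sketch.
    if (row < H && col < W)
        out[(size_t)row * W + col] = tile[threadIdx.y][threadIdx.x]; // coalesced store
}

// Launch example (tile sizes are illustrative):
//   dim3 block(32, 32);
//   dim3 grid((W + 31) / 32, (H + 31) / 32);
//   load_tile_coalesced<<<grid, block>>>(x, out, H, W);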

Runtime Performance


Comprehensive profiling across various input configurations reveals GSPN-2's consistent performance advantages. Speedups become more pronounced at large image sizes and high channel counts, for both forward and backward passes, enabling practical deployment in compute-intensive applications such as real-time video processing and multimodal foundation models.

Runtime Comparison: Forward and backward pass execution times (in milliseconds) across different spatial resolutions, batch sizes, and channel counts. The system-level optimizations in GSPN-2 deliver consistent speedups ranging from 2× to 48× depending on configuration, demonstrating the effectiveness of the algorithmic and kernel co-design approach.

ImageNet Classification


GSPN-2 models incorporate shared propagation weights across all channels with a compressive proxy dimension of 2, allowing saved parameters to be reallocated for deeper or wider network architectures.
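Schematically, in our notation rather than the paper's, the proxy design replaces per-channel propagation with two shared scans:

    \hat{X} = P_{\downarrow} X, \qquad \hat{H} = \mathrm{LineScan}(\hat{X}; A), \qquad Y = P_{\uparrow} \hat{H}

where P_{\downarrow} \in \mathbb{R}^{2 \times C} and P_{\uparrow} \in \mathbb{R}^{C \times 2} are learned projections and the affinity map A is shared by both proxy channels. The number of concurrent scan slices per image thus drops from C to 2, while the surrounding projections restore full channel width.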

GSPN-2-T achieves a competitive 83.0% accuracy with significantly fewer parameters (24M vs. 30M for GSPN-T) and lower computational cost (4.2G vs. 5.3G MACs), outperforming state-space models such as Vim-S (80.5%), VMamba-T (82.2%), and LocalVMamba-T (82.7%) while remaining competitive with leading ConvNets and Transformers.

GSPN-2-S reaches 84.4% accuracy, a +0.6% improvement over GSPN-S (83.8%) with only a marginal increase in MACs (9.2G vs. 9.0G), surpassing strong competitors such as MambaOut-Small (84.1%) and UniFormer-B (83.9%).

GSPN-2-B reaches 84.6% accuracy with a +0.3% improvement over GSPN-B (84.3%) while reducing MACs from 15.9G to 14.2G, demonstrating the benefits of the shared-affinity design and proxy compression strategy.

ImageNet Performance: Comparison of GSPN-2 models against ConvNets, Transformers, and raster-scan (RS) methods at 224² resolution. GSPN-2 matches or exceeds prior approaches while using fewer parameters and lower MACs across Tiny, Small, and Base scales.

Text-to-Image Generation


Building on GSPN-1, GSPN-2 combines the system enhancements above with proxy-dimension compression to 1/8 of the original channel dimension (Cproxy = C/8). Compared to the baseline SDXL, GSPN-2 achieves a 32× speedup in 4K image generation; for ultra-high-resolution 16K images, it reduces inference time by 93× over the same baseline, improving on GSPN-1's 84×.

  • Channel proxy compression keeps the kernel below the GPU concurrency ceiling across high-resolution tiles.
  • Spatial fidelity is preserved through dense, four-direction propagation with shared affinity matrices.
  • Inference benefits translate directly to batchable production pipelines for photorealistic generation.
Qualitative Results: Text-to-image samples generated by our GSPN-2 SDXL model. We enable generation up to 16K resolution on a single A100 GPU while reducing inference time by up to 93× on the SDXL model.

BibTeX

@inproceedings{wang2025gspn2,
    author    = {Wang, Hongjun and Jiang, Yitong and McCarthy, Collin and Wehr, David and Ye, Hanrong and Li, Xinhao and Cheung, Ka Chun and Byeon, Wonmin and Gu, Jinwei and Chen, Ke and Han, Kai and Yin, Hongxu and Molchanov, Pavlo and Kautz, Jan and Liu, Sifei},
    title     = {GSPN-2: Efficient Parallel Sequence Modeling},
    booktitle = {NeurIPS},
    year      = {2025}
}

Acknowledgement

This web page is adapted from the nerfies template. Thanks for their great work.