Visualizing Complex Dynamics In Many-Core Accelerator Architectures

Wilson Fung

Visualizing Complex Dynamics In Many-Core Accelerator Architectures

Wilson Fung

2010, Proceedings of the IEEE …

Sign up for access to the world's latest research

checkGet notified about relevant papers

checkSave papers to use in your research

checkJoin the discussion with peers

checkTrack your impact

Abstract
AI

The increasing trend towards many-core architectures, particularly GPUs, poses unique challenges in understanding and optimizing performance due to their parallel nature and complex dynamic behaviors. This paper explores how traditional performance tuning methods may overlook critical optimization opportunities because they rely on fixed performance metrics. It proposes detailed measurement metrics that can capture dynamic behaviors more effectively, allowing for better performance analysis and optimization of applications using many-core architectures.

Figures (14)

Fig. 1. Example of a many-core accelerator architecture: The CUDA GPU.

Fig. 2. Intermittent DRAM congestion concealed with traditional perfor- mance counter. See Section III-C for more detail.

The rest of this paper is organized as follows. Section II introduces AerialVision and describes its key features in detail. Section III describes the implementation and simulation methodology used in this paper and presents several case studies highlighting the key features of AerialVision. Sec- tion IV discusses related work on performance visualization and performance tuning. Section V concludes the paper.

Fig. 4. Example illustrating basic plot formats provided by time-lapse view. (shown in Figure 3(a)) allows the user to plot several per- formance metrics versus time so they may be compared. The source code view (shown in Figure 3(b)) displays performance statistics associated with individual lines in the application source code which helps guide the user toward bottlenecks in the source code that may require optimization.

Fig. 5. Runtime warp divergence breakdown.

Fig. 6. PC-Histogram of a SM running ray tracing. White = Instruction has not been touched by any thread during the sampling period. Color = One or more threads have touched the instruction during the sampling period.

AerialVision uses debug information (. loc tags) in the PTX assembly code to aggregate the statistics collected for each PTX instruction into its associated statement in the CUDA source code. Table I lists the performance statistics currently available in source code view. AerialVision also provides the ability to visualize the ratios of two different performance Statistics. For example, the ratio of Non-Coalesced DRAM Traffic and Non-Coalesced Warp Count reveals the average number of non-coalesced memory accesses generated by each warp executing a PTX instruction. This feature gives the user the ability to uncover further insights from existing statistics.

PERFORMANCE STATISTICS AVAILABLE IN SOURCE CODE VIEW

Fig. 8. Dynamic IPC of each SM for LIBOR with two different inter- connection arbitration schemes. Top: Baseline Parallel Iterative Matching arbitration. Middle: Hold-Grant arbitration proposed by Yuan et al. to preserve row locality of memory access [44]. Bottom: Hold-Grant with time-out mechanism.

Fig. 10. Dynamic DRAM row activation count of each DRAM channel for BFS with three different address mappings. Top: DRAM channel specified by bits 6 to 8. Middle: DRAM channel specified by bits 8 to 10. Bottom: DRAM channel specified by bits 11 to 13, which shows the best performance of all three mappings. Fig. 9. Intermittent DRAM congestion removed with a different address mapping.

Fig. 11. The average shared memory access cycle of each CUDA source line for two versions of StoreGPU. Top: Initial version unaware of shared memory bank conflicts. Bottom: Revised version with the bank conflicts optimized out.

Nirmal Saxena

Proceedings of the 2023 ACM/SPEC International Conference on Performance Engineering

GPUs have become common in HPC systems to accelerate scientific computing and machine learning applications. Efficiently mapping these applications to rapid evolutions of GPU architectures for high performance is a well-known challenge. Various performance inefficiencies exist in GPU kernels that impede applications from obtaining bare-metal performance. While existing tools are able to measure these inefficiencies, they mostly focus on data collection and presentation, requiring significant manual efforts to understand the root causes for actionable optimization. Thus, we develop DrGPU, a novel profiler that performs top-down analysis to guide GPU code optimization. As its salient feature, DrGPU leverages hardware performance counters available in commodity GPUs to quantify stall cycles, decompose them into various stall reasons, pinpoint root causes, and provide intuitive optimization guidance. With the help of DrGPU, we are able to analyze important GPU benchmarks and applications and obtain nontrivial speedups-up to 1.77× on V100 and 2.03× on GTX 1650. CCS CONCEPTS • Software and its engineering → Compilers; General programming languages; • General and reference → Measurement; Metrics.

Log In

Visualizing Complex Dynamics In Many-Core Accelerator Architectures

Sign up for access to the world's latest research

AbstractAI

Related papers

Related topics

Abstract
AI