Workshop
Livestreamed
Recorded
TP
W
Description: Traditionally, research in physics and engineering involves formulating and solving partial differential equations—a task that can take years to master. However, by rethinking these problems as N-body simulations, students can engage with them using only a freshman-level understanding of mathematics, physics, and computer science—along with a fast computer and vivid imagination. In the past, such simulations required expensive, high-performance machines. But with the power of modern GPUs, a gaming desktop can now perform meaningful N-body computations, making advanced, independent research accessible to undergraduate students.
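For a sense of how little machinery such a simulation needs, the sketch below is a minimal direct-summation N-body step in NumPy; the particle count, time step, softening, and initial conditions are illustrative assumptions rather than workshop material, and swapping NumPy for CuPy is one way to move the same arithmetic onto a gaming GPU.

```python
import numpy as np

# Minimal direct-summation N-body step (O(N^2) pairwise gravity).
# G, dt, softening eps, and the random initial conditions are illustrative choices.
G, dt, eps = 1.0, 1e-3, 1e-2
n = 1024
pos = np.random.randn(n, 3)
vel = np.zeros((n, 3))
mass = np.ones(n)

def accelerations(pos, mass):
    # Pairwise displacement vectors r_ij = p_j - p_i, shape (n, n, 3).
    diff = pos[None, :, :] - pos[:, None, :]
    dist2 = (diff ** 2).sum(axis=-1) + eps ** 2
    inv_d3 = dist2 ** -1.5
    np.fill_diagonal(inv_d3, 0.0)          # no self-interaction
    return G * (diff * (mass[None, :, None] * inv_d3[:, :, None])).sum(axis=1)

# One leapfrog (kick-drift-kick) step.
acc = accelerations(pos, mass)
vel += 0.5 * dt * acc
pos += dt * vel
vel += 0.5 * dt * accelerations(pos, mass)
```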
Workshop
Not Livestreamed
Not Recorded
Partially Livestreamed
Partially Recorded
TP
W
Description: Digital twins are emerging as a transformative concept for the design, operation, and optimisation of complex scientific instruments. In this plenary, I will present recent progress on integrating in-situ visualization and interactive steering into the open-source accelerator simulation framework OPAL. The motivation is rooted in the way modern particle accelerators are operated—where real-time monitoring, adaptive decision-making, and iterative optimisation are indispensable.
By coupling high-fidelity simulations with in-situ analysis and steering, we demonstrate a new paradigm: optimisation during simulation. This approach reduces turnaround time, enables exploration of vast parameter spaces, and creates opportunities for interactive workflows that mirror experimental practice. Beyond accelerator science, these developments resonate strongly with other domains that rely on large-scale digital twins, such as fusion energy, materials science, and climate modelling.
I will discuss the architectural and HPC challenges of enabling such capabilities, including integration with exascale systems, data movement constraints, and scalability considerations. I will conclude with a perspective on how in-situ visualization and steering can reshape scientific computing more broadly, supporting interactive, real-time science at unprecedented scales.
Exhibitor Forum
Quantum & Other Post Moore Computing Technologies
Software Tools
Livestreamed
Recorded
TP
XO/EX
Description: The subgraph isomorphism problem—finding pattern graphs within larger data graphs—is central to many HPC applications but remains computationally challenging due to its NP-complete nature. Traditional algorithms rely on backtracking strategies that resist effective parallelization, limiting their utility on GPU architectures and large CPUs.
Motivated by quantum circuit compilation challenges, we developed a fundamentally different approach that reformulates subgraph isomorphism by representing graphs in unified tabular formats, decomposing them into fundamental building blocks called motifs, and expressing the algorithm using standard tabular operations like filters and merges. This approach, implemented through NVIDIA RAPIDS, enables massive parallelization using NVIDIA GPUs.
Our approach achieves speedups of up to 595x on a single NVIDIA H200 GPU, with many benchmarks exceeding 100x, while democratizing high-performance graph processing. Practitioners can now leverage familiar data science tools and open-source libraries to achieve efficient parallel graph analysis, making advanced subgraph isomorphism accessible to a broader community.
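The abstract gives no code, but the core idea, expressing motif matching as filters and merges over edge tables, can be sketched with pandas (cuDF exposes a near-identical GPU-backed API). The toy graph, column names, and triangle motif below are assumptions for illustration only, and duplicate orientations of each match are not removed.

```python
import pandas as pd

# Data graph as an edge table; with cuDF the same merges run on a GPU.
edges = pd.DataFrame({"src": [0, 1, 2, 2, 3], "dst": [1, 2, 0, 3, 0]})

# Treat edges as undirected by adding the reverse direction.
und = pd.concat([edges, edges.rename(columns={"src": "dst", "dst": "src"})],
                ignore_index=True)

# Motif: a triangle a-b-c. Each merge extends partial matches by one edge,
# and the final merge closes the cycle.
ab = und.rename(columns={"src": "a", "dst": "b"})
bc = und.rename(columns={"src": "b", "dst": "c"})
ca = und.rename(columns={"src": "c", "dst": "a"})

paths = ab.merge(bc, on="b").query("a != c")   # open wedges a-b-c
triangles = paths.merge(ca, on=["c", "a"])     # close the cycle
print(triangles[["a", "b", "c"]])
```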
SCinet Network Research Exhibition
Not Livestreamed
Not Recorded
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
Description: Get the first look at Packet Power’s newest innovation, the High-Density Power Monitor. At just 3 inches, it’s the smallest and most scalable multi-circuit power monitoring system on the market, capable of tracking 120 circuits in a space smaller than what’s inside a standard light switch. Whether managing a single rack or thousands of devices, Packet Power ensures monitoring 1 device is as easy as monitoring 1,000.
In this session, Packet Power’s Founder & CTO, Paul Bieganski, will be giving the first introduction to this small & mighty sensor that is redefining what’s possible in energy monitoring. The High-Density Power Monitor eliminates bulky hardware, complex wiring, and lengthy installations. It’s plug-and-play simple, seamlessly integrates with Packet Power’s EMX software or any third-party monitoring platform, and supports both wired and wireless connectivity—including secure, air-gapped environments.
Today’s computing environments are experiencing an energy density arms race—with systems consuming megawatts of power in a single cabinet. New cooling methods, extreme power densities, and evolving form factors demand monitoring solutions that can keep up. Packet Power’s new High-Density Power Monitor meets that challenge head-on, offering the scalability, adaptability, and visibility needed to manage energy use in the AI era.
Join us for this informative session, or visit PacketPower.com to learn more.
Monitoring Made Easy®
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
Description: Get the first look at Packet Power’s newest innovation, the High-Density Power Monitor. At just three inches, it’s the smallest and most scalable multi-circuit power monitoring system on the market, capable of tracking 120 circuits in a space smaller than what’s inside a standard light switch. Whether managing a single rack or thousands of devices, Packet Power ensures monitoring one device is as easy as monitoring 1,000.
In this session, Packet Power’s founder and CTO, Paul Bieganski, will be giving the first introduction to this small and mighty sensor that is redefining what’s possible in energy monitoring. The High-Density Power Monitor eliminates bulky hardware, complex wiring, and lengthy installations. It’s plug-and-play simple, seamlessly integrates with Packet Power’s EMX software or any third-party monitoring platform, and supports both wired and wireless connectivity—including secure, air-gapped environments.
Today’s computing environments are experiencing an energy density arms race—with systems consuming megawatts of power in a single cabinet. New cooling methods, extreme power densities, and evolving form factors demand monitoring solutions that can keep up. Packet Power’s new High-Density Power Monitor meets that challenge head-on, offering the scalability, adaptability, and visibility needed to manage energy use in the AI era.
Join us for this informative session, or visit PacketPower.com to learn more.
Monitoring Made Easy®
Workshop
Livestreamed
Recorded
TP
W
Description: With increasing levels of parallelism, increased heterogeneity, and energy and memory constraints, coupled with emerging post-Moore computing architectures such as quantum and neuromorphic systems, a reevaluation of current approaches for extreme-scale operating systems and runtime environments is needed.
ROSS is a workshop aimed at identifying looming problems and discussing promising research solutions in the area of runtime and operating systems for extreme-scale supercomputers. Specifically, ROSS focuses on principles and techniques to design, implement, optimize, or operate runtime and operating systems for extreme-scale supercomputers and cloud environments. In addition to typical workshop publications, we encourage novel and possibly immature ideas, provided that they are interesting and on-topic. Well-argued position papers are also welcome.
Workshop
Performance Evaluation, Scalability, & Portability
Livestreamed
Recorded
TP
W
Description: The Performance, Portability, and Productivity in HPC workshop aims to bring together developers and researchers with an interest in practical solutions, technologies, tools, and methodologies that enable the development of performance-portable applications across a diverse set of current and future high-performance computers.
The topic of Performance, Portability, and Productivity focuses on enabling applications and libraries to run across multiple architectures without significant impact on achieved performance and with the goal of maintaining developer productivity. This workshop provides a forum for discussions of successes and failures in tackling the compelling problems that lie at the intersection of performance, portability, and productivity in high-performance computing. This area touches on many aspects of HPC/AI software development and the workshop program is expected to reflect a wide range of experiences and perspectives, including those of compiler, language and runtime experts; applications developers and performance engineers; and domain scientists.
For more information see: https://p3hpc.org/workshop/2025/
Workshop
Livestreamed
Recorded
TP
W
Description: Scientific workflows have underpinned some of the most significant discoveries of the past several decades. Workflow management systems (WMS) provide abstraction and automation that enable researchers to easily define sophisticated computational processes, and to then execute them efficiently on parallel and distributed computing systems. As workflows have been adopted by multiple scientific communities, they are becoming more complex and require more sophisticated workflow management capabilities.
This workshop focuses on the many facets of scientific workflow composition, management, sustainability, and application to domain sciences in an increasingly diverse landscape. The workshop covers a broad range of topics in the scientific workflow lifecycle that include: reproducible research with workflows; workflow execution in distributed and heterogeneous environments; application of AI/ML in workflow management; workflow provenance; serverless workflows; exascale computing with workflows; stream-processing, interactive, adaptive and data-driven workflows; workflow scheduling and resource management; workflow fault-tolerance, debugging, performance analysis/modeling; big data and AI workflows, etc.
Birds of a Feather
Performance Measurement, Modeling, & Tools
Livestreamed
Recorded
TP
XO/EX
Description: Data-intensive supercomputer applications are increasingly important workloads, especially for “Big Data” problems, but are ill-suited for most of today’s computing platforms (at any scale!). The Graph500 list has grown to over 273 entries and has demonstrated the challenges of even simple analytics. The SSSP kernel introduced at SC17 has increased the benchmark’s overall difficulty. This BoF will unveil the latest Graph500 lists, provide in-depth analysis of the kernels and machines, and enhance the new energy metrics of the Green Graph500. It will offer a forum for the community and provide a rallying point for data-intensive supercomputing problems.
Workshop
Livestreamed
Recorded
TP
W
Description: Understanding program behavior is critical to overcome the expected architectural and programming complexities, such as limited power budgets, heterogeneity, hierarchical memories, shrinking I/O bandwidths, and performance variability, that arise on modern HPC platforms. To do so, HPC software developers need intuitive support tools for debugging, performance measurement, analysis, and tuning of large-scale HPC applications. Moreover, data collected from these tools, such as hardware counters, communication traces, and network traffic, can be far too large and too complex to be analyzed in a straightforward manner. We need new automatic analysis and visualization approaches to help application developers intuitively understand the multiple, interdependent effects that algorithmic choices have on application correctness or performance.
The ProTools workshop brings together HPC application developers, tool developers, and researchers from the visualization, performance, and program analysis fields for an exchange of new approaches to assist developers in analyzing, understanding, and optimizing programs for extreme-scale platforms.
Workshop
Livestreamed
Recorded
TP
W
Description: Ensuring correctness in HPC applications is one of the fundamental challenges that the HPC community faces today. While significant advances in verification, testing, and debugging have been made to isolate software defects in the context of non-HPC software, several factors make achieving correctness in HPC applications and systems much more challenging than in general systems software: growing heterogeneity (CPUs, GPUs, and special purpose accelerators), massive scale computations, use of combined parallel programming models (e.g., MPI+X), new scalable numerical algorithms (e.g., to leverage reduced precision in floating-point arithmetic), and aggressive compiler optimizations/transformations are some of the challenges that make correctness harder in HPC. As the complexity of future architectures, algorithms, and applications increases, the ability to fully exploit exascale systems will be limited without correctness. The goal of this workshop is to bring together researchers and developers to present and discuss novel ideas to address the problem of correctness in HPC.
Workshop
Livestreamed
Recorded
TP
W
Description: Optimizing cache efficiency is critical for mitigating the performance gap between CPUs and memory systems. However, the interaction between application data access patterns and the cache hierarchy complicates debugging cache performance. Existing profiling approaches can identify program performance bottlenecks but leave root-cause diagnosis of cache inefficiencies to programmer expertise. This paper presents an innovative approach that not only precisely captures where cache inefficiencies exist but also helps identify why they occur. Our methodology captures the causal relationship of cache interactions among array variables by tracking cache line evictions, while simultaneously measuring both temporal and spatial locality and classifying cache miss types. We use a new visualization technique (a Cache Interaction Graph) to correlate these metrics with detected interaction patterns to identify root causes and appropriate optimizations. Our Cachegrind extension provides variable-level data locality analysis with acceptable overhead. Our real-world cases demonstrate the effectiveness of our approach.
Workshop
Livestreamed
Recorded
TP
W
Description: We present a framework for developing compute graph-based applications targeting the AI Engine (AIE) array of AMD Versal SoCs. This framework enables users to embed AIE-based dataflow graph prototypes directly within existing C++ applications and automatically transform them into deployable AIE graph projects. It thereby eliminates the need to manually separate host and accelerator codebases, as required by the standard AMD Vitis workflow. The framework comprises two core components: (1) a compute graph simulation library that can be linked into existing C++ programs, and (2) a Clang-based source-to-source translator that extracts simulator-defined graphs and prepares them for compilation with AMD’s AIE toolchain. We evaluate our framework using AMD’s official example graphs and show that our generated AIE code achieves performance comparable to hand-optimized Vitis implementations. Additionally, we demonstrate how C++ compile-time code execution can be leveraged to simplify the implementation of source-to-source translation and static source analysis.
Workshop
AI, Machine Learning, & Deep Learning
Clouds & Distributed Computing
Performance Evaluation, Scalability, & Portability
Scientific & Information Visualization
Partially Livestreamed
Partially Recorded
TP
W
Description: Efficient parallel I/O is essential for large-scale scientific workflows, particularly in in situ visualization pipelines where output sizes and data distribution can vary significantly over time.
We present a new data-size-based aggregation strategy for the ADIOS2 BP5 engine, designed to improve parallel I/O performance under load imbalance. Whereas existing ADIOS aggregation strategies group writers based on compute node assignment, our approach dynamically balances subfile sizes according to the amount of data each process will write. We evaluate this strategy using a synthetic workload under low and severe load imbalance. Results show that the data-size-based aggregation matches or outperforms existing strategies. These findings highlight the potential of adaptive aggregation strategies to improve I/O performance for imbalanced scientific workloads.
Workshop
Livestreamed
Recorded
TP
W
Description: Mojo is a novel programming language, to be open-sourced by 2026, that closes performance gaps in the Python ecosystem. We present an initial look at its performance-portable GPU capabilities (available since June 2025) for four science workloads: the memory-bound BabelStream and seven-point stencil, and the compute-bound miniBUDE and Hartree-Fock (including atomic operations). Results indicate that memory-bound kernels are on par with NVIDIA’s CUDA on H100 and AMD’s HIP on MI300A GPUs, while gaps remain on compute-bound kernels. Thus, Mojo proposes unifying AI workflows by combining Python interoperability at run time with MLIR-compiled, performance-portable code.
Best Poster Presentations (Research, ACM SRC Grad/Undergrad)
Research & ACM SRC Posters
TP
Description: Modern high performance computing increasingly relies on hardware accelerators like NVIDIA Tensor Cores, which employ non-standard internal arithmetic that can evolve between hardware generations. This non-standard approach can violate the fundamental mathematical property of monotonicity, leading to incorrect outputs where adding a larger number produces a smaller result. To address this, we introduce a formal framework using satisfiability modulo theories to analyze the hardware design space by systematically varying hardware features (e.g., number of terms n, internal padding bits p) within a custom bitvector encoding. We derive a precise condition for guaranteed monotonicity, proving that non-monotonicity can only occur when p ≤ ⌊log₂(n − 1) − 2⌋. We also derive a formula for the maximum magnitude of error when non-monotonicity can occur. Our results provide hardware architects with provably correct design parameters to eliminate such anomalies, ensuring greater numerical stability.
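For concreteness, reading the stated bound exactly as written, an n = 16-term accumulation gives:

```latex
p \le \left\lfloor \log_2(n-1) - 2 \right\rfloor
  = \left\lfloor \log_2 15 - 2 \right\rfloor
  = \lfloor 1.91 \rfloor
  = 1,
```

so, under that reading, two or more internal padding bits would rule out non-monotonic results for a 16-term accumulator.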
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
Description: Quantum computing has emerged as a transformative technology capable of solving complex problems beyond classical systems' limits. However, present-day quantum systems face critical bottlenecks, including limited qubit counts, brief coherence intervals, and high error susceptibility, which obstruct the execution of large, complex circuits. The rapid development of quantum processors has led to the proliferation of cloud-based quantum computing services from platforms like IBM, Google, and Amazon, introducing unique challenges in resource allocation, job scheduling, and multi-device orchestration as quantum workloads increase in complexity.
We present a comprehensive digital twin framework for quantum cloud infrastructures designed to model and simulate real quantum cloud systems while addressing distributed computing challenges. Developed in Python using the SimPy discrete-event simulation library, our framework replicates key aspects of quantum cloud environments including detailed quantum device modeling, job lifecycle management, and noise-aware fidelity estimation—making it the first to simulate superconducting gate-based quantum cloud systems at an administrative level with job fidelity. The framework supports distributed scheduling and concurrent execution of quantum jobs on networked quantum processors (QPUs) connected via real-time classical channels. It models circuit decomposition for workloads exceeding individual QPU limits, enabling parallel execution through inter-processor communication. We evaluate four distinct scheduling techniques, including a reinforcement learning-informed model, across metrics including runtime efficiency, fidelity preservation, and communication costs. Our analysis demonstrates how parallelized, noise-aware scheduling can improve computational throughput in distributed quantum infrastructures, providing proof-of-concept that our quantum cloud simulation framework can effectively serve as a digital twin for modeling and implementing practical quantum systems.
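The framework itself is not reproduced in the abstract; the snippet below is only a toy illustration of the discrete-event style it describes, with quantum jobs queueing for a small pool of QPUs in SimPy. All names, arrival rates, runtimes, and the QPU count are invented for the example.

```python
import random
import simpy

# Toy digital-twin skeleton: quantum jobs compete for a pool of QPUs.
# Arrival rate, execution times, and the number of QPUs are made-up parameters.
def job(env, name, qpus):
    with qpus.request() as slot:
        yield slot                             # wait for a free QPU
        exec_time = random.uniform(1.0, 5.0)   # stand-in for circuit runtime
        yield env.timeout(exec_time)
        print(f"{name} finished at t={env.now:.2f}")

def workload(env, qpus):
    for i in range(10):
        env.process(job(env, f"job-{i}", qpus))
        yield env.timeout(random.expovariate(1.0))  # stochastic arrivals

env = simpy.Environment()
qpus = simpy.Resource(env, capacity=3)   # three networked QPUs
env.process(workload(env, qpus))
env.run()
```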
Panel
Architectures
Cloud, Data Center, & Distributed Computing
Livestreamed
Recorded
TP
Description: HPC has leveraged open-source software for many years, and indeed many of our activities rely upon these technologies. However, key parts of our ecosystem remain closed, including vendor-specific tooling and the underlying hardware that HPC relies on. Recent advances, and culture shifts in the industry, mean that we are now at a stage where an entirely open and community-driven HPC ecosystem, built upon an open technology stack and open standards, is a more realistic proposition than ever before. Benefits of this include providing transparency and community involvement in key technologies, and indeed initiatives such as RISC-V mean that openness can permeate much deeper down the stack.
The time is now to push towards a fully open, community-driven HPC ecosystem, and this panel will explore the opportunities and challenges associated with such an ambition, and how we as a community can encourage progress towards this goal.
Workshop
Livestreamed
Recorded
TP
W
Description: Approximate and low-precision computing are essential for modern applications, and effectively leveraging available precision options can deliver substantial gains in performance and energy efficiency.
We focus on the Fast Fourier Transform (FFT), a representative function used in scientific computing, and propose a wrapper library to exploit these options.
Using multiple GPU-accelerated FFT libraries, we observe that different libraries excel in different regions of the performance–accuracy space and that these sweet spots depend on transform size and input content.
Guided by these insights, we propose a framework that selects the best kernel (library and precision) on the fly to minimize runtime or energy while satisfying a specified error threshold.
A lightweight machine learning model predicts per-kernel error at runtime from sampled input features. Experiments show over 98% selection accuracy and mean speedups exceeding 40% compared to a double precision baseline.
The framework integrates seamlessly with existing workflows as a wrapper library.
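The authors' kernel catalogue and learned error model are not shown in the abstract; the sketch below only illustrates the wrapper idea with SciPy's FFT, probing a single-precision transform on a slice of the input and falling back to double precision when the estimated error exceeds the caller's tolerance. The probe size, tolerance, and two-kernel menu are assumptions for illustration.

```python
import numpy as np
from scipy import fft as sfft

def fft_auto(x, rel_err_tol=1e-5, probe=256):
    # Toy precision-selecting FFT wrapper (illustrative only): estimate the
    # error of a single-precision transform on a small probe of the input,
    # then pick the cheapest kernel that meets the tolerance. A real framework
    # would use a learned error model and several libraries/precisions.
    sample = np.asarray(x[:probe], dtype=np.complex128)
    ref = sfft.fft(sample)                                   # double-precision reference
    approx = sfft.fft(sample.astype(np.complex64)).astype(np.complex128)
    est = np.linalg.norm(approx - ref) / max(np.linalg.norm(ref), 1e-300)

    if est <= rel_err_tol:
        return sfft.fft(np.asarray(x, dtype=np.complex64))   # fast, lower precision
    return sfft.fft(np.asarray(x, dtype=np.complex128))      # accurate path

y = fft_auto(np.random.rand(1 << 20), rel_err_tol=1e-4)
```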
Workshop
Livestreamed
Recorded
TP
W
Description: Random sketching is a dimensionality reduction technique that approximately preserves norms and singular values up to some O(1) distortion factor with high probability. The most popular sketches in the literature are the Gaussian sketch and the subsampled randomized Hadamard transform, while the CountSketch has lower complexity. Combining two sketches, known as multisketching, offers an inexpensive means of quickly reducing the dimension of a matrix by combining a CountSketch and a Gaussian sketch.
However, there has been little investigation into high performance CountSketch implementations. In this work, we develop an efficient GPU implementation of the CountSketch, and demonstrate the performance of multisketching using this technique. We also demonstrate the potential for using this implementation within a multisketched least squares solver that is up to 77% faster than the normal equations with significantly better numerical stability, at the cost of an O(1) multiplicative factor introduced into the relative residual norm.
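The CountSketch itself is compact enough to state in a few lines. A CPU reference in NumPy is shown below as an illustration (the paper's GPU implementation is of course more involved); the sketch sizes and the trailing Gaussian stage are arbitrary choices meant only to show the multisketching pattern.

```python
import numpy as np

def countsketch(A, m, seed=0):
    # CountSketch S A for an n x d matrix A: each row of A is hashed to one of
    # m output rows and multiplied by a random sign, so S has one nonzero per column.
    rng = np.random.default_rng(seed)
    n, _ = A.shape
    rows = rng.integers(0, m, size=n)        # hash bucket for each input row
    signs = rng.choice([-1.0, 1.0], size=n)  # random +/-1 per input row
    SA = np.zeros((m, A.shape[1]))
    np.add.at(SA, rows, signs[:, None] * A)  # scatter-add rows into buckets
    return SA

A = np.random.randn(10_000, 50)
SA = countsketch(A, m=500)       # 10k x 50  ->  500 x 50

# Multisketching: follow the cheap CountSketch with a small dense Gaussian sketch.
G = np.random.default_rng(1).standard_normal((100, 500)) / np.sqrt(100)
GSA = G @ SA                     # combined sketch, 100 x 50
```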
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
Description: Inexpensive DNA sequencing [1] has opened new windows into biological complexity. These include metagenomics: the ability to catalog a microbial ecosystem by extracting and sequencing DNA directly from an environment. Analyzing metagenomic-scale datasets often requires exascale computing. Such computing platforms are heterogeneous, encompassing CPUs, GPUs, FPGAs or other co-processors. This heterogeneity presents complications for software design. The programming models can require code rewrites when hardware changes. Moreover, achieving adequate performance requires understanding the interaction between hardware and programming models. Computational biology codes suffer particularly from these problems because they are poorly studied [2] [3]. Here we describe a proxy application based on a metagenome assembler which allows both machine profiling and studying co-processor behavior for biology codes.
Workshop
Livestreamed
Recorded
TP
W
Description: Increasing workload demands and emerging technologies necessitate the use of various memory and storage tiers in computing systems. This paper presents results from a CXL-based Experimental Memory Request Logger that reveals precise memory access patterns at runtime without interfering with the running workloads. By combining reactive placement based on data address monitoring, proactive data movement, and compiler hints, a Hotness Monitoring Unit (HMU) within memory modules can greatly improve memory tiering solutions. Analysis of page placement using profiled access counts on a Deep Learning Recommendation Model (DLRM) indicates a potential 1.94x speedup over Linux NUMA balancing tiering, and only a 3% slowdown compared to Host-DRAM allocation while offloading over 90% of pages to CXL memory. The study underscores the limitations of existing tiering strategies in terms of coverage and accuracy, and makes a strong case for programmable, device-level telemetry as a scalable and efficient solution for future memory systems.
Workshop
Livestreamed
Recorded
TP
W
Description: This paper presents outcomes and insights from a one-week workshop designed to teach biologists essential skills in high-performance computing (HPC), machine learning (ML), and deep learning (DL). Participants with little or no prior experience with HPC learned how to navigate file systems via a command-line interface, launch jobs with SLURM, and apply ML and DL techniques to real-world biological datasets. Hands-on activities were delivered with accessible technologies such as Jupyter Notebooks, graphical desktop interfaces (DCV), and software containers, all deployed on HPC systems with minimal user setup required. We propose this workshop model as an adaptable framework for training domain scientists how to effectively use HPC resources to advance scientific discovery, and we present survey data demonstrating its effectiveness in improving participant skills.
Workshop
Livestreamed
Recorded
TP
W
Description: High-performance computing (HPC) has become a critical enabler of scientific advancement across many research domains. As the HPC user community continues to expand, there is an increasing need to reduce barriers to entry and improve accessibility for researchers with varying levels of computational expertise. This paper presents a modular, responsive, and user-friendly web-based dashboard built upon the Open OnDemand framework. The system integrates several redesigned functional pages, providing streamlined access to announcements, system status, job information, and resource usage. By consolidating these capabilities into an intuitive interface with a modern design, the dashboard enhances user experience, reduces reliance on command-line tools, and facilitates more efficient interaction with HPC resources.
Paper
Algorithms
Applications
Architectures & Networks
Livestreamed
Recorded
TP
Description: Low-precision computing is essential for efficiently utilizing memory bandwidth and computing cores. While many mixed-precision algorithms have been developed for iterative sparse linear solvers, effectively leveraging half-precision (fp16) arithmetic remains challenging. This study introduces a novel nested Krylov approach that integrates the flexible GMRES and Richardson methods in a deeply nested structure, progressively reducing precision from double-precision to fp16 toward the innermost solver. To avoid meaningless computations beyond precision limits, the low-precision inner solvers perform only a few iterations per invocation, while the nested structure ensures their frequent execution. Numerical experiments show that incorporating fp16 into the approach directly enhances solver performance without compromising convergence, achieving speedups of up to 2.42 and 1.65 over double-precision and double-single mixed-precision implementations, respectively. Furthermore, the proposed method outperforms conventional mixed-precision Krylov solvers, CG, BiCGStab, and restarted FGMRES, by factors of up to 2.47, 2.74, and 69.10, respectively.
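The paper's nested FGMRES/Richardson hierarchy cannot be reconstructed from the abstract alone, but the basic pattern, a few half-precision inner sweeps wrapped in a higher-precision outer correction loop, can be sketched as follows. The test matrix, damping factor, and iteration counts are illustrative assumptions, and plain iterative refinement stands in for the outer flexible Krylov method.

```python
import numpy as np

def richardson_fp16(A16, r16, omega=np.float16(1.0), iters=4):
    # A few damped Richardson sweeps e <- e + omega*(r - A e), entirely in float16.
    e = np.zeros_like(r16)
    for _ in range(iters):
        e = (e + omega * (r16 - A16 @ e)).astype(np.float16)
    return e

# Toy outer/inner split: iterative refinement in float64, with the correction
# equation solved cheaply in float16 by a handful of Richardson sweeps.
rng = np.random.default_rng(0)
n = 200
A = np.eye(n) + 0.01 * rng.standard_normal((n, n))   # well-conditioned test matrix
b = rng.standard_normal(n)
A16 = A.astype(np.float16)

x = np.zeros(n)
for k in range(10):
    r = b - A @ x                                     # residual in double precision
    e = richardson_fp16(A16, r.astype(np.float16))    # cheap half-precision correction
    x += e.astype(np.float64)                         # correction applied in double
    print(k, np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```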
Workshop
Livestreamed
Recorded
TP
W
Description: Simulations based on the structured grid parallel computational pattern are approachable to undergraduate students in a parallel and distributed computing course. I present an assignment using a structured grid to model the growth of mushroom fairy rings. The cells in the grid can be in one of several states, are updated over iterations representing time steps, and change values based on the probability of a state change occurring. This assignment thus goes beyond the simple game-of-life cellular automaton, providing students with the challenge of properly using a random number generator inside parallel loops. Students parallelize a base sequential version using OpenMP, then conduct experiments to determine its strong and weak scalability. Scaffolding for success is provided in the form of additional freely available prerequisite class activities.
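The assignment itself is OpenMP/C, but the pitfall it targets, a single shared random-number generator inside a parallel loop, is language-independent. As a purely illustrative analog, the sketch below uses NumPy's SeedSequence to give each worker its own independent stream, which is the same discipline an OpenMP version needs (one RNG state per thread); the grid size, chunking, and flip probability are made up.

```python
import numpy as np

# One independent, reproducible stream per worker, spawned from a single seed.
n_workers = 8
streams = [np.random.default_rng(s)
           for s in np.random.SeedSequence(42).spawn(n_workers)]

grid = np.zeros((1024, 1024), dtype=np.int8)
chunks = np.array_split(np.arange(grid.shape[0]), n_workers)

for rng, rows in zip(streams, chunks):   # each chunk could run on its own thread
    # Stand-in for the probabilistic per-cell state change in the fairy-ring model.
    flip = rng.random((rows.size, grid.shape[1])) < 0.25
    grid[rows] = np.where(flip, 1, grid[rows])
```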
Workshop
Livestreamed
Recorded
TP
W
Description: An accurate measure of communication performance is a key component of optimizing large-scale high performance computing applications. This paper presents a model for the peak performance of all-to-all communication, in the context of systems composed of a hierarchy of interconnect bandwidths, a common trait of multi-GPU-per-node systems. We demonstrate an application of the model to distributed transposes, such as those encountered in distributed three-dimensional Fast Fourier Transforms. The model is validated on three different network architectures, using a variety of communication libraries, by measuring all-to-all and distributed transpose performance in a pseudo-spectral code for direct numerical simulations of 3D fluid turbulence. Both the model and the validation results provide insights into the impact of fast communication links located lower in the network hierarchy, the expected scaling for all-to-all bound problems, and performance considerations when selecting slab (1D) or pencil (2D) domain decompositions.
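The paper's exact model is not given in the abstract; a generic peak-bandwidth form for a two-level hierarchy that captures the same idea, for P = N_nodes × g ranks with g GPUs per node and each rank exchanging v bytes with every other rank, is:

```latex
T_{\text{all-to-all}} \;\approx\; \max\!\left(
    \frac{v\,(g-1)}{B_{\text{intra}}},\;
    \frac{v\,g\,(P-g)}{B_{\text{inter}}}
\right),
```

where B_intra is the per-GPU intra-node bandwidth and B_inter the per-node injection bandwidth; the model actually validated in the paper may differ in detail.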
Workshop
Livestreamed
Recorded
TP
W
Description: Quantum computing has the potential to transform computational problem-solving by leveraging quantum mechanical principles of superposition and entanglement. This capability is particularly important for the numerical solution of complex and/or multidimensional partial differential equations (PDEs). The existing quantum PDE solvers, particularly those based on variational quantum algorithms (VQAs), suffer from limitations such as low accuracy, high execution times, and low scalability. In this work, we propose an efficient and scalable algorithm for solving multidimensional PDEs. We present two variants of our algorithm: the first leverages the finite-difference method (FDM), classical-to-quantum (C2Q) encoding, and numerical instantiation, while the second employs FDM, C2Q, and column-by-column decomposition (CCD). We have validated our proposed algorithm by solving several practically useful PDEs such as the Poisson, heat, Black-Scholes, and Navier-Stokes equations. Our results demonstrate higher accuracy, higher scalability, and faster execution times compared to VQA-based solvers on noise-free and noisy quantum simulators from IBM, and we achieved promising results on real quantum hardware.
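For background, the finite-difference step such a pipeline starts from is the standard one; e.g., for the one-dimensional Poisson problem -u''(x) = f(x) on a uniform grid of spacing h, the second-order central difference yields the linear system that the C2Q encoding then maps onto qubits:

```latex
\frac{-u_{i-1} + 2u_i - u_{i+1}}{h^2} = f_i, \qquad i = 1, \dots, N.
```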
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Description: A Pulse in Space by Amy Karle (2025) is a sister work to A Pulse in the Stream, extending the human–environment–data cycle beyond Earth to compose a living “heartbeat” of the planet in dialogue with space and time. The image shown is a data-driven visualization in process: concentric point fields and linework orbit a central void, modulating in amplitude and phase as streams of geophysical and human signals converge. Rhythms derived from Earth-system activity (e.g., environmental and space-weather indices) and opt-in human heartbeats are mapped to wave interference, particle advection, and harmonic envelopes, producing a toroidal aura: an imprint of expanding intelligence at this moment in time.
Conceptually, the work frames Earth as a co-creative instrument in a larger field of intelligence: what we sense, compute, and transmit shapes the informational ecosystem that will persist beyond us. By staging sensing → simulation → visualization across Earth/near-space contexts, A Pulse in Space proposes an interface between terrestrial life and the larger cosmos, inviting reflection on authorship, responsibility, and stewardship of information that may echo across space and time. Planned deployments include the International Space Station (late 2025-26) and a lunar mission (2026), carrying excerpts of the pulse as a cultural and temporal marker. www.amykarle.com
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Description: A Pulse in the Stream 流息脉动 by Amy Karle 艾米·卡尔 (2025) is an interactive, large-scale artwork that visualizes the metabolism of an AI data center as public art. The work integrates real-time inputs (environmental conditions, building processes, server activity, and human heartbeats) into a GPU-accelerated generative system that animates the full façade of the Beijing Digital Economy AIDC in China. Data from sensing streams (e.g., power load, thermal flux, airflow, exterior weather, particulate levels, and participatory heart-rate sensors) compose evolving fields, waves, and harmonics that depict the physics of information flow. Generative linework “breathes” with the site, revealing otherwise invisible currents of energy, information, and human presence as a coherent, ever-changing flow. Perceptible pulses and synchronized “heartbeats” appear when on-site participants opt in with biometric sensors, layering and accreting into continually transforming visuals. As we generate and contribute to this data stream, we transform it, and in turn it transforms us, reminding us that at the convergence of humans, machines, nature, and information, we stand at the threshold of an ever-expanding future, and we have a hand in shaping it.
By staging sensing → simulation → in-situ visualization in public space, the piece recasts the data center as an interface for collective intelligence, where the ethics and aesthetics of computation become visible and where our choices become the code that writes tomorrow.
www.amykarle.com
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
Description: Quantum computing is becoming increasingly transformational for computational problem-solving. This capability appears particularly suited for the numerical solution of multidimensional partial differential equations (PDEs). Although many quantum techniques are currently available for solving PDEs, these algorithms, particularly the ones based on variational quantum algorithms (VQAs), suffer from low accuracy, high execution times, and low scalability. In this work, we propose an efficient and scalable algorithm targeting multidimensional PDEs. We present two variants of the algorithm that differ in how the final quantum circuit is generated. While both utilize the finite-difference method (FDM) and classical-to-quantum (C2Q) encoding as the initial steps, the first variant uses numerical instantiation and the second uses column-by-column decomposition (CCD) for quantum circuit synthesis. Our proposed algorithm has been validated by various case studies such as the Poisson, heat, Black-Scholes, and Navier-Stokes equations. The results demonstrate better accuracy and scalability with faster execution times compared to VQA-based solvers on noise-free and noisy quantum simulators, and promising results on real quantum hardware.
Workshop
Livestreamed
Recorded
TP
W
Description: Modern scientific discovery increasingly integrates simulations, data, and AI models. Existing systems rarely let scientists compose expressive queries that retrieve multi-modal datasets and invoke complex simulations or AI inferences. We introduce the Intelligent Data Search (IDS) framework to bridge this gap. IDS extends the Cray Graph Engine to provide a scalable in-memory datastore (feature, vector, and knowledge graph), a unified query engine combining keyword, set-theoretic, and linear-algebraic operators, a model repository for UDFs and pre-trained AI models, and a distributed multi-tier cache for intermediate and simulation outputs. We evaluate IDS on a life-sciences workflow with the NCNPR, integrating AlphaFold, AutoDock Vina, and Smith–Waterman within a single query. Results show strong HPC scaling, a complex “what-could-be” query executing millions of searches and thousands of inferences in seconds, and 5–15× end-to-end speedup from caching. IDS empowers scientists to ask and iterate model-driven “what-if” questions over petascale data with minimal latency.
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Description: Global view of super-Eddington accretion onto a spinning stellar-mass black hole. The black hole is surrounded by a geometrically thick, dense accretion disk of hot plasma, mostly supported by radiation pressure against gravity. This puffy disk forms a spiraling structure as it falls inward and emits X-ray radiation. Relativistic jets, driven by the interaction between electromagnetic fields and black hole spin, are launched from the inner polar region and can propagate over extreme distances. The system resembles ultraluminous X-ray sources (ULXs), which are thought to be powered by such super-Eddington disks.
Awards and Award Talks
Livestreamed
Recorded
TP
Description: This talk opens with a brief retrospective on the speaker’s path through computer architecture and HPC—highlighting the lessons from early work on energy-efficient computing, co-design campaigns for exascale/petascale computing, procurement models anchored in real scientific workflows, and cross-disciplinary work that now shape his outlook. Against that backdrop, he argues that energy continues to be the dominant design constraint, stalling a decade of once-relentless performance gains in HPC and creating an energy crisis in hyperscale AI.
Leveraging the latest technology roadmaps, the speaker updates what “post-exascale” systems are likely to look like and why the historical drivers of improvement are becoming more challenging to sustain. He then reframes co-design: from narrow kernel–hardware tuning toward an end-to-end, materials-to-systems and workflow-aware practice that couples advanced packaging with targeted architectural specialization. Finally, he lays out strategies for continued progress without the aid of the old scaling playbook—focusing on energy proportionality, domain-specific acceleration, and evolving the definition of co-design to exploit the mathematical structure of the algorithms being accelerated.
He also lays out opportunities for deeper collaboration with our colleagues in hyperscale computing by leveraging their supply chain and through organizations such as the Open Compute Project. The thesis is that sustained performance growth will come from pragmatic, cross-layer co-design and smaller, workflow-effective systems rather than ever-larger monolithic machines.
Workshop
Education & Workforce Development
Livestreamed
Recorded
TP
W
Description: The Centre for High Performance Computing (CHPC), South Africa’s national supercomputing facility, launched the Student Cluster Competition (SCC) in 2012 to build HPC awareness and skills among undergraduates. Twenty teams of four students began with an intensive week of training in Linux, cluster design, and system administration. Finalists advanced to a live challenge with self-built clusters, and the top teams represented South Africa at the International SCC at ISC in Germany. From the outset, the program emphasized diversity and inclusion, recruiting from historically disadvantaged communities and addressing key knowledge gaps in HPC system design, administration, and optimization. This rapid teaching model enabled students with little prior exposure to achieve international success: South African teams placed in the top three globally for seven consecutive years. This paper details the competition’s structure, training methods, and outcomes, highlighting its evolution into a recognized platform for inclusive HPC education and global competitiveness.
Workshop
Livestreamed
Recorded
TP
W
Description: Multi-word arithmetic plays a critical role in high-performance computing (HPC) as it enables arithmetic on operands exceeding a processor’s native word size. For example, many cryptographic kernels, such as number theoretic transform, rely on multi-word arithmetic to compute log₂(q)-bit integer arithmetic, accelerating mod-q polynomial multiplication in post-quantum cryptography. To mitigate carry-propagation bottlenecks in multi-word arithmetic, prior work proposed code-generation approaches targeting GPUs and domain-specific accelerators (DSAs) with native large-integer support. However, GPU-based approaches tend to be less energy-efficient, while DSA designs incur non-trivial non-recurring engineering. Therefore, our work evaluates the potential for RISC-V in HPC and explores multi-word arithmetic using RISC-V. We propose a general modeling-based multi-word extension on the RISC-V Vector (RVV) ISA. Furthermore, we develop comprehensive performance models to analyze performance consistency across host vector processing systems with diverse microarchitectural configurations. Our work demonstrates that targeted architectural extensions can further saturate the pipeline by enhancing RVV’s carry-propagation support.
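The core notion, multi-word arithmetic with an explicit carry chain, can be made concrete with a tiny limb-based addition. The sketch below is a generic illustration (word size and operand widths are arbitrary), not the paper's RVV code, which vectorizes exactly this carry propagation in hardware.

```python
# Schoolbook multi-word (multi-limb) addition: each limb is one native word,
# and the carry out of limb i feeds limb i+1, which is the chain the paper's
# RVV extensions aim to keep inside the vector pipeline.
WORD_BITS = 64
MASK = (1 << WORD_BITS) - 1

def multiword_add(a_limbs, b_limbs):
    out, carry = [], 0
    for a, b in zip(a_limbs, b_limbs):         # least-significant limb first
        s = a + b + carry
        out.append(s & MASK)                   # low word of the partial sum
        carry = s >> WORD_BITS                 # carry into the next limb
    out.append(carry)
    return out

def to_limbs(x, n):
    return [(x >> (WORD_BITS * i)) & MASK for i in range(n)]

a, b = (1 << 250) - 1, 12345                   # ~256-bit operands, 4 limbs each
limbs = multiword_add(to_limbs(a, 4), to_limbs(b, 4))
assert sum(l << (WORD_BITS * i) for i, l in enumerate(limbs)) == a + b
```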
Paper
HPC for Machine Learning
Programming Frameworks
Livestreamed
Recorded
TP
Description: Dynamic-shape tensor computation poses challenges for shape-specific compilation due to variable input dimensions. Existing compilers rely on shape samples, incurring high tuning costs and degraded performance on unseen inputs.
We present Helix, a dynamic tensor framework with sample-free and architecture-guided compilation for compilation efficiency and shape-general performance. To avoid shape sampling, Helix constructs shape-agnostic compilation by decomposing computations across architectural layers. A bidirectional strategy combines top-down abstraction, aligning tensor computations with architectural hierarchies, and bottom-up kernel construction, building efficient execution strategies from reusable, architecture-aligned micro-kernels. A hybrid analyzer ensures accuracy through profiling at lower architectural levels, and achieves scalability through architecture-informed modeling at higher levels and runtime.
This hierarchical design eliminates shape-specific tuning and enables shape-adaptive execution. Evaluations on x86 CPUs, ARM CPUs, and NVIDIA GPUs demonstrate that Helix reduces compilation time by 174x over existing compilers and delivers 2.26x and 3.29x speedups over vendor libraries and dynamic-shape compilers, respectively.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
Description: Quantum computing promises exponential speedups over classical computing by leveraging quantum-mechanical properties like superposition and entanglement. As quantum algorithms grow in complexity, classical simulation remains essential for evaluating correctness, scalability, and resource demands. This work focuses on studying the scalability of structured quantum algorithms such as the Quantum Haar Transform (QHT), usually used for reducing data dimensionality in signal/image processing and remote-sensing hyperspectral imagery. We simulate QHT circuits on high performance computing (HPC) systems by constructing unitary models that mirror the transform’s hierarchical decomposition. Simulations track performance metrics such as circuit width, circuit depth, and execution time. Our results provide insight into the practical implementation of structured quantum circuits and serve as a reference for validating algorithmic correctness and guiding future quantum algorithm design.
Workshop
Livestreamed
Recorded
TP
W
DescriptionHPC administrators and staff scientists are often responsible for installing and managing scientific software for users—a task complicated by complex dependencies, non-privileged environments, and the need for relocatable installations. Containers simplify this process, and many labs now distribute software as container images. However, this approach requires end users to learn how to run containers.
Here, I present a simple method for installing and managing containerized software while keeping the container approach transparent to users. It requires no special tooling aside from Apptainer and has been used successfully for years at the National Institutes of Health. The method’s benefits and limitations will be discussed.
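As a hedged illustration of how such transparency can be achieved in general (not necessarily the exact method presented in this talk), a site can place a thin wrapper on users' PATH that re-executes the requested tool inside an Apptainer image. The image path, bind mounts, and tool name below are hypothetical.

```python
#!/usr/bin/env python3
# Hypothetical wrapper installed on users' PATH as, e.g., "samtools". It re-executes
# the requested tool inside an Apptainer image, so users never invoke the container
# runtime themselves. Image path and bind mounts are site-specific assumptions.
import os
import sys

IMAGE = "/opt/containers/samtools-1.20.sif"   # assumed image location
TOOL = os.path.basename(sys.argv[0])          # the wrapper's name doubles as the tool name

cmd = ["apptainer", "exec", "--bind", "/data,/scratch", IMAGE, TOOL, *sys.argv[1:]]
os.execvp(cmd[0], cmd)                        # replace this process; the exit code passes through
```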
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe convergence of quantum computing and high-performance computing (HPC) is giving rise to hybrid cloud platforms that coordinate workflows across classical and quantum resources. These heterogeneous environments introduce new challenges for workload management, including device variability, noise-prone quantum processors, and the need for synchronized orchestration across distributed systems. This survey reviews state-of-the-art techniques for scheduling, resource allocation, and workflow management in classical HPC, quantum clouds, and emerging hybrid infrastructures. A taxonomy is proposed to categorize workload management approaches by scheduling strategy, allocation model, orchestration mechanism, and performance metric. We present HybridCloudSim, a novel simulation framework designed to model hybrid quantum–HPC cloud environments and optimize resource usage under realistic constraints. HybridCloudSim serves as both a conceptual foundation and a practical tool for advancing workload management in hybrid quantum–HPC systems.
Workshop
Livestreamed
Recorded
TP
W
DescriptionNeuromorphic computing is a popular technology for the future of computing. Much of the research and development in neuromorphic computing has focused on new architectures, devices, and materials rather than on the software, algorithms, and applications of these systems. In this talk, I will overview the field of neuromorphic computing with a particular focus on challenges and opportunities in using neuromorphic computers as co-processors, as well as the current state of software for neuromorphic computing hardware systems. I will discuss neuromorphic applications for both machine learning and non-machine learning use cases.
Paper
Architectures & Networks
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionDataflow accelerators can provide energy-efficient and high-performance alternatives to current popular architectures. However, little work has been done to enable accelerator-initiated, scalable collective communication for these architectures. We develop a high-level synthesis (HLS) interface to bridge this gap through software-hardware co-design. Given the tendency of dataflow applications to use reads and writes to streams to express data transfer, we develop a streaming interface implementing fine-grained transfers to the host processor. Data can then be communicated through MPI and transferred to the receiving accelerator. As a result, the interface uses few hardware resources for communication. As a case study, we enhance the HPL benchmark from HPCC_FPGA with our contributions. We evaluate our final design on up to 16 FPGAs, achieving up to 18% improvement in application throughput and 36% reduced latency in kernel execution. Additionally, we design a stencil benchmark that showcases superlinear speedup in a strong scaling scenario.
Workshop
Livestreamed
Recorded
TP
W
DescriptionIn contrast to CUDA, SYCL is a portable programming model for various hardware accelerators. In this paper, we study the performance portability of low-bit fused general matrix-vector multiplication kernels in SYCL on vendors’ graphics processing units (GPUs). We introduce the use case, explain the kernel implementations in detail, evaluate the performance of the CUDA, HIP, and SYCL kernels on datacenter, desktop, and laptop GPUs, and investigate the causes of performance gaps. We find that loop unrolling, kernel dispatch overhead, and sum reduction contribute to the gaps. We hope that the findings provide valuable feedback for the development of the SYCL ecosystem.
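For readers unfamiliar with the kernel being studied, the sketch below is a plain NumPy reference of what a low-bit fused dequantize-and-GEMV computes. It is not the paper's SYCL/CUDA/HIP implementation; for clarity each 4-bit code occupies a full byte here, whereas a real kernel packs two codes per byte and fuses the dequantization into the dot-product loop.

```python
import numpy as np

def fused_int4_gemv(codes, scales, x, group_size=32):
    """Reference (unoptimized) fused dequantize + GEMV.

    codes  : (rows, cols) int8 array of 4-bit weight codes in [0, 15] (one per byte for clarity)
    scales : (rows, cols // group_size) per-group dequantization scales
    x      : (cols,) activation vector
    """
    w = codes.astype(np.float32) - 8.0                     # recentre 4-bit codes to [-8, 7]
    rows, cols = w.shape
    w = w.reshape(rows, cols // group_size, group_size) * scales[:, :, None]
    return (w.reshape(rows, cols) @ x).astype(np.float32)  # sum reduction over columns

# Toy usage on random data.
rng = np.random.default_rng(0)
W = rng.integers(0, 16, size=(8, 64), dtype=np.int8)
s = rng.random((8, 2), dtype=np.float32)
x = rng.random(64, dtype=np.float32)
y = fused_int4_gemv(W, s, x, group_size=32)
assert y.shape == (8,)
```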
Exhibitor Forum
Data Analytics
Livestreamed
Recorded
TP
XO/EX
DescriptionOrganizations deploying LLM inference often face critical decisions about hardware procurement, software stack selection, and deployment configurations. Today, these decisions are frequently made through ad-hoc testing, which consumes significant GPU resources and often leads to suboptimal outcomes. The diversity of deployment environments, model architectures, inference frameworks, and accelerator hardware makes exhaustive benchmarking impractical.
FMwork is a systematic benchmarking methodology that addresses this challenge by narrowing both the input configuration space and the output metrics space to focus on the most informative parameters and indicators. This targeted approach accelerates evaluation, reduces resource waste, and enables consistent, reproducible comparisons across platforms.
In a representative study, FMwork achieved over an order-of-magnitude reduction in total benchmarking time compared to a naïve exhaustive sweep, while capturing the key trends needed for deployment decisions across NVIDIA, AMD, and Intel GPUs. By providing an open, extensible framework, FMwork benefits the broader HPC and AI community through more efficient, sustainable performance evaluation.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionEfficient load balancing is critical for the scalability of distributed scientific applications. However, applications face several challenges in testing new balancing strategies, including the need for an easy workflow to validate different algorithms. This work tackles that particular challenge by presenting a toolbox designed to streamline the collection and analysis of load balancing data from WarpX, an advanced, fully kinetic particle-in-cell (PIC) code based on the AMReX framework. Our toolbox simplifies the extraction of data from WarpX and enables developers to conduct statistical load balancing inferences over real data efficiently. We demonstrate this applicability with a study of a laser-ion acceleration simulation: we collect simulation data and compare six load balancing approaches, two in-production algorithms and four under investigation for future use.
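To make the kind of analysis such a toolbox enables concrete, the following is a minimal, hypothetical sketch: given per-box cost data (for example, particle counts per box), it computes a common imbalance metric and a toy greedy, knapsack-style assignment of boxes to ranks. The data layout and the greedy heuristic are illustrative assumptions, not the toolbox's actual algorithms.

```python
import numpy as np

def load_imbalance(loads):
    """Common imbalance metric: max load / mean load (1.0 means perfectly balanced)."""
    loads = np.asarray(loads, dtype=float)
    return loads.max() / loads.mean()

def greedy_knapsack_assign(box_costs, n_ranks):
    """Toy knapsack-style balancer: assign the heaviest remaining box to the lightest rank."""
    loads = np.zeros(n_ranks)
    assignment = {}
    for box in sorted(range(len(box_costs)), key=lambda b: -box_costs[b]):
        r = int(loads.argmin())
        assignment[box] = r
        loads[r] += box_costs[box]
    return assignment, load_imbalance(loads)

# Hypothetical per-box costs gathered from a PIC run (e.g., particles per box).
costs = [120, 80, 300, 45, 160, 210, 90, 75]
mapping, imbalance = greedy_knapsack_assign(costs, n_ranks=4)
print(mapping, round(imbalance, 3))
```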
Workshop
Livestreamed
Recorded
TP
W
DescriptionDrug response prediction is a promising approach for applying machine learning to the development of drugs for a range of cancer types. This method can be used to pre-screen potential drugs, perform high-throughput screening of drug databases, or perform more generalized tasks in machine learning. In an idealized real-world clinical situation, the overall solution must produce a short list of the most promising drugs for a particular patient medical situation. Promising drugs for a given case, however, are very rare, making strong model performance in this space very difficult to achieve. Thus, a great deal of supporting infrastructure must be developed to make this possible, including obtaining and curating datasets, large cross-validation training studies, and post-training inference and analysis. Herein, we describe a new approach for dealing with the rare drug problem and implement a portable workflow that explores one proposed strategy for addressing it, with results from the exascale supercomputer Aurora.
ACM Gordon Bell Finalist
Awards and Award Talks
Applications
Livestreamed
Recorded
TP
DescriptionDesigning nanoscale electronic devices, such as the currently manufactured nanoribbon field-effect transistors (NRFETs), requires advanced modeling tools capturing all relevant quantum mechanical effects. State-of-the-art approaches combine the non-equilibrium Green's function (NEGF) formalism and density functional theory (DFT). However, as device dimensions do not exceed a few nanometers anymore, electrons are confined in ultra-small volumes, giving rise to strong electron-electron interactions. To account for these critical effects, DFT+NEGF solvers should be extended with the GW approximation, which massively increases their computational intensity. Here, we present the first implementation of the NEGF+GW scheme capable of handling NRFET geometries with dimensions comparable to experiments. This package, called QuaTrEx, makes use of a novel spatial domain decomposition scheme; can treat devices made of up to 84,480 atoms; scales very well on the Alps and Frontier supercomputers (>80% weak scaling efficiency); and sustains an exascale FP64 performance on 42,240 atoms (1.15 Eflop/s).
Awards and Award Talks
Livestreamed
Recorded
TP
DescriptionScientific computing has evolved through successive layers of abstraction—from machine code to workflows, from manual scheduling to autonomous orchestration. Each innovation focused on making complexity manageable while preserving scientific intent. The Pegasus Workflow Management System was built in this tradition: translating high-level scientific descriptions into efficient and reliable execution across diverse high-performance and distributed systems.
Today, as artificial intelligence reshapes both systems and science, systems are no longer just executing—they are reasoning, predicting, and adapting. Scientific discovery is enhanced with automation throughout the scientific lifecycle: from hypothesis generation to result interpretation and publication.
This enhanced automation opens transformative possibilities for new discoveries and technological advances. However, it also requires us to confront deeper questions about transparency, trust, and the role of human judgment in an increasingly automated world.
This talk reflects on the enduring design principles that have sustained Pegasus through decades of technological change, explores how AI is redefining the balance between abstraction and understanding, and emphasizes the need for critical thinking and creativity while scientists are increasingly relying on cognitive automation.
Tutorial
Livestreamed
Recorded
TUT
DescriptionAccelerated quantum supercomputing (AQSC) tightly integrates quantum computing with classical accelerated supercomputing via low-latency interconnects. This is crucial for hybrid quantum-classical workflows, enabling scalable quantum algorithms, real-time quantum error correction (QEC), and fast feedback control.
Participants will gain hands-on experience by building hybrid applications using the Python API of CUDA-Q, NVIDIA’s open-source development platform that unifies QPU, CPU, and GPU compute. The primary focus is on scalable hybrid algorithms like the generative quantum eigensolver (GQE), emphasizing AI integration and parallelization. Live demonstrations on NERSC’s Perlmutter supercomputer and Infleqtion’s Sqale neutral-atom QPU will showcase GPU-accelerated workflows. Practical examples include GPU-accelerated decoders and the demonstration of logical qubits using VQE for a material science application. Notebooks for advanced participants will cover algorithms like contextual machine learning (CML), QAOA-GPT, and Auxiliary-Field Quantum Monte Carlo (AFQMC).
Participants will leave with practical skills in building hybrid applications, an understanding of performance-critical AQSC components, and familiarity with emerging techniques in scalable quantum algorithm design. Dedicated compute on Perlmutter and Infleqtion's Sqale simulator/hardware will be provided. The tutorial content will be made public a week before the tutorial: https://github.com/NERSC/SC25-quantum-tutorial.
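For readers new to CUDA-Q, the following tiny example, written against the documented CUDA-Q Python kernel syntax, shows the general flavor of the programming model the tutorial builds on: a GHZ-state kernel sampled on the default simulated target. It is a minimal sketch, not tutorial material, and the qubit count and shot count are arbitrary.

```python
import cudaq

@cudaq.kernel
def ghz(n: int):
    qubits = cudaq.qvector(n)
    h(qubits[0])                          # put the first qubit in superposition
    for i in range(1, n):
        x.ctrl(qubits[i - 1], qubits[i])  # entangle the chain with controlled-X gates
    mz(qubits)                            # measure all qubits

# Sample on the default target; the same kernel can be retargeted to GPU-accelerated
# simulators or hardware backends via cudaq.set_target(...).
counts = cudaq.sample(ghz, 4, shots_count=1000)
print(counts)                             # expect roughly half '0000' and half '1111'
```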
Paper
Applications
Architectures & Networks
HPC for Machine Learning
Livestreamed
Recorded
TP
DescriptionMultivariate Gaussian processes (GPs) offer a powerful probabilistic framework to represent complex interdependent phenomena. They pose, however, significant computational challenges in high-dimensional settings, which frequently arise in spatio-temporal applications. We present DALIA, a highly scalable framework for performing Bayesian inference tasks on spatio-temporal multivariate GPs, based on the methodology of integrated nested Laplace approximations. Our approach relies on a sparse inverse covariance matrix formulation of the GP, puts forward a GPU-accelerated block-dense approach, and introduces a hierarchical, triple-layer, distributed-memory parallel scheme. We showcase weak-scaling performance surpassing the state of the art by two orders of magnitude on a model whose parameter space is 8x larger and measure strong-scaling speedups of three orders of magnitude when running on 496 GH200 superchips on the Alps supercomputer. Applying DALIA to an air pollution study over northern Italy spanning 48 days, we showcase refined spatial resolutions over the aggregated pollutant measurements.
Workshop
Livestreamed
Recorded
TP
W
DescriptionSynchrotron light sources support a wide array of techniques to investigate materials, often producing complex, high-volume data that challenge traditional workflows. At the Advanced Light Source (ALS), we developed infrastructure to move microtomography data over ESnet to ALCF and NERSC, where CPU- and GPU-based algorithms generate 3D reconstructed volumes of experimental samples. We employ two data movement and reconstruction models: real-time processing as data streams directly to NERSC compute nodes, and automated file transfer to NERSC and ALCF file systems. The streaming pipeline provides users with feedback in under ten seconds, while the file-based workflow produces high-quality reconstructions suitable for deeper analysis in 20-30 minutes. This infrastructure allows users to leverage HPC resources without direct access to backend systems. We plan to extend this architecture to more endstations, supporting our beamline scientists and users.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionWe present MOFAI, an agentic AI scientist coupled with high performance computing (HPC) resources for the generation and property prediction of metal-organic frameworks (MOFs). MOFAI employs autonomous agents to enable tool-calling for linker generation, MOF assembly, molecular dynamics simulation, and deep learning-based property prediction in an asynchronous and distributed fashion. MOFAI demonstrates success in leveraging multi-node computation to progressively discover stable MOFs with high CO2 adsorption by learning from past successful MOFs. This seamless fusion of agentic reasoning with HPC demonstrates a new paradigm for automated scientific discovery, where AI scientists dynamically coordinate computationally intensive, domain-specific tools.
Tutorial
Livestreamed
Recorded
TUT
DescriptionPython is powering breakthrough exascale scientific discoveries—come learn how to program the world's largest supercomputers with it! In this interactive tutorial you’ll learn how to write, debug, profile, and optimize high-performance, multi-node GPU applications in Python. You'll learn and master: CuPy for drop-in GPU acceleration of NumPy workflows; Numba for writing custom kernels that match the performance of C++ and Fortran; and mpi4py for scaling across thousands of nodes. Along the way we’ll learn how to profile our code, debug tricky kernels, and leverage foundational and domain-specific accelerated libraries. Everything is hands-on: short and interactive lectures by expert instructors will be paired with guided Jupyter Notebooks that introduce each concept and have you immediately apply it in bite-sized exercises (2D heat equation, SVD image compression, Mandelbrot, etc.). The labs culminate in porting the miniWeather mini-app from serial Python to a hybrid MPI + GPU implementation. We'll use a web-based environment that requires no installs or setup—just a laptop and a browser. Whether you’re a domain scientist seeking faster turnaround or a software engineer evaluating portable acceleration strategies, you’ll leave with a roadmap, skills, and code for bringing scalable Python practices back to your own HPC facility.
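As a taste of the stack the tutorial covers, the sketch below combines CuPy (drop-in GPU arrays) with mpi4py (multi-rank scaling) to integrate sin(x) over [0, π] in parallel. It is a minimal illustration under the assumption that each rank has a GPU available, not an excerpt from the tutorial's notebooks.

```python
# Launch with, e.g., `mpirun -n 4 python demo.py` on GPU nodes.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank integrates its slice of sin(x) on [0, pi] on the GPU (midpoint rule).
n = 10_000_000
a = rank * cp.pi / size
b = (rank + 1) * cp.pi / size
dx = (b - a) / n
x = cp.linspace(a + dx / 2, b - dx / 2, n)
local = float(cp.sum(cp.sin(x)) * dx)            # device computation; scalar copied to host

total = comm.allreduce(local, op=MPI.SUM)        # combine partial integrals across ranks
if rank == 0:
    print(f"integral ≈ {total:.6f} (exact: 2.0)")
```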
Exhibitor Forum
Software Tools
Livestreamed
Recorded
TP
XO/EX
DescriptionThis talk demonstrates how intuitive, cross-platform tools can accelerate time-to-science. We present two real-world case studies. The first case study demonstrates how Linaro DDT was used by VSC to instantly pinpoint an "illegal memory access" error that would have been cumbersome to find without adequate tooling. The second case study focuses on how Linaro MAP was used to speed up the Fire Dynamics Simulator (FDS) application by pinpointing a performance issue that developers of the code had overlooked. We present best practices and performance methodologies that prevent performance regressions and narrow down bugs at scale, which saves invaluable research time.
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
DescriptionThe rapid growth of multimodal data from large-scale simulations and experimental instruments is overwhelming traditional storage and analysis workflows. Post hoc, disk-based methods suffer from latency, bandwidth bottlenecks, and inefficient resource use, slowing scientific insight. This work explores a hybrid in-situ and in-transit framework that embeds computation within the memory and storage hierarchy of HPC systems. In-situ processing performs filtering, reduction, or analysis directly at the data source using node-local memory and accelerators. In-transit processing complements this by leveraging intermediate layers such as burst buffers or dedicated resources for asynchronous analytics, balancing simulation and analysis.
Our architecture integrates Apache Ignite’s in-memory data grid with Apache Spark’s distributed computing and containerized microservices to enable real-time ingestion, fusion, and ML-driven analysis. Our preliminary results show reduced latency, efficient CPU–memory utilization, and strong scalability. Case studies on NWChem molecular dynamics and E3SM climate simulations demonstrate adaptability across domains, advancing data-aware, exascale-class discovery.
Workshop
Livestreamed
Recorded
TP
W
DescriptionAlthough originally developed primarily for artificial intelligence workloads, RISC-V-based accelerators are also emerging as attractive platforms for high-performance scientific computing. In this work, we present our approach to accelerating an astrophysical N-body code on the RISC-V-based Wormhole n300 card developed by Tenstorrent. Our results show that this platform can be highly competitive for astrophysical simulations employing this class of algorithms, delivering more than a 2x speedup and approximately 2x energy savings compared to a highly optimized CPU implementation of the same code.
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionAs simulation, modeling, and engineering AI workloads grow in scale and complexity, accelerating time to results is crucial for innovation and market competition. Learn how Altair and Siemens’ simulation solutions, including uFX and StarCCM+, achieve high performance with next-generation NVIDIA GPUs such as Blackwell and Hopper, on Oracle Cloud Infrastructure running Altair’s HPCWorks management platform.
Workshop
Livestreamed
Recorded
TP
W
DescriptionOptimizing GPU-to-GPU communication is a key challenge for improving performance in MPI-based HPC applications, especially when utilizing multiple communication paths. This paper presents a novel performance model for intra-node multi-path GPU communication within the MPI+UCX framework, aimed at determining the optimal configuration for distributing a single P2P communication across multiple paths. By considering factors such as link bandwidth, pipeline overhead, and stream synchronization, the model identifies an efficient path distribution strategy, reducing communication overhead and maximizing throughput. Through extensive experiments on various topologies, we demonstrate that our model accurately finds theoretically optimal configurations, achieving significant improvements in performance, with an average error of less than 6% in predicting the optimal configuration for very large messages.
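To illustrate the flavor of such a model (not the paper's actual formulation), the sketch below splits one message across two paths, charges each path a fixed overhead plus a bandwidth term, and brute-forces the split fraction that minimizes the slower path's completion time. The bandwidth and overhead numbers are hypothetical.

```python
def path_time(bytes_on_path, bandwidth, overhead):
    """Time on one path: fixed pipeline/synchronization overhead plus transfer time."""
    return overhead + bytes_on_path / bandwidth if bytes_on_path > 0 else 0.0

def best_two_path_split(msg_bytes, bw, ovh, steps=1000):
    """Brute-force the fraction sent over path 0; completion time is the max over both paths."""
    best = None
    for i in range(steps + 1):
        f = i / steps
        t = max(path_time(f * msg_bytes, bw[0], ovh[0]),
                path_time((1 - f) * msg_bytes, bw[1], ovh[1]))
        if best is None or t < best[1]:
            best = (f, t)
    return best

# Hypothetical numbers: a 200 GB/s NVLink-like path and a 25 GB/s PCIe-like path.
frac, t = best_two_path_split(256 * 2**20, bw=[200e9, 25e9], ovh=[5e-6, 8e-6])
print(f"send {frac:.2%} over path 0, predicted time {t * 1e6:.1f} µs")
```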
Best Poster Presentations (Research, ACM SRC Grad/Undergrad)
Research & ACM SRC Posters
TP
DescriptionThe Cholesky decomposition is a critical performance bottleneck in engineering simulations. To accelerate these simulations, we present a novel, nested recursive Cholesky algorithm implemented in Julia. The algorithm restructures the problem into recursive TRSM (triangular solve) and SYRK (symmetric rank-k update) sub-problems, maximizing the use of GEMM (general matrix-matrix multiply) operations that are highly efficient on GPUs. This approach leverages a custom recursive data structure that enables layered, mixed-precision arithmetic on modern NVIDIA H200 GPUs. By strategically using fast, low-precision FP16 computations on large, off-diagonal matrix blocks via Tensor Cores, while preserving high precision on the critical diagonal blocks, we achieve a speedup of 5.32x over the standard cuSOLVER FP64 implementation. This method is 100x more accurate than a pure FP16 approach while retaining over 88% of its speedup. Our work demonstrates a practical path to significantly reducing computation time for large-scale scientific problems with minimal accuracy loss.
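The recursive TRSM/SYRK/GEMM structure can be seen in the uniform-precision NumPy/SciPy sketch below. The poster's implementation is in Julia with layered mixed precision on GPUs, so this is only a conceptual illustration; the block size and test matrix are arbitrary.

```python
import numpy as np
from scipy.linalg import solve_triangular

def rchol(A, block=64):
    """Recursive Cholesky: return lower-triangular L with A = L @ L.T (A symmetric positive definite)."""
    n = A.shape[0]
    if n <= block:
        return np.linalg.cholesky(A)
    k = n // 2
    L11 = rchol(A[:k, :k], block)                              # recurse on the leading block
    L21 = solve_triangular(L11, A[k:, :k].T, lower=True).T     # TRSM: L21 = A21 @ inv(L11).T
    S22 = A[k:, k:] - L21 @ L21.T                              # SYRK/GEMM update of the trailing block
    L22 = rchol(S22, block)
    L = np.zeros_like(A)
    L[:k, :k], L[k:, :k], L[k:, k:] = L11, L21, L22
    return L

rng = np.random.default_rng(1)
M = rng.standard_normal((256, 256))
A = M @ M.T + 256 * np.eye(256)          # comfortably positive definite test matrix
L = rchol(A)
assert np.allclose(L @ L.T, A, atol=1e-8)
```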
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionThe optimization of computation kernels is central to high performance computing, directly impacting applications from scientific computing to artificial intelligence (AI). In experimental workflows with high-throughput or streaming data, software-only execution often becomes a bottleneck, motivating custom hardware accelerators. Field-programmable gate arrays and application-specific integrated circuits excel at these workloads by exploiting parallelism, pipelining, and low latency. Yet, mapping optimized kernels to hardware with high-level synthesis (HLS) requires significant manual effort. To address this, we propose a large language model (LLM)-driven optimization approach. Our method leverages the MLIR compiler infrastructure and modern LLMs’ capability to synthesize code to create tailored optimization strategies for hardware targets through HLS. This approach achieved 2.7x speedup for an electron energy loss spectroscopy autoencoder model targeting the Virtex UltraScale+. These results show that LLM-driven optimization offers a low-effort, high-performance alternative to manual workflows, paving the way for agentic AI in compilers and high performance computing.
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
DescriptionSparse tensor contractions (SpTC) are a bottleneck for several algorithms in scientific computing, data science, artificial intelligence, and graphics. The SpTC operation is any expression of the form R(l0, l1, r0) = X(l0, l1, c0) * Y(r0, c0), where two tensors are multiplied along shared dimensions to form a multidimensional result. Sparse tensor networks extend this problem to more than two inputs. This thesis aims to accelerate both the SpTC primitive and sparse tensor networks with multiple SpTC terms. We develop kernels and IR optimizations to improve code generation for sparse tensor networks.
To generate efficient code for a sparse tensor network, several interdependent optimizations must be made on the intermediate representation (IR). These include the sparse tensor mode order and loop fusion to reduce intermediate tensors. Correctness requirements impose constraints on these variables.
We develop CoNST, a code-generator that co-optimizes these variables. An integer constraint system is solved by the Z3 SMT solver and the result lowers to a unique fused loop structure and tensor mode layouts for the entire contraction tree. CoNST outperforms state-of-the-art compilers by orders of magnitude in run-time.
To accelerate the SpTC operation, we perform the first analysis of data-access costs and memory requirements for loop orders. We develop FaSTCC, a hash-based parallel implementation of the SpTC operation that uses the fastest loop order with minimal memory overhead. FaSTCC introduces a new 2D tiled contraction-index-outer scheme and a corresponding tile-aware design. It outperforms previous state-of-the-art by 2-5x on up to 64 CPU threads.
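To make the index structure of the SpTC expression above concrete, the following dense NumPy analogue contracts the shared mode c0 with einsum. It ignores sparsity entirely and is only meant to clarify which modes are external and which are contracted; the dimension sizes are arbitrary.

```python
import numpy as np

# Dense analogue of R(l0, l1, r0) = X(l0, l1, c0) * Y(r0, c0):
# l0, l1 are "left" external modes, r0 is a "right" external mode, c0 is contracted.
L0, L1, R0, C0 = 4, 5, 6, 7
X = np.random.rand(L0, L1, C0)
Y = np.random.rand(R0, C0)

R = np.einsum("abc,dc->abd", X, Y)       # sum over the contracted mode c0
assert R.shape == (L0, L1, R0)

# In the sparse setting X and Y store only nonzeros, and the thesis's contribution is
# choosing mode orders, loop orders, and fusion so this contraction (and chains of them)
# avoids materializing large dense intermediates.
```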
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
DescriptionGraphics Processing Units (GPUs) have become essential for scientific data analysis, yet they remain constrained by traditional I/O architectures that rely on data movement initiated by the CPU. While recent GPU-initiated I/O systems like BaM and GeminiFS partially address this limitation, they do not support access to complex serialized data formats such as HDF5, NetCDF, and ADIOS within GPU kernels. These formats are ubiquitous in scientific computing but would require prohibitive reimplementation of existing I/O libraries for direct GPU access.
This work explores a hybrid approach that enables GPU kernels to access serialized data formats through GPU-initiated I/O transfers to a specialized CPU runtime. Our design preserves the rich functionality of existing data format ecosystems while enabling GPU kernels to perform I/O. Our evaluations demonstrate minimal overhead compared to the traditional CPU-initiated approach. As future work, we are exploring reimplementation of I/O libraries to bypass the CPU runtime when possible.
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Paper
Architectures & Networks
BP
Livestreamed
Recorded
TP
DescriptionWhile traditional datacenters rely on static, electrically switched fabrics, Optical Circuit Switch (OCS)-enabled reconfigurable networks offer dynamic bandwidth allocation and lower power consumption. This work introduces a quantitative framework for evaluating reconfigurable networks in large-scale AI systems, guiding the adoption of various OCS and link technologies by analyzing trade-offs in reconfiguration latency, link bandwidth provisioning, and OCS placement. Using this framework, we develop two in-workload reconfiguration strategies and propose an OCS-enabled, multi-dimensional all-to-all topology that supports hybrid parallelism with improved energy efficiency. Our evaluation demonstrates that with state-of-the-art per-GPU bandwidth, the optimal in-workload strategy achieves up to 2.3x improvement over the commonly used one-shot approach when reconfiguration latency is low (<100μs). However, with sufficiently high bandwidth, one-shot reconfiguration can achieve comparable performance without requiring in-workload reconfiguration. Additionally, our proposed topology improves performance–power efficiency, achieving up to 1.75x better trade-offs than Fat-Tree and 3D-Torus–based OCS network architectures.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe present an automated framework for online task scheduling on heterogeneous distributed systems, building on a modular parametric scheduler that enables dynamic scheduling decisions based on evolving execution states. Inspired by classical list-scheduling strategies such as HEFT and CPoP, our online scheduler simulates real-time task scheduling using only partial task graph knowledge. We evaluate our online scheduler variants against both their traditional offline baselines and a naive online strategy using a large-scale benchmark suite of real-world scientific workflows. Experimental results across different estimation methods and compute-to-communication ratio (CCR) settings show that our adaptive online schedulers consistently outperform the naive approach, achieving performance within approximately 3-5% of an ideal offline scheduler that has full future knowledge (compared to the approximately 10% overhead for the naive baseline).
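As background, the family of list-scheduling heuristics the framework builds on (HEFT, CPoP) greedily places each ready task on the processor that minimizes its earliest finish time. The sketch below is a deliberately simplified offline version of that idea on a toy DAG, not the paper's online scheduler; all task, cost, and communication values are made up.

```python
def eft_schedule(tasks, deps, cost, comm):
    """Greedy earliest-finish-time scheduling of a DAG onto heterogeneous processors.

    tasks: topologically ordered task ids
    deps:  dict task -> list of predecessor tasks
    cost:  dict (task, proc) -> execution time on that processor
    comm:  dict (pred, task) -> transfer time (charged only across processors)
    """
    n_procs = len({p for (_, p) in cost})
    proc_free = [0.0] * n_procs
    finish, placed = {}, {}
    for t in tasks:
        best = None
        for p in range(n_procs):
            ready = max([finish[d] + (comm.get((d, t), 0.0) if placed[d] != p else 0.0)
                         for d in deps.get(t, [])] or [0.0])
            f = max(ready, proc_free[p]) + cost[(t, p)]
            if best is None or f < best[0]:
                best = (f, p)
        finish[t], placed[t] = best
        proc_free[placed[t]] = finish[t]
    return placed, max(finish.values())   # mapping and makespan

# Tiny fork-join DAG on two processors with heterogeneous costs.
tasks = ["a", "b", "c", "d"]
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
cost = {("a", 0): 2, ("a", 1): 3, ("b", 0): 4, ("b", 1): 2,
        ("c", 0): 3, ("c", 1): 5, ("d", 0): 2, ("d", 1): 2}
comm = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 1, ("c", "d"): 1}
print(eft_schedule(tasks, deps, cost, comm))
```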
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe realization of real-time data processing near X-ray detectors presents ongoing challenges due to long ASIC development cycles and the limited computational capacity of near-detector FPGAs. We propose a hybrid solution that streams data directly to a deterministic tensor processing unit (i.e., a Groq AI accelerator), enabling low-latency, high-throughput inference. This paper describes the system architecture, supporting software stack, and performance projections, demonstrating the advantages of this hybrid platform for future X-ray imaging systems. This integration shows promise for advancing real-time edge computing and enabling intelligent control in photon science experiments. A single inference on a 128 × 128 image, including image transfer time, completes in 156.06 µs, enabling approximately 6.4 kHz processing with the edgePtychoNN model and improving experiment-in-the-loop computing. Using this system, we achieve a 3.6× speedup over previous systems, highlighting the potential of this approach.
Paper
State of the Practice
Livestreamed
Recorded
TP
DescriptionThe high performance computing (HPC) community has adopted incentive structures to motivate reproducible research, with major conferences awarding badges to papers that meet reproducibility requirements. Yet, many papers do not meet such requirements. The uniqueness of HPC infrastructure and software, coupled with strict access requirements, may limit opportunities for reproducibility. In the absence of resource access, we believe that regular documented testing, through continuous integration (CI), coupled with complete provenance information, can be used as a substitute. Here, we argue that better HPC-compliant CI solutions will improve reproducibility of applications. We present a survey of reproducibility initiatives and describe the barriers to reproducibility in HPC. To address existing limitations, we present a GitHub Action, CORRECT, that enables secure execution of tests on remote HPC resources. We evaluate CORRECT's usability across three different types of HPC applications, demonstrating the effectiveness of using CORRECT for automating and documenting reproducibility evaluations.
Birds of a Feather
Emerging Hardware & Software Technologies
Livestreamed
Recorded
TP
XO/EX
DescriptionTestbeds play a vital role in assessing the readiness of novel architectures for upcoming supercomputers for the exascale and post-exascale era. These testbeds also act as co-design hubs, enabling the collection of application operational requirements, while identifying critical gaps that need to be addressed for an architecture to become viable for HPC. Various research centers are actively deploying testbeds, and our aim is to build a community that facilitates the sharing of information, encouraging collaboration and understanding of the available evaluation resources. This BoF will facilitate the exchange of best practices, including testbed design, benchmarking, system evaluation, and availability.
SCinet Network Research Exhibition
Not Livestreamed
Not Recorded
DescriptionNo NRI
Tutorial
Livestreamed
Recorded
TUT
DescriptionThe vast majority of production parallel scientific applications today use MPI and run successfully on the largest systems in the world. Parallel system architectures are evolving to include complex, heterogeneous nodes comprising general-purpose CPUs as well as accelerators such as GPUs. At the same time, the MPI standard itself is evolving to address the needs and challenges of future extreme-scale platforms as well as applications. This tutorial will cover several advanced features of MPI that can help users program modern systems effectively. Using code examples based on scenarios found in real applications, we will cover several topics including one-sided communication, hybrid programming (MPI + threads, shared memory, GPUs), neighborhood and nonblocking collectives, some of the new performance-oriented features in MPI-4, and the new ABI (application binary interface) in MPI-5. Attendees will leave the tutorial with an understanding of how to use these advanced features of MPI and guidelines on how they might perform on different platforms and architectures.
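One of the tutorial topics, one-sided communication, can be previewed in a few lines of mpi4py (shown in Python rather than C for brevity). This is a minimal sketch intended to be run with two ranks; the window size and payload are arbitrary.

```python
# One-sided communication (MPI RMA): rank 0 exposes a memory window and rank 1 Puts
# into it without a matching receive. Run with `mpirun -n 2 python rma.py`.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = np.zeros(4 if rank == 0 else 0, dtype="i")
win = MPI.Win.Create(buf, comm=comm)       # rank 0's buffer is the exposed memory

win.Fence()                                # open the access epoch
if rank == 1:
    data = np.arange(4, dtype="i")
    win.Put(data, target_rank=0)           # write directly into rank 0's window
win.Fence()                                # close the epoch; data is now visible on rank 0

if rank == 0:
    print("rank 0 sees:", buf)
win.Free()
```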
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
DescriptionModern data center workloads demand substantial server resources, motivating the adoption of data processing units (DPUs) for improved efficiency. Despite increasing deployment, systematic characterization of SoC-based DPUs remains limited. We present a rigorous evaluation of NVIDIA’s BlueField-1, BlueField-2 (BF-2), and BlueField-3 (BF-3) across 15 benchmarks, revealing key idiosyncrasies in network, DMA, and memory. We further provide design recommendations and release our artifacts to the community. Additionally, naively integrating DPUs into workloads often reduces server resource usage without necessarily delivering high performance. In-memory key-value stores (KVS) are widely used for edge data storage, where low latency and high throughput are essential. We explore fine-grained offloading of in-memory CPU-based KVS to SoC-based DPUs by decomposing KVS and offloading the communication engine, the most CPU-intensive component, to enhance performance. We also propose a series of performance optimizations, such as overlapped request/response handling, reduced DMA operations, and dual communication engines. Our design achieves up to 68% lower latency and 36% higher throughput compared to CPU-only or coarse-grained offloading.
Applications in containers or VMs commonly rely on TCP/IP for communication in HPC clouds and data centers, yet TCP/IP introduces significant bottlenecks for NVMe-over-Fabrics I/O in disaggregated storage. We propose NVMe-over-Adaptive-Fabric (NVMe-oAF), an adaptive communication channel that leverages locality awareness and optimized shared memory/TCP paths to accelerate I/O-intensive workloads. Co-designed with Intel’s SPDK, NVMe-oAF achieves up to 7.1x higher bandwidth and 4.2x lower latency compared to TCP/IP over commodity Ethernet (10–100 Gbps), while delivering up to 7x bandwidth gains for HDF5 applications when integrated with H5bench.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionElectroencephalography (EEG) is widely used in brain–computer interfaces, but movement-related signals are weak, variable, and often buried in noise. Classical pipelines, such as Random Forests trained on PCA+CSP features, work fairly well but can miss cross-channel patterns. Quantum machine learning offers a different approach by embedding features in high-dimensional Hilbert spaces. In this study, we built a 10-qubit variational quantum classifier (VQC) in Qiskit and compared it with a tuned Random Forest baseline using a curated subset of the PhysioNet Motor Movement dataset. Each EEG window was compressed to 10 dimensions using PCA and CSP. Over 40 repeated simulations on GPU backends, the VQC reached stronger best-case performance (macro-F1 up to 0.95 versus 0.70 for Random Forest) and much higher recall on movement detection, albeit with greater variance. These results point to the potential of compact quantum classifiers for EEG and the open challenge of variance.
Workshop
Education & Workforce Development
Livestreamed
Recorded
TP
W
DescriptionLarge artificial intelligence (AI) models and generative large language models (LLMs) are key computational drivers. For researchers developing new tools or incorporating LLMs into their processing pipelines, the scale of data and models requires supercomputing resources that can only be met through cloud or High Performance Computing (HPC) architectures. Many of these researchers have deep experience with AI, LLMs, and their research area but are new to HPC concepts, challenges, tools, and practices. To assist this researcher community, the Research Facilitation Teams at the MIT Office of Research Computing and Data (ORCD) and the MIT Lincoln Laboratory Supercomputing Center (LLSC) have developed tutorial materials to teach researchers how to build their own Retrieval Augmented Generation (RAG) workflows. This work details LLM-RAG implementation concerns on two different systems, the design decisions associated with developing the examples, the deployment of the workshop training, and the feedback received from the participants.
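For orientation, the RAG pattern the tutorial teaches boils down to three steps: embed and index a document corpus, retrieve the most similar passages for a question, and prompt an LLM with that retrieved context. The sketch below shows that skeleton with deliberately simplistic, hypothetical stand-ins (a hashed bag-of-words "embedding" and a dummy generate function) in place of the real embedding model and LLM a site would deploy.

```python
import numpy as np

def embed(texts, dim=256):
    """Hypothetical stand-in: hashed bag-of-words vectors instead of a real embedding model."""
    vecs = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            vecs[i, hash(tok) % dim] += 1.0
    return vecs

def generate(prompt):
    """Hypothetical stand-in for a call to a locally hosted LLM endpoint."""
    return f"[LLM answer conditioned on a prompt of {len(prompt)} chars]"

documents = ["Slurm partition limits ...", "How to request GPU nodes ...", "Filesystem quotas ..."]
doc_vecs = embed(documents)                               # 1) index the corpus offline

def answer(question, k=2):
    q = embed([question])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-12)
    top = np.argsort(sims)[::-1][:k]                      # 2) retrieve top-k by cosine similarity
    context = "\n\n".join(documents[i] for i in top)
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")  # 3) grounded generation

print(answer("How do I request a GPU node?"))
```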
ACM Gordon Bell Climate Modeling Finalist
ACM Gordon Bell Finalist
Awards and Award Talks
Applications
GB
Livestreamed
Recorded
TP
DescriptionAdvanced ab initio materials simulations face growing challenges as the increasing complexity of systems and phenomena requires higher accuracy, driving up computational demands. Quantum many-body GW methods are the state of the art for treating electronic excited states and couplings, but are often hindered by their costly numerical complexity. Here, we present innovative implementations of advanced GW methods within the BerkeleyGW package, enabling large-scale simulations on the Frontier and Aurora exascale platforms. Our approach demonstrates exceptional versatility for complex heterogeneous systems with up to 17,574 atoms, along with achieving true performance portability across GPU architectures. We demonstrate excellent strong and weak scaling to thousands of nodes, reaching double-precision core-kernel performance of 1.069 ExaFLOP/s on Frontier (9,408 nodes) and 707.52 PetaFLOP/s on Aurora (9,600 nodes), corresponding to 59.45% and 48.79% of peak, respectively. Our work demonstrates a breakthrough in utilizing exascale computing for quantum materials simulations, delivering unprecedented predictive capabilities for rational designs of future quantum technologies.
Workshop
Livestreamed
Recorded
TP
W
DescriptionEnabling automated data annotation and efficient search is critical to workflow automation at experimental user facilities such as the Linac Coherent Light Source (LCLS) that produce large amounts of data annually. Current data annotation methods are primarily manual, limiting scalability. To this end, we investigate the potential of using automated ML pipelines such as Sciencesearch for the task of generating metadata from unstructured text sources such as experiment descriptions and logbook entries. Early results demonstrate that natural language processing pipelines can effectively produce good keywords, paving the way for making light source data searchable. We identify critical challenges that must be addressed, including data sharing policies that hinder access to data, heterogeneity in logbook formats, vocabulary drift, and the evolving role of generative AI. We also propose potential short- and long-term solutions to these challenges, with the long-term goal of improving metadata management for AI-enabled workflows.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionThis study introduces FoodSAFE, a novel high performance computing (HPC)-based distributed data poisoning (DDP) framework designed to benchmark adversarial resilience and training performance. The framework is tested across eight distinct configurations—seven distributed frameworks and one non-distributed baseline. FoodSAFE evaluates three diverse food-related datasets and four model architectures, ranging from small neural networks to large-scale transformers. The framework integrates eight advanced adversarial attacks: FGSM, PGD, DeepFool, One-Pixel, Universal, Carlini-Wagner, Trojan, and Boundary. It investigates how data, model, and hybrid parallelization strategies affect scalability, memory constraints, and vulnerability under real-world conditions. Additionally, the study presents the AdversaGuard app to enable live testing of these DDP techniques. Results indicate that while some architectures show greater tolerance to adversarial poisoning, larger models often exhibit heightened vulnerability, highlighting the critical need for adaptive and scalable defense strategies in modern AI systems.
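Of the attacks listed, FGSM is the simplest to state: perturb each input by a small step in the direction of the sign of the loss gradient. The PyTorch sketch below shows that one attack applied to a throwaway classifier; it is illustrative only and unrelated to the framework's actual models, datasets, or distributed setup.

```python
import torch

def fgsm(model, x, y, eps, loss_fn=torch.nn.functional.cross_entropy):
    """Fast Gradient Sign Method: nudge inputs in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()   # keep pixels in [0, 1]

# Toy usage with a throwaway linear "image" classifier (10 food classes, 32x32 RGB inputs).
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(8, 3, 32, 32)             # stand-in for a batch of food images
y = torch.randint(0, 10, (8,))
x_adv = fgsm(model, x, y, eps=8 / 255)
print((x_adv - x).abs().max())           # perturbation magnitude bounded by eps
```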
ACM Gordon Bell Climate Modeling Finalist
ACM Gordon Bell Finalist
Awards and Award Talks
Applications
GBC
Livestreamed
Recorded
TP
DescriptionGenerative machine learning offers new opportunities to better understand complex Earth system dynamics. Recent diffusion-based methods address spectral biases and improve ensemble calibration in weather forecasting compared to deterministic methods, yet have so far proven difficult to scale stably at high resolutions. We introduce AERIS, a 1.3B to 80B parameter pixel-level Swin diffusion transformer to address this gap, and SWiPe, a generalizable technique that composes window parallelism with sequence and pipeline parallelism to shard window-based transformers without added communication cost or increased global batch size. On Aurora (10,080 nodes), AERIS sustains 10.21 ExaFLOPS (mixed precision) and a peak performance of 11.21 ExaFLOPS with 1x1 patch size on the 0.25 degree ERA5 dataset, achieving 95.5% weak scaling efficiency and 81.6% strong scaling efficiency. AERIS outperforms the IFS ENS and remains stable on seasonal scales to 90 days, highlighting the potential of billion-parameter diffusion models for weather and climate prediction.
Early Career
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Workshop
Livestreamed
Recorded
TP
W
Workshop
Livestreamed
Recorded
TP
W
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
Workshop
Partially Livestreamed
Partially Recorded
TP
W
Workshop
Performance Evaluation, Scalability, & Portability
Livestreamed
Recorded
TP
W
Workshop
Livestreamed
Recorded
TP
W
DescriptionWorkflows play a critical role in science and engineering. Developing robust simulation workflows is challenging due to the need to link multiple models, often coming from different sources and in the absence of data exchange standards. We propose an agentic AI framework for building simulation workflows using large language models (LLMs), with domain knowledge driving algorithm discovery and code generation. We believe that such an approach can significantly reduce workflow development time and, more generally, be used in automated and autonomous scientific discovery.
Workshop
Livestreamed
Recorded
TP
W
DescriptionHigh Performance Computing (HPC) applications rely heavily on code optimizations to achieve good performance on modern CPU and GPU architectures. Traditional machine learning autotuning approaches have demonstrated success in exploring high-dimensional spaces, but they often require expensive compile-run evaluations and lack adaptability for large HPC applications. Recent advances in Large Language Models (LLMs) and Agentic AI systems raise intriguing questions about the potential of these approaches to address specific optimization methodologies. This work aims to answer an essential question for the HPC community: how do Agentic AI systems compare to traditional ML autotuning techniques?
To address this question, we present a comparative analysis between a traditional ML-based optimization approach and an Agentic AI system, evaluating their respective capabilities and limitations for loop-level optimization. In addition, we introduce a new Agentic AI system named LoopGen-AI using three different Large Language Models: GPT-4.1, Claude 4.0, and Gemini 2.5.
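To ground the comparison, the following sketch shows what the traditional side looks like at its most basic: an empirical search over a loop-level parameter (here, a tile size for a toy tiled matrix multiply), timing each candidate in a compile-run-style loop. It is a generic illustration, not the paper's ML autotuner or LoopGen-AI, and the problem size and candidate tile sizes are arbitrary.

```python
import time
import numpy as np

def tiled_matmul(A, B, tile):
    """Naive tiled matrix multiply; the tile size is the tunable loop-level parameter."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                C[i0:i0+tile, j0:j0+tile] += A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
    return C

def autotune(n=256, candidates=(8, 16, 32, 64, 128)):
    """Exhaustive empirical search over tile sizes: the run-and-measure loop autotuners rely on."""
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    timings = {}
    for t in candidates:
        start = time.perf_counter()
        tiled_matmul(A, B, t)
        timings[t] = time.perf_counter() - start
    return min(timings, key=timings.get), timings

best, timings = autotune()
print("best tile:", best, {k: round(v, 4) for k, v in timings.items()})
```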
SCinet Network Research Exhibition
Not Livestreamed
Not Recorded
DescriptionNRI103,NRI104,NRI106
Paper
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionGraphics processing units have become essential for computationally intensive applications. However, emerging workloads often involve processing data exceeding GPU on-chip memory capacity. To mitigate this issue, existing solutions enable GPUs to use CPU DRAM or SSDs as external memory. Among them, the GPU-centric approach lets GPU threads directly access SSDs, eliminating the CPU intervention overhead of traditional methods. However, the existing work adopts a synchronous model, and threads must tolerate the long communication latency before starting any tasks.
In this work, we propose AGILE, a lightweight and efficient asynchronous library allowing GPU threads to access SSDs asynchronously. We demonstrate that AGILE achieves up to 1.88x improvement in workloads with different CTCs. Additionally, AGILE achieves a 1.75x performance improvement on DLRMs against the SOTA work BaM. AGILE also exhibits low API overhead on graph applications. Lastly, AGILE requires up to 1.32x fewer registers in CUDA kernels.
Birds of a Feather
Applications
Livestreamed
Recorded
TP
XO/EX
DescriptionAgriculture worldwide is facing massive challenges in production, distribution, pollution reduction, food security, and waste: in a $4 trillion global food production industry, less than 40% of any crop is actually marketed. The farm, the oldest human-engineered system, produces the vast majority of human sustenance and consumes the majority of global freshwater. Its efficient operation is crucial—particularly when supply chains are disrupted by wars and pandemics. This third BoF will continue discussing how novel supercomputing technologies, AI, and related distributed heterogeneous systems are empowering the primary sector so that it no longer has to operate in a needlessly fragile and inefficient way.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis paper presents a modular architecture for enabling autonomous cross-facility scientific experimentation using AI agents at ORNL's HPC and manufacturing user facilities. The proposed system integrates a natural language interface powered by an LLM, a multi-agent framework for decision making, programmable facility APIs, and a provenance-aware infrastructure to support adaptive, explainable, and reproducible workflows. We demonstrate how AI agents can orchestrate and optimize additive manufacturing experiments through near real-time coordination between experimental and HPC resources. The architecture is evaluated through a realistic end-to-end workflow that employs a simulated version of the manufacturing facility, showing that the approach reduces coordination overhead and accelerates the scientific discovery process.
Tutorial
Livestreamed
Recorded
TUT
DescriptionKubernetes has emerged as the leading container orchestration solution (maintained by the Cloud Native Computing Foundation) that works on resources ranging from on-prem clusters to commercial clouds. Kubernetes capabilities are available on Expanse, Voyager, and Prototype National Research Platform (PNRP) Nautilus clusters at SDSC. These clusters support AI and scientific computing research workloads. Recently there has also been rapid growth in the use of AI resources for educational purposes. Several institutions have incorporated LLMs into their curriculum, leveraging Nautilus services and resources. This tutorial aims to educate AI and computational science researchers on the capabilities of Kubernetes as a resource management system, compared with traditional batch systems; provide information on useful IO/storage options, and optimal use strategies for AI workloads; and demonstrate the use of Kubernetes-based solutions integrating LLM inference use for classroom use via JupyterHub. Attendees will get an overview of the Kubernetes architecture, typical job and workflow submission procedures, use various storage options, run AI and scientific research software using Kubernetes using both CPU and GPU resources, learn about optimal I/O strategies for AI, and run examples leveraging LLM inference services on Nautilus. Theoretical information will be paired with hands-on sessions operating on the PNRP production cluster Nautilus.
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionI will present the experience of Recod.ai in developing AI models and how we improved our HPC architecture to optimize performance in terms of availability. We started with a manual datasheet for reserving resources, until now…
Birds of a Feather
Democratization of HPC
Livestreamed
Recorded
TP
XO/EX
DescriptionThis interactive panel will explore how AI factories—from 130 kW pods to gigawatt-scale campuses—are emerging as the unit of compute, requiring modular design, standardization, and reference architectures to enable rapid, scalable deployment. Panelists will share lessons learned implementing these diverse solutions, addressing power, cooling, and service strategies needed to democratize AI at scale. The session will encourage audience participation to discuss practical approaches for designing, deploying, and operating AI infrastructure of all sizes to meet growing demands efficiently and sustainably.
Panel
AI, Machine Learning, & Deep Learning
High Performance I/O, Storage, Archive, & File Systems
Livestreamed
Recorded
TP
DescriptionAI infrastructure demands a rethink of everything we know about data architecture. In this panel, we’ll explore how traditional HPC systems fall short in supporting AI factories, and look at how object interfaces, key-value caching, and global data mobility take center stage. We’ll examine why the old paradigms of file-based I/O, tight coupling, and centralized metadata can't meet the latency and concurrency needs of modern AI workflows—and how new approaches are emerging to address the unique lifecycle of AI data. If HPC was built for simulation, AI factories are being built for adaptation. The tools—and rules—are changing fast.
Birds of a Feather
Practitioners in HPC
Livestreamed
Recorded
TP
XO/EX
DescriptionIn recent years, the HPC ecosystem has undergone profound changes. In Europe, the EuroHPC JU has invested heavily in developing a world-class supercomputing ecosystem with a very strong focus on AI. The objective of this BoF is to give an overview of the current state of HPC activities in Europe, Japan, and the U.S., with a particular focus on investments in public AI infrastructure. Together with the international HPC stakeholders, we will present and discuss the current state of play, future plans, and challenges, and critically analyze the impact of an AI-focused strategy on the HPC ecosystem in general.
After a general introduction, there will be short contributions from the international HPC organizations presenting the current development of the HPC and AI ecosystems in their respective regions. A discussion with the audience will surface different perspectives from industry and research. We will address hardware and software challenges and identify gaps. The BoF will close with a summary of the results to prepare recommendations.
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
DescriptionShared HPC centers are often underutilized because jobs are commonly mis-specified for walltime, memory, and accelerators. This mis-specification causes queue churn, idle hardware, and long turnaround times. The main challenge is structural: researchers face a steep learning curve across different nodes, policies, and cost models. As a result, they often "guess and submit" with limited guidance. This work introduces the following center-focused solutions that use predictive models to guide scheduling.
(A) Estimators (black-box + white-box): Two complementary predictors estimate runtime and memory usage based on hardware and configuration. Black-box learners fit from prior runs; white-box models use operator/graph features and scaling laws to generalize. When modelled together, they predict resources with limited training data.
(B) HARP framework: HARP systematizes data generation, model building, and selection. It selects estimators based on measured error under site policy (queue limits, billing), resulting in a policy-compliant plan for walltime, memory, and devices.
(C) Estimator with Scheduler integration: A scheduler composes estimator outputs with TAPIS to produce valid submissions, select queues/partitions, and trade off time and cost. Supports resubmission strategies and “what-if” planning.
(D) Closed-Loop Orchestration and Path to Agentic Scheduler: Kafka streams job and filesystem signals to the Intelligence Plane, where estimators enforce policies that drive scheduler daemons, data-generation, and orchestration tasks. Future work extends this loop with goal/constraint inference, as well as drift-triggered self-updates, enabling autonomous model training. This is accompanied by an optional LLM for user interaction and decision explanation, as well as an MCP-ready design for adaptive scheduling and planning.
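As a rough illustration of the black-box estimator idea in (A), the sketch below fits a gradient-boosted regressor on historical job records to predict walltime from submission features; the feature names, file path, and padding factor are hypothetical, not part of the HARP framework itself.

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Hypothetical history of completed jobs: requested resources plus observed runtime
history = pd.read_csv("job_history.csv")                    # placeholder path
features = ["n_nodes", "n_gpus", "input_gb", "queue_id"]    # illustrative features
X_train, X_test, y_train, y_test = train_test_split(
    history[features], history["runtime_minutes"], test_size=0.2, random_state=0
)

model = GradientBoostingRegressor().fit(X_train, y_train)

# Pad the prediction before submission so under-estimates rarely kill the job
predicted = model.predict(X_test)
walltime_request = 1.2 * predicted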
Doctoral Showcase
Interactive Research e-Poster
Research & ACM SRC Posters
Not Livestreamed
Not Recorded
TP
DescriptionHigh-quality, ethically-governed, and efficiently structured data is important for effective AI. However, organizations often lack a unified method to assess whether datasets are ready for AI modeling. AIDRIN (AI Data Readiness Inspector) provides a comprehensive, multi-pillar framework that quantifies AI data readiness across six dimensions: Quality, Impact on AI, Understandability and Usability, Fairness and Bias, Structure and Organization, and Governance. The tool enables data teams to identify issues early, prioritize remediation, and make informed modeling decisions. AIDRIN is accessible as a web application, a Python package on PyPI, and openly developed on GitHub for community use and contribution, making it flexible for various workflows. Its interactive visualizations and interpretable reports help both technical and non-technical users understand dataset strengths and weaknesses. We extend AIDRIN by adding a customizability module, allowing users to define their own metrics and remedies to evaluate and prepare data for AI.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionMany complex systems across diverse domains can be represented as dynamic networks, where entities are modeled as time-varying nodes and interactions among these entities are modeled as evolving edges. Analyzing such networks provides insights into the underlying temporal characteristics of the system and supports informed decision-making. However, timely and resource-efficient analysis of large, complex networks is challenging without specialized approaches, as it requires continuous updates to graph properties. Previously, we presented CANDY (Cyberinfrastructure for Accelerating Innovation in Network Dynamics), a scalable platform for modeling, managing, and analyzing large dynamic networks.
Here, we showcase real-world applications in transportation, social networks, and public safety, modeled as dynamic networks and analyzed using CANDY.
Birds of a Feather
Community Meetings
Education & Workforce Development
Livestreamed
Recorded
TP
XO/EX
DescriptionThe SC25 Americas HPC Collaboration BoF will primarily focus on HPC-related training efforts across the continent. Selected successful training initiatives will be showcased, including the CyberColombia and Santos Dumont summer school series, the DevOps for HPC School at CARLA, Mexico’s CONACyT-sponsored school, and the new HPC curriculum development at UPR Arecibo.
The expected outcomes are: expansion of continent-wide educational initiatives, including summer schools, hackathons, and bootcamps; and the formal launch of the Americas HPC Collaboration—Education and Workforce Development Chapter as a collaborative effort to share experiences, training materials, and infrastructure among institutions across the Americas.
Paper
Algorithms
Livestreamed
Recorded
TP
DescriptionMesh partitioning is critical for scalable distributed PDE solvers. Traditional methods like spatial ordering and multi-level graph partitioning have significant tradeoffs between partition quality and parallel scalability. We present AMRaCut, a distributed-parallel mesh partitioner that bridges this gap using parallel label propagation and graph diffusion. It operates mostly locally on initial partitions, limiting inter-process communication to neighboring processes. This locality is especially effective in AMR, where the mesh evolves dynamically with mostly local changes.
AMRaCut achieves 5x-10x speedups over multi-level partitioners (ParMETIS, PT-Scotch) while producing partitions of comparable quality and minimized boundaries. Its efficiency is comparable to sorting-based methods like space-filling curves. AMRaCut maintains maximum partition load within 2x of optimal, sufficient for distributed scalability.
We verify that AMRaCut is effective in downstream tasks by evaluating a finite element model SpMV operation. Despite the 2x imbalance, AMRaCut partitions perform on par with ParMETIS/PT-Scotch partitions, outperforming spatially ordered partitions.
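The abstract does not give AMRaCut's exact update rule, but the flavor of label-propagation partition refinement can be sketched as follows: each vertex repeatedly adopts the most common partition label among its neighbors, subject to a simple load cap. This is a minimal sketch, not the paper's algorithm.

from collections import Counter

def refine_partition(adj, labels, max_load, n_iters=10):
    """adj: dict vertex -> list of neighbors; labels: dict vertex -> partition id."""
    for _ in range(n_iters):
        load = Counter(labels.values())
        for v, neighbors in adj.items():
            if not neighbors:
                continue
            # Most common label among neighbors (purely local information)
            candidate, _ = Counter(labels[u] for u in neighbors).most_common(1)[0]
            if candidate != labels[v] and load[candidate] < max_load:
                load[labels[v]] -= 1
                load[candidate] += 1
                labels[v] = candidate
    return labels

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(refine_partition(adj, {0: 0, 1: 1, 2: 0, 3: 1}, max_load=2))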
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionFecal microbial transplant (FMT) is an effective procedure for restoring gut microbiome balance in patients with Clostridioides difficile infection by introducing healthy donor microbes. Tracking viral genomes during FMT provides insight into microbial community transfer and recovery. We developed a viral detection workflow that processes metagenomic samples to identify, dereplicate, cluster, and annotate viral sequences using GeNomad, CheckV, MMseqs2, and BLAST. The workflow links viral sequences to donor and patient samples, enabling longitudinal tracking. Traditionally, such workflows run sequentially with predefined tools and steps. We compare this workflow against an agent-based workflow that selects the viral detection tool dynamically based on the sequence quality and database match scores of prior samples. Scaling experiments show that parallelizing the workflow using Parsl reduces runtime by over 50%. Tool comparison reveals trade-offs in speed, quality, and match ratio, demonstrating the benefits of adaptive, agent-driven workflows for scalable viral detection in microbiome studies.
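The Parsl parallelization mentioned above can be pictured as a set of independent python_app tasks whose futures are gathered at the end; the sample file names and the per-sample body below are stand-ins, not the authors' actual pipeline code.

import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor

parsl.load(Config(executors=[ThreadPoolExecutor(max_threads=8)]))

@python_app
def detect_viruses(sample_path):
    # Placeholder for the real per-sample pipeline (GeNomad, CheckV, MMseqs2, BLAST);
    # here we only return a string to keep the sketch self-contained and runnable.
    return f"viral contigs for {sample_path}"

samples = ["donor.fastq", "patient_t1.fastq", "patient_t2.fastq"]  # hypothetical files
futures = [detect_viruses(s) for s in samples]   # all samples dispatched concurrently
results = [f.result() for f in futures]          # gather once every task completes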
Workshop
Livestreamed
Recorded
TP
W
DescriptionWhile data modalities like scRNA-seq, histology, and DNA methylation offer valuable insights into cellular responses to external perturbations, learning from such datasets is often limited by the user's ability to analyze large data volumes and their familiarity with the existing knowledge base and tools. Moreover, there is a tendency to favor well-established mechanisms even when studying new biology, which can limit the exploration of novel or unexpected biological pathways. Manual curation of literature, pathway databases, and public datasets is time-consuming, and traditional analysis pipelines are typically static, tool-specific, lack self-correcting capabilities, and are thus difficult to scale.
To overcome these challenges, we present Agentic Lab, an AI agentic framework for accelerating biomedical discovery through automated, collaborative scientific inquiry via a set of specialized agents. Agents are entities that use prompts to understand their tasks, LLMs to reason over them, and tools to interact with the outside environment. Unlike conventional linear workflows, these agents continuously reason, search, reflect, and adapt. For example, an unexpected gene expression pattern can automatically trigger new literature searches, hypothesis refinement, or reanalysis of data, and coding errors and missing packages can be automatically detected and fixed. Agentic Lab uses a Principal Investigator (PI) Agent as the entry point: it interprets the user-defined task and, with the assistance of a Browsing Agent that retrieves knowledge from scientific repositories, user-provided files, and web links, formulates a research workflow. The PI Agent then assigns specific tasks to specialized agents: Code Writer and Executor Agents generate, run, and debug code, and a Critic Agent ensures robustness through continuous evaluation of results and processes. This framework integrates literature curation, hypothesis generation, code development, and data analysis in iterative cycles, with the option for the user to intervene at any point. The framework is driven entirely by open-weight LLMs that can be hosted locally using limited resources, enabling local, privacy-preserving execution without reliance on costly APIs. Our approach combines smart prompting, tool augmentation, and human-in-the-loop validation to maximize the performance of smaller models in complex biomedical discovery.
We apply this framework to study low-dose (LD) radiation effects, where the carcinogenic risks below 10 mGy remain poorly understood despite widespread exposure from natural background (e.g., radon, cosmic rays), medical imaging, and nuclear industries. Using scRNA-seq data from the human lung epithelial BEAS-2B cell line exposed to Cs-137 gamma radiation at low (10 mGy), medium (100 mGy), and high (1 Gy) doses, we investigate transcriptional changes across dose levels to identify differences in underlying biological mechanisms. We use Geneformer, a pre-trained transformer-based single-cell foundation model, to generate contextual gene and cell embeddings for in-silico perturbation (ISP) studies and to identify key drivers of cell state transitions associated with LD exposure. By analyzing shifts in the latent embedding space, we map dysregulated genes and pathways implicated in stress response and early malignant transformation. Agentic Lab interacts with HPC environments to submit jobs that carry out fine-tuning of pretrained Geneformer models and ISP.
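The division of labor among agents can be pictured with the skeleton below; the class and method names are invented for illustration and do not reflect Agentic Lab's actual code, and the stub LLM stands in for a locally hosted open-weight model.

class Agent:
    def __init__(self, role, llm):
        self.role, self.llm = role, llm

    def act(self, task, context=""):
        # Each agent reasons over a role-specific prompt with the shared LLM callable
        return self.llm(f"You are the {self.role}.\nTask: {task}\nContext: {context}")

def run_workflow(user_task, llm, max_rounds=3):
    pi, browser = Agent("Principal Investigator", llm), Agent("Browsing Agent", llm)
    coder, critic = Agent("Code Writer", llm), Agent("Critic", llm)

    plan = pi.act(user_task, context=browser.act(f"Gather background for: {user_task}"))
    code = ""
    for _ in range(max_rounds):
        code = coder.act(plan)
        review = critic.act("Evaluate this analysis code and its results", context=code)
        if "acceptable" in review.lower():      # crude stopping rule for the sketch
            return code
        plan = pi.act(user_task, context=review)  # iterate with the critique
    return code

stub_llm = lambda prompt: "plan / code / acceptable"   # stand-in for a local model
print(run_workflow("Find pathways dysregulated after low-dose exposure", stub_llm))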
Workshop
Livestreamed
Recorded
TP
W
DescriptionIsambard-AI is a UK National AI Research Resource (AIRR) that was formally launched as a service in July 2025. In preparation for the launch, multiple co-design efforts were undertaken to address accessibility challenges faced by diverse AI research teams and to align with performance and environmental sustainability objectives. This talk highlights the application of AI frameworks, including the Model Context Protocol (MCP), together with operational data collection methods to evaluate the effectiveness of our data-driven co-design strategies.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionPerformance-portability libraries such as RAJA enable single-source applications to run on diverse architectures, but performance often depends on compiler decisions that are hard to observe. Existing tools either show compiler activity without runtime context or runtime performance without compiler provenance. We present an approach that integrates compiler optimization data into runtime profiles, allowing developers to link specific optimizations to their performance impact. We demonstrate this approach through a case study where we determine the compiler requirements of kernels from the RAJA Performance Suite.
Workshop
Livestreamed
Recorded
TP
W
DescriptionDeveloping efficient hardware accelerators for mathematical kernels used in scientific applications and machine learning has traditionally been a labor-intensive task. Accelerator development typically requires low-level programming in Verilog alongside significant manual optimization effort. Recently, to alleviate this challenge, high-level hardware design tools like Chisel and High-Level Synthesis have emerged. However, as with any compiler, some of the generated Verilog may be suboptimal compared to expert-crafted designs. Understanding where these inefficiencies arise is crucial, as it provides valuable insights for both users and tool developers. In this paper, we propose a methodology to hierarchically decompose mathematical kernels—such as Fourier transforms, matrix multiplication, and QR factorization—into a set of common building blocks or primitives. The primitives are then implemented in the different programming environments, and the larger algorithms are assembled from them. Furthermore, we employ an automatic approach to investigate the achievable frequency and required resources at each level, identifying key locations where designs may deviate from expert-crafted implementations.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionTransformer-based large language models (LLMs) have demonstrated remarkable capabilities in natural language processing (NLP) tasks. The transformer layers in LLMs involve substantial general matrix multiplication (GEMM). However, sequence-length variability leads to redundant computation and hardware resource overhead when GEMM uses a uniform-size padding approach, reducing inference speed.
This work proposes an efficient GEMM acceleration method for LLM inference with variable-length sequences. First, a fused parallel prefix scan design is developed to capture the matrix dimension distribution. Second, an efficient various-size tile kernel is implemented based on Matrix Core, with an analysis of the hardware resource requirements in the computation process. Third, a hardware-aware tiling algorithm is designed to select the optimal tiling scheme based on thread parallelism and hardware resources. The experimental results show that the proposed approach achieves performance improvements of 3.10x and 2.99x (up to 4.44x and 4.27x) over hipBLAS and rocBLAS.
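The hardware-aware tiling idea can be illustrated with a toy selector that histograms the sequence lengths and picks, per length, the smallest tile that both fits the available registers and covers the sequence; the tile options and register costs below are made up for illustration and are not the paper's actual scheme.

import numpy as np

TILE_OPTIONS = [16, 32, 64, 128]                        # candidate tile sizes
REGS_PER_TILE = {16: 32, 32: 64, 64: 128, 128: 256}     # hypothetical register cost

def choose_tiles(seq_lengths, regs_per_cu=512):
    """Return one tile size per observed sequence length."""
    counts = np.bincount(seq_lengths)        # cheap stand-in for the prefix-scan histogram
    plan = {}
    for length in np.nonzero(counts)[0]:
        feasible = [t for t in TILE_OPTIONS if REGS_PER_TILE[t] <= regs_per_cu]
        # Smallest feasible tile that covers the sequence avoids most padding waste
        covering = [t for t in feasible if t >= length] or [max(feasible)]
        plan[int(length)] = min(covering)
    return plan

print(choose_tiles(np.array([7, 12, 60, 100, 100])))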
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe pay-as-you-go cost model of cloud resources has necessitated the development of specialized programming models and schedulers for HPC jobs for efficient utilization of cloud resources. A key aspect of efficient utilization is the ability to rescale applications on the fly to maximize the utilization of cloud resources. Most commonly used parallel programming models, like MPI, have traditionally not supported autoscaling either in a cloud environment or on supercomputers. Charm++ is a parallel programming model that natively supports dynamic rescaling through its migratable objects paradigm. We present a Kubernetes operator to run Charm++ applications on a Kubernetes cluster. We also present a priority-based elastic job scheduler that can dynamically rescale jobs based on the state of the cluster to maximize cluster utilization while minimizing response time for high-priority jobs. We show that our elastic scheduler demonstrates significant performance improvements over traditional static schedulers.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe present a blueprint for a quantum middle layer that supports applications across various quantum technologies. Inspired by concepts and abstractions from HPC libraries and middleware, our design is backend-neutral and context-aware. A program only needs to specify its intent once, as typed data and operator descriptors: it declares what the quantum registers mean and which logical transformations are required. Execution details are carried separately in a context descriptor and can change per backend without modifying the intent artifacts.
We develop a proof-of-concept implementation that uses JSON files for the descriptors and two backends: a gate-model path realized with the IBM Qiskit Aer simulator and an annealing path realized with D-Wave Ocean's simulated annealer. On a Max-Cut problem instance, the same typed problem runs on both backends by varying only the operator formulation (Quantum Approximate Optimization Algorithm formulation vs. Ising Hamiltonian formulation) and the context.
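Although the abstract does not give the descriptor schema, the separation of intent from context can be pictured with two small JSON-like dictionaries for the Max-Cut example; every field name below is illustrative, not the authors' format.

import json

# Intent: what the registers mean and which logical transformation is required
intent = {
    "problem": "max-cut",
    "registers": {"nodes": 5},
    "operator": {"kind": "ising", "couplings": [[0, 1, 1.0], [1, 2, 1.0], [2, 3, 1.0]]},
}

# Context: backend-specific execution details, swappable without touching the intent
context_gate = {"backend": "qiskit-aer", "formulation": "qaoa", "shots": 1024}
context_anneal = {"backend": "dwave-ocean-sim", "formulation": "ising", "num_reads": 100}

for ctx in (context_gate, context_anneal):
    print(json.dumps({"intent": intent, "context": ctx}, indent=2))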
Birds of a Feather
Composable Systems
Livestreamed
Recorded
TP
XO/EX
DescriptionThe Sunfish Composable Disaggregated Infrastructure framework, combined with a deep reinforcement learning agent for scheduling, integrates with both HPC workload managers and container orchestrators to reduce application run-time latency, increase data center batch run efficiency, dynamically create ephemeral IO burst buffers, and mitigate problems from degraded hardware. Managing disaggregated resource pools with Sunfish minimizes idle resources and allows burst buffer allocations that create optimized execution environments for modern workloads, such as MOD/SIM and AI/ML. We will disclose our work integrating Sunfish with the Flux workload manager on a national lab testbed and discuss additional use cases within the industry.
Workshop
Livestreamed
Recorded
TP
W
DescriptionHigh-performance computing (HPC) education is at an inflection point, driven by agentic systems and “prompt-engineering” as a form of programming. We describe an interactive tutor built from autonomous LLM-based agents, each with a narrow role: planning lessons, explaining concepts, scaffolding code, and executing runs. Using open-source toolkits and locally hosted models on leadership-class supercomputers, the tutor lets educators generate and refine parallel-programming examples in real time without external APIs or subscription fees. Complex workflows are composed through structured prompts rather than traditional source code, while per-agent history summarization prevents context-window overflow and enables self-correcting code generation. Requiring no proprietary services, the platform is immediately deployable in institutional HPC environments and scales from single-user sessions to classroom labs. Beyond a teaching aid, it illustrates how prompt-driven, multi-agent software can deliver dynamic, personalized, and extensible learning experiences across technical domains.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWith the slowing of Moore’s Law, heterogeneous computing platforms such as Field-Programmable Gate Arrays (FPGAs) have gained increasing interest for accelerating HPC workloads. In this work we present, to the best of our knowledge, the first implementation of selective code offloading to FPGAs via the OpenMP target directive within MLIR. Our approach combines the MLIR OpenMP dialect with a High-Level Synthesis (HLS) dialect to provide a portable compilation flow targeting FPGAs. Unlike prior OpenMP FPGA efforts that rely on custom compilers, we integrate with MLIR and so support any MLIR-compatible front end, demonstrated here with Flang. Building upon a range of existing MLIR building blocks significantly reduces the effort required and demonstrates the composability benefits of the MLIR ecosystem. Our approach supports manual optimisation of offloaded kernels through standard OpenMP directives, and this work establishes a flexible and extensible path for directive-based FPGA acceleration integrated within the MLIR ecosystem.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis paper presents an efficient implementation of a linear-solver kernel optimized for a range of block sizes, commonly used in large-scale computational fluid dynamics (CFD) simulations. The implementation targets Aurora, the Argonne Leadership Computing Facility's (ALCF) exascale machine featuring Intel Data Center Max 1550 GPUs. The linear solver's performance is memory bandwidth-bound due to its low arithmetic intensity. The primary performance challenges stem from variable matrix row lengths and indirect memory access patterns inherent in unstructured-grid applications. Variable block sizes introduce additional complexity through differing levels of intra-block parallelism and the constraint of efficiently utilizing 512-bit vector registers. We propose an optimized implementation using ESIMD APIs that efficiently vectorize memory loads for block-sparse vector computations. We demonstrate that performance on the Intel 1550 GPU is within 10% of its bandwidth benchmark peak. We also compare the performance of the ESIMD kernels on Intel GPUs with CUDA-optimized implementations on NVIDIA GPUs.
Workshop
Livestreamed
Recorded
TP
W
DescriptionAI training and inference impose sustained, fine-grained I/O that stresses host-mediated, TCP-based storage paths. We revisit POSIX-compatible object storage for GPU-centric pipelines and present ROS2, an RDMA-first design that offloads the DAOS client to an NVIDIA BlueField-3 SmartNIC while leaving the server-side DAOS I/O engine unchanged. ROS2 splits a lightweight gRPC control plane from a high-throughput data plane (UCX/libfabric over RDMA or TCP), removing host mediation from the data path. Using FIO/DFS across local and remote settings, we show that on server-grade CPUs RDMA consistently outperforms TCP for large sequential and small random I/O. When the client is offloaded to BlueField-3, RDMA performance matches the host; TCP on the SmartNIC lags, underscoring RDMA’s advantage for offloaded deployments. We conclude that an RDMA-first, SmartNIC-offloaded object store is a practical foundation for LLM data delivery; optional GPUDirect placement is left for future work.
Workshop
Livestreamed
Recorded
TP
W
DescriptionTPC Leaders will discuss progress on organizing and incubating collaborative TPC initiatives that will drive hackathons in 2026, including how these initiatives inter-relate, a strategy for ensuring that they are application-driven, and how to get involved.
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
DescriptionAs scientific applications tackle more complex problems, data movement has also grown in complexity to the point of slowing execution time and compromising time-to-solution, hindering the pace of scientific discovery. In this work, we claim that, to continue to accelerate scientific discovery in the exascale era and beyond, we need a general-purpose, adaptable analytic framework for optimizing data movement in both monolithic and modular workflow-based applications. To design this framework, we study data movement across three diverse HPC applications, deriving three key lessons learned that guide the optimization of application I/O. First, profile-level performance analysis can be extended to reveal detailed data movement patterns. Second, middleware can substantially improve data movement efficiency for workflows by aligning I/O with workflow execution patterns. Third, matching I/O phases to targeted storage systems can yield substantial performance gains, but requires phase-aware monitoring and tuning. We use these lessons learned to design features—fine-grained I/O filtering, middleware-level workflow analysis, and dynamic phase-to-storage mapping—that we integrate into the general-purpose Analytics4X (A4X) framework to optimize performance across a wide range of applications and I/O patterns.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionIn high energy physics (HEP), large-scale experiments produce enormous data volumes that are distributed across global storage systems. To reduce redundant transfers and improve efficiency, disk caching systems such as XCache are deployed, but their effectiveness depends on a good caching policy. Our research asks: can we find patterns and reliably predict dataset popularity? This work investigates dataset-level “pinning,” where sets of files are retained in cache to improve hit rates. We explore the use of Hawkes processes, a statistical model of self-exciting events, to capture bursty, event-driven dataset popularity, a novel approach compared to previous efforts. Preliminary results suggest this framework improves predictability of future access patterns, thereby guiding more effective caching strategies. The poster will present our methodology, experimental setup, and early evaluation results, highlighting both the promise and current limitations of this approach.
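For readers unfamiliar with Hawkes processes, the conditional intensity used to model bursty, self-exciting dataset accesses has the standard exponential-kernel form sketched below; the parameter values and access times are placeholders, not fitted results from the poster.

import numpy as np

def hawkes_intensity(t, events, mu=0.1, alpha=0.5, beta=1.0):
    """lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i))."""
    past = np.asarray([e for e in events if e < t])
    return mu + alpha * np.exp(-beta * (t - past)).sum()

accesses = [1.0, 1.2, 1.3, 5.0]            # hypothetical access times for one dataset
print(hawkes_intensity(2.0, accesses))     # recent burst -> elevated intensity
print(hawkes_intensity(10.0, accesses))    # long quiet period -> near the baseline mu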
Workshop
Livestreamed
Recorded
TP
W
DescriptionKubernetes is a container orchestration system that offers reliable deployment of containerized applications. However, its steep learning curve and complex configuration requirements present barriers to its adoption. To address these challenges, we present AnvilOps, a platform-as-a-service designed to streamline the deployment and management of applications on a Kubernetes cluster. AnvilOps is developed at the Rosen Center for Advanced Computing (RCAC) to simplify application deployment on Anvil's Composable Kubernetes Subsystem. Through a web interface, AnvilOps enables users to deploy applications from GitHub repositories (including GitHub Enterprise) or publicly available images with minimal configuration, abstracting the details of image building and Kubernetes resource definitions. It also supports continuous deployment through GitHub webhooks, automatically redeploying apps on CI events. This paper will provide an architectural overview of AnvilOps, then discuss the results and future work.
Birds of a Feather
State of the Practice
Livestreamed
Recorded
TP
XO/EX
DescriptionFortran plays a crucial role in numerous applications. This BoF provides a forum for Fortran developers to engage with the language's modern programming features. With features introduced in recent language revisions, Fortran 2023 supports modern programming practices and high performance computing (HPC). This BoF gathers developers from diverse domains to share experiences and explore Fortran's evolving capabilities. After some brief presentations, the session will focus on an interactive discussion where audience members will be encouraged to share their own experiences and ask questions of our experts.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionGraph neural networks (GNNs) are a state-of-the-art machine learning model for processing graph-structured data. The growing complexity of GNNs and the size of real-world graphs have increased the memory requirements of GNN training, while popular training platforms, such as GPUs, have memory capacities on the scale of tens of GB.
In this work, we study scientific floating-point lossy compressors applied to GNN training memory reduction. We develop a framework for GNN activation lossy compression, analyze lossy compression and other data reduction techniques, and explore methods to leverage GNN data features to improve compression. This work is ongoing and will encompass more compression optimizations in the future.
The poster session will provide an overview of GNN training and opportunities for compression, followed by an analysis of cuSZp, a scientific floating-point lossy compressor, compared against quantization and reduced precision, and lastly a preliminary exploration of leveraging GNN attributes for compression with top-k methods.
Workshop
Livestreamed
Recorded
TP
W
DescriptionEffective in-situ analysis of target variables in scientific simulations is often constrained by its tight coupling with simulation timesteps, which can degrade performance and limit adaptive control of analysis quality. This work presents a surrogate modeling approach that decouples data collection and analysis from simulation execution. The surrogate models of targeted variables are trained online using early-stage simulation data and, once well trained, replace the simulation for targeted in-situ analysis, enabling asynchronous analysis and early termination decisions without pretraining or manual tuning. We evaluate the approach on various applications and data analysis tasks. Even accounting for the training carried out during the early stages of the simulation, we achieve speed-ups of 1.20x–3.51x compared to traditional in-situ tracking while maintaining accuracies of 83.33%–99.60% relative to the original simulation.
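The decoupling idea, train a cheap surrogate on early timesteps and then answer analysis queries from it, can be sketched with a generic scikit-learn regressor; the feature layout, retraining cadence, and switch-over test below are illustrative assumptions, not the paper's method.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

surrogate = RandomForestRegressor(n_estimators=100)
X_hist, y_hist = [], []        # (simulation inputs, tracked target variable)

def observe(step_inputs, target_value, retrain_every=50):
    """Called in situ during early timesteps to accumulate training data."""
    X_hist.append(step_inputs)
    y_hist.append(target_value)
    if len(y_hist) % retrain_every == 0:
        surrogate.fit(np.array(X_hist), np.array(y_hist))

def surrogate_ready(tolerance=0.05):
    """Crude switch-over test: relative error on the most recent observations."""
    recent_X, recent_y = np.array(X_hist[-20:]), np.array(y_hist[-20:])
    err = np.abs(surrogate.predict(recent_X) - recent_y) / (np.abs(recent_y) + 1e-12)
    return err.mean() < tolerance

for step in range(200):                      # synthetic early timesteps
    x = np.random.rand(3)
    observe(x, target_value=x.sum())         # stand-in for the real tracked variable
print("surrogate ready:", surrogate_ready())
# Once ready, analysis queries call surrogate.predict(...) asynchronously
# instead of waiting on the coupled simulation timestep.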
Workshop
Livestreamed
Recorded
TP
W
DescriptionTensor Cores (TCs) are specialized hardware units designed for efficient matrix multiplication and are widely utilized in deep learning workloads. However, their adoption in more irregular high-performance computing (HPC) applications remains limited. This paper presents a methodology for effectively integrating TCs into a representative HPC application: molecular docking with AutoDock-GPU. The irregular computational patterns and strict accuracy requirements of this application pose significant challenges for TC utilization. To address these, we adopt a twofold strategy: (i) accelerating sum reduction operations using TCs, and (ii) applying state-of-the-art numerical error correction (EC) techniques to maintain accuracy. Experimental evaluations on NVIDIA A100, H100, and B200 GPUs show that our CUDA-based implementation consistently outperforms the baseline while preserving algorithmic accuracy.
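The core trick of mapping a sum reduction onto matrix-multiply hardware can be illustrated in plain NumPy: multiplying by a vector of ones turns the reduction into a matrix product that Tensor Cores can execute, at the cost of lower-precision accumulation that an error-correction step would compensate for. This is a conceptual sketch, not the paper's CUDA kernel.

import numpy as np

x = np.random.rand(16, 256).astype(np.float16)   # 16 independent reductions of length 256

ones = np.ones((256, 1), dtype=np.float16)
tc_style_sum = (x @ ones).ravel()                 # reduction expressed as a matrix product
reference = x.astype(np.float64).sum(axis=1)      # high-precision reference

print(np.max(np.abs(tc_style_sum - reference)))   # residual that error correction would absorb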
Art of HPC
Birds of a Feather
Art of HPC
Livestreamed
Recorded
TP
XO/EX
DescriptionJoin us for an engaging Birds of a Feather session at SC25, centered around the Art of HPC. This interactive gathering aims to broaden the participation of the arts in high performance computing discussions. Our distinguished SC25 Invited Artists will reflect on this year's content and discuss innovative possibilities for inspiring attendees and integrating artistic perspectives into HPC. Whether you're a seasoned expert or new to the field, this session offers a unique opportunity to influence the future of the Art of HPC at SC and beyond.
Birds of a Feather
Artificial Intelligence & Machine Learning
Livestreamed
Recorded
TP
XO/EX
DescriptionModern HPC systems generate massive amounts of monitoring and performance data daily, making manual analysis increasingly impractical. AI and machine learning are emerging as powerful tools to extract insights, detect anomalies, and optimize workload and resource behavior. This BoF brings together experts from HPC, AI, and data science to share current practices, challenges, and emerging solutions in the field. The session aims to foster collaboration and highlight real-world applications of AI/ML for improving system efficiency, reliability, and user understanding in large-scale computing environments.
Workshop
Livestreamed
Recorded
TP
W
DescriptionSparse tensor computations suffer from irregular memory access patterns that degrade cache performance. While software prefetching can mitigate this, existing compiler approaches lack the semantic insight needed for effective optimization. We present ASaP, an automatic software prefetching framework integrated within MLIR's sparse tensor dialect. By leveraging semantic information—tensor formats and loop structure—available during sparsification, ASaP determines accurate buffer bounds and injects prefetches in both innermost and outer loops, achieving broader coverage than prior work. Evaluated on SuiteSparse matrices, ASaP demonstrates significant performance gains for unstructured matrices. For SpMV with innermost-loop prefetching, ASaP achieves a 1.38× speedup over Ainsworth & Jones. For SpMM with outer-loop prefetching, ASaP achieves a 1.28× speedup while Ainsworth & Jones fails to generate prefetches. Our experiments reveal that disabling inaccurate hardware prefetchers frees critical resources for software prefetching, suggesting future architectures should expose prefetcher control as an optimization interface.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe introduce ASCRIBE-XR, an immersive software application designed to accelerate the visualization and exploration of 3D dense arrays and mesh files from scientific experiments. Based on Godot and PC-VR technologies, the platform enables users to dynamically load and manipulate scientific records to dive into the structure of their data. The novelty lies in the unique integration at the system level, combining disparate technologies such as VR, HPC, and AI-driven object modeling for scientific visualization. Its integration with HPC resources enables remote processing of large-scale data, with results streamed directly into the VR environment. The program's multi-user capabilities, enabled through WebRTC and MQTT, allow multiple users to share data and visualize together in real time, promoting a more interactive and engaging research experience. We describe the design and implementation of ASCRIBE-XR, highlighting its key features and capabilities. We also include examples of its application and discuss the potential benefits to the scientific community.
Workshop
Livestreamed
Recorded
TP
W
DescriptionHigh-Performance Computing (HPC) systems in the exascale era are increasingly heterogeneous, requiring users to navigate diverse tools, configurations, and best practices. However, essential information is often scattered across fragmented, multimodal documentation, making it difficult and time-consuming to locate. To address this, we present AskHPC, an intelligent question-answering ChatBot that delivers accurate, timely, and accessible information through a unified conversational interface. Built on a curated knowledge base integrating user guides, scheduler manuals, and programming documentation, AskHPC leverages Large Language Models (LLMs) within a Retrieval-Augmented Generation (RAG) framework. It employs two key techniques to improve HPC query responses: a modality-aware document parsing pipeline that preserves multimodal structure, and a dual-context strategy combining retrieved content (e.g., complete code blocks) with LLM-generated semantics. Evaluation, including a real-world user study, shows AskHPC outperforms direct LLM queries and vanilla RAG systems, enhancing user support and accelerating HPC software development.
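The dual-context strategy can be pictured as assembling the prompt from two sources, verbatim documentation chunks and LLM-generated notes about them, before calling the model. Everything below (the tiny in-memory index, the keyword retriever, and the stubbed ask_llm) is a hypothetical stand-in, not AskHPC's actual API.

from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    text: str

# Hypothetical stand-ins for the curated knowledge base and the model endpoint
DOC_CHUNKS = [
    Chunk("sbatch-gpu", "#SBATCH --gres=gpu:1\n#SBATCH --time=02:00:00"),
    Chunk("module-use", "module load cuda/12.4 before building GPU codes"),
]
SEMANTIC_NOTES = {"sbatch-gpu": "How to request a single GPU with a two-hour limit."}

def retrieve(question, k=2):
    # Naive keyword overlap stands in for the real embedding-based retriever
    words = question.lower().split()
    return sorted(DOC_CHUNKS, key=lambda c: -sum(w in c.text.lower() for w in words))[:k]

def ask_llm(prompt):
    return "(model answer would appear here)"   # stand-in for the hosted LLM

def answer(question):
    chunks = retrieve(question)
    notes = [SEMANTIC_NOTES.get(c.id, "") for c in chunks]
    prompt = ("Answer using ONLY the context below.\n\n"
              + "\n\n".join(c.text for c in chunks)           # context 1: verbatim doc blocks
              + "\n\nBackground notes:\n" + "\n".join(notes)  # context 2: generated semantics
              + f"\n\nQuestion: {question}\nAnswer:")
    return ask_llm(prompt)

print(answer("how do I request a gpu in my sbatch script"))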
Workshop
Livestreamed
Recorded
TP
W
DescriptionHigh-fidelity nuclear reactor simulations using Monte Carlo neutron transport face bottlenecks in neutron cross-section lookups. We designed a custom RISC-V accelerator for RSBench, integrated with a RocketCore in Chipyard. The pipelined design employs Humlíček’s 8th-order rational approximation to efficiently compute the Faddeeva function during Doppler broadening of multipole cross-section data, which forms the inner loop of the program, achieving a 17× cycle reduction for Faddeeva computations and 34.2× for the loop compared to RISC-V software, with similar gains over an Intel Core i5 Tiger Lake-U, yielding a 10× wall-clock time improvement. Chipyard’s software-based environment benefits the HPC community by enabling accelerator evaluation. These results highlight RISC-V’s potential for HPC through custom accelerators and code portability. Future work will optimize operating frequency and extend to additional simulation kernels.
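For context, the Faddeeva function w(z) that the accelerator evaluates in its inner loop is available in software via scipy.special.wofz; the sketch below shows the kind of per-pole evaluation a Doppler-broadening loop performs, with made-up pole data and a deliberately schematic argument form rather than RSBench's exact expression.

import numpy as np
from scipy.special import wofz   # Faddeeva function w(z) = exp(-z^2) * erfc(-i z)

energies = np.linspace(1.0, 2.0, 5)           # eV, illustrative grid
poles = np.array([1.2 + 0.01j, 1.7 + 0.02j])  # made-up multipole data
doppler_width = 0.05

for E in energies:
    z = (np.sqrt(E) - poles) / doppler_width  # argument per pole (schematic form)
    contribution = wofz(z).real.sum()         # inner-loop work the accelerator replaces
    print(f"E = {E:.2f} eV -> {contribution:.4f}")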
Workshop
Livestreamed
Recorded
TP
W
DescriptionPage reclamation is critical for system performance. However, understanding and quantifying the inner workings of reclamation policies remains challenging. In this work, we present a comprehensive analysis of Linux's Two-Queue LRU (2QLRU) and Multi-Generational LRU (MGLRU) policies using a lightweight profiler that measures decision overhead and reclamation effectiveness. Across seven diverse applications under various memory constraints, we quantify how MGLRU achieves better reclamation decisions despite its higher decision-making cost, while its performance benefits diminish under severe memory constraints. These insights provide guidance for future memory management optimizations.
Paper
BSP
Performance Measurement, Modeling, & Tools
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionNetwork simulators play a crucial role in evaluating the performance of large-scale systems. However, most existing simulators rely heavily on synthetic microbenchmarks or narrowly focus on a specific domain. In this paper, we introduce ATLAHS, a flexible, extensible, and open-source toolchain designed to trace real-world applications and accurately simulate their network behavior. ATLAHS leverages the GOAL format to efficiently model communication and computation patterns in AI, HPC, and distributed storage applications. It supports multiple network simulation backends and natively handles multi-job and multi-tenant scenarios. Through extensive validation, we demonstrate that ATLAHS achieves high accuracy in simulating real application workloads (consistently less than 5% error), while significantly outperforming AstraSim, the current state-of-the-art AI systems simulator. We further illustrate ATLAHS's utility via case studies, highlighting the impact of congestion control algorithms on distributed storage performance, as well as the influence of job-placement strategies on application performance within computing clusters.
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionA participant interacting with ATLAS in silico on the Varrier™ 60 LCD tile, semi-circular, 100-million pixel autostereographic display located at the UC San Diego Calit2 Immersive Visualization Laboratory. The interactive virtual environment is created utilizing the complete first release of the Global Ocean Sampling Expedition (GOS) oceanic microorganism metagenomics dataset collected by the J. Craig Venter Institute and computed on the CAMERA (Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis) HPC cluster. Participants are immersed in a dream-like, highly abstract, data-driven virtual world instantiated as a scalable metadata environment populated with meta shape grammar objects and associated scalable auditory data signatures. Sponsors include: NSF IIS 084103, CAMERA, Calit2, CPNAS/NAKFI, Da-Lite, NVIDIA, Mechdyne/VRco, Meyer Sound. Video available at: https://vimeo.com/xrezlab/atlasinsilico?share=copy
Paper
Post-Moore Computing
Quantum Computing
Livestreamed
Recorded
TP
DescriptionClassically simulating quantum systems is challenging, as even noiseless $n$-qubit quantum states scale as $2^n$. The complexity of noisy quantum systems is even greater, requiring $2^n \times 2^n$-dimensional density matrices. Various approximations reduce density matrix overhead, including quantum trajectory-based methods, which instead use an ensemble of $m \ll 2^n$ noisy states. While this method is dramatically more efficient, current implementations use unoptimized sampling, redundant state preparation, and single-shot data collection. In this manuscript, we present the Pre-Trajectory Sampling technique, increasing the efficiency and utility of trajectory simulations by tailoring error types, batching sampling without redundant computation, and collecting error information. We demonstrate the effectiveness of our method with both a mature statevector simulation of a 35-qubit quantum error-correction code and a preliminary tensor network simulation of 85 qubits, yielding speedups of up to $10^6$x and $16$x, as well as generating massive datasets of one trillion and one million shots, respectively.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe Aurora exascale system is the latest supercomputer deployed at the Argonne Leadership Computing Facility (ALCF). Successfully deploying a leadership class system is the result of years of effort both by the facility and the vendor. This extensive collaboration culminates with the successful completion of acceptance testing, a necessary step to prepare the system for general access, ensuring that the system is stable, accurate, and performant for scientific discovery.
The Aurora acceptance test process mimicked real-world utilization of the system, stressed the entire system as well as individual components, and tracked the regressions that occurred. The open-source acceptance test harness from the previously deployed ALCF system was extended for Aurora. This work describes the harness, its components, and its extensions. In addition, we discuss our experiences expanding the harness to support additional testing modes, highlighting the challenges encountered, lessons learned, and desires for future enhancement.
Workshop
Livestreamed
Recorded
TP
W
DescriptionAs scientific knowledge grows at an unprecedented pace, evaluation benchmarks must evolve to reflect new discoveries and ensure language models are tested on current, diverse literature. We propose a scalable, modular framework for generating multiple-choice question-answering (MCQA) benchmarks directly from large corpora of scientific papers. Our pipeline automates every stage of MCQA creation, including PDF parsing, semantic chunking, question generation, and model evaluation. As a case study, we generate more than 16,000 MCQs from 22,000 open-access articles in radiation and cancer biology. We then evaluate a suite of small language models (1.1B–14B parameters) on these questions, comparing baseline accuracy with retrieval-augmented generation (RAG) from paper-derived semantic chunks and from reasoning traces distilled from GPT-4.1. We find that reasoning-trace retrieval consistently improves performance on both synthetic and expert-annotated benchmarks, enabling several small models to surpass GPT-4 on the 2023 Astro Radiation and Cancer Biology exam.
Paper
HPC for Machine Learning
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionThe Fourier transform is a ubiquitous mathematical operation used in a multitude of scientific applications. Most distributed Fourier transform libraries provide rigid implementations that force developers of high performance applications to mold their code around the Fourier computation, forgoing opportunities to minimize communication across the Fourier transforms and the surrounding computation. In this work, we introduce a new automatic approach to generate distributed mappings for multi-dimensional Fourier operations, offering a solution to this problem. Our approach decides how to decompose, map, and schedule the computation as smaller and lower-dimensional parallel operations. We design and implement a novel non-linear iterative formulation that optimizes across Fourier and linear algebra operations. Our scheme leverages the Z3 SMT solver to minimize the number of communication steps across key MPI collectives while selecting the grid shape. We evaluate the effectiveness of our new scheme and demonstrate 2x-31x speedups over coupled heFFTe and COSMA solutions.
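The role of Z3 can be pictured with a toy model that picks a 2D process grid whose dimensions divide the FFT sizes while minimizing a crude communication proxy; the constraint set and cost function below are far simpler than the paper's actual formulation and are only meant to show the Optimize API in use.

from z3 import Int, Optimize, sat

P = 64                      # total MPI ranks
nx, ny = 1024, 768          # global FFT dimensions (illustrative)

p, q = Int("p"), Int("q")
opt = Optimize()
opt.add(p >= 1, q >= 1, p * q == P)
opt.add(nx % p == 0, ny % q == 0)    # each dimension must split evenly

# Toy communication proxy: total surface of the per-rank pencils
cost = (nx / p) * q + (ny / q) * p
opt.minimize(cost)

if opt.check() == sat:
    m = opt.model()
    print("grid:", m[p].as_long(), "x", m[q].as_long())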
Birds of a Feather
Community Meetings
Livestreamed
Recorded
TP
XO/EX
DescriptionThis BoF will explore accelerating science through a network of autonomous labs driven by intelligent agents, high performance computing, and interoperable infrastructure. Aligned with SC25’s HPC Ignites theme, we will discuss integrating HPC, AI, and lab automation to enable end‑to‑end orchestration of distributed experiments. Topics include cross‑institutional agent coordination, data management, reproducibility, and infrastructure and policy adaptations required to support this emerging ecosystem. Through interactive discussion, participants will collaboratively define community priorities and roadmap next steps for harnessing intelligent HPC‑powered labs to ignite scientific breakthroughs.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionModern high performance computing increasingly relies on sophisticated graph-based models to represent and manipulate symbolic data. From bioinformatics and cyber security to AI model inference and text analytics, these applications often use directed graphs to capture complex dependencies and transitions between states. However, as datasets and patterns grow in complexity and size, graph representations—composed of nodes and edges—also expand dramatically, resulting in excessive memory usage, power consumption, routing congestion, and inefficiencies on hardware acceleration platforms such as field-programmable gate arrays (FPGAs).
Paper
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
Livestreamed
Recorded
TP
DescriptionTo reduce computational and memory costs in large language models (LLMs), dynamic workload reduction schemes like Mixture of Experts (MoEs), parameter pruning, layer freezing, sparse attention, early token exit, and Mixture of Depths (MoDs) have emerged. However, these methods introduce severe workload imbalances, limiting their practicality for large-scale distributed training. We propose DynMo, an autonomous dynamic load-balancing solution that ensures optimal compute distribution when using pipeline parallelism in training dynamic models. DynMo adaptively balances workloads, dynamically packs tasks into fewer workers to free idle resources, and supports both multi-GPU single-node and multi-node systems. Compared to static training methods (Megatron-LM, DeepSpeed), DynMo accelerates training by up to 1.23x (MoEs), 3.18x (pruning), 2.23x (layer freezing), 4.02x (sparse attention), 4.52x (early exit), and 1.17x (MoDs).
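The "pack tasks into fewer workers" idea can be illustrated with a generic first-fit-decreasing bin pack; this is a sketch of the general mechanism under assumed stage costs, not DynMo's actual balancing algorithm.

```python
# Sketch of packing imbalanced pipeline stages onto fewer workers:
# a first-fit-decreasing bin pack under a per-worker load budget.
# Generic illustration only; not DynMo's algorithm.

def pack_stages(stage_costs, capacity):
    """Assign stages (by cost) to as few workers as possible, first-fit-decreasing."""
    workers = []   # each worker: {"load": float, "stages": [stage ids]}
    for sid in sorted(range(len(stage_costs)), key=lambda i: -stage_costs[i]):
        cost = stage_costs[sid]
        for w in workers:
            if w["load"] + cost <= capacity:
                w["load"] += cost
                w["stages"].append(sid)
                break
        else:
            workers.append({"load": cost, "stages": [sid]})
    return workers

print(pack_stages([4.0, 1.0, 2.5, 0.5, 3.0], capacity=5.0))
```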
Workshop
Partially Livestreamed
Partially Recorded
TP
W
DescriptionDecades of conflict and poverty have devastated Afghanistan’s education system, leaving curricula outdated, education quality poor, and human resources scarce. Although Generative Artificial Intelligence (GenAI) learning platforms have seen widespread global adoption, Afghan learners remain largely excluded due to language barriers, limited awareness, low digital literacy, lack of trust, and financial constraints, a situation worsened by the nationwide education ban on girls and women. To address this gap, we introduce Bamaa, a platform that supports students in Afghanistan in improving their problem-solving skills in STEM subjects. The platform supports Dari and Pashto (Afghanistan’s native languages), enabling an inclusive, personalized learning experience with individual feedback and recommendations for students navigating educational barriers caused by years of political instability, poverty, and the restrictive policies imposed by the current regime on women.
Workshop
Livestreamed
Recorded
TP
W
DescriptionMixture of Experts (MoE) models face computational bottlenecks on wafer-scale processors due to conflicting batch size requirements: attention mechanisms need smaller batches for memory constraints, while routable MLP layers require larger batches for optimal compute density.
We introduce Batch Tiling on Attention (BTA), which decouples batch processing across MoE computation stages by applying dynamic tiling on attention's batch dimension. Our method processes attention operations at reduced batch size $B$ through tiled computation, then concatenates outputs to form larger batch size $\widetilde{B} = G \cdot B$ for MLP operations, where $G$ is a positive integer. This addresses attention memory limitations while maximizing hardware utilization in expert layers.
We demonstrate BTA's effectiveness on Cerebras wafer-scale engines using Qwen3-like models, achieving up to 5$\times$ performance improvements at higher sparsity levels compared to conventional uniform batching. Unlike existing GPU-focused solutions like FlashAttention and expert parallelism, BTA specifically targets wafer-scale processors' unique computational characteristics.
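A toy NumPy sketch of the batch-tiling idea follows: attention runs on G small tiles of batch B, and the outputs are concatenated into a batch of G·B before the expert MLP. The layer functions and shapes are stand-ins, not the Cerebras kernels.

```python
# Toy illustration of batch tiling: run attention on G tiles of batch B,
# then concatenate to batch G*B for the routed MLP. Placeholder layers only.
import numpy as np

def attention(x):           # placeholder attention block, shape-preserving
    return x

def expert_mlp(x):          # placeholder routed-MLP block, shape-preserving
    return x

G, B, S, D = 4, 2, 128, 64                 # tiles, per-tile batch, seq len, hidden dim
tiles = [np.random.rand(B, S, D) for _ in range(G)]
attn_out = [attention(t) for t in tiles]   # attention at the small batch size B
mlp_in = np.concatenate(attn_out, axis=0)  # batch becomes G*B = 8 for the MLP
y = expert_mlp(mlp_in)
print(y.shape)                             # (8, 128, 64)
```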
Workshop
Livestreamed
Recorded
TP
W
DescriptionPredictive digital twins are poised to make an impact in the burgeoning field of precision oncology by coupling mathematical and computational models with patient-specific data. The inherent inter-patient heterogeneity in cancer physiology and response to therapy hinders the development of population-level therapies that are effective for individual patients. While it is not feasible to perform multiple in vivo trials on an individual patient, computational models augment the traditional approach by enabling in silico assessment of potential interventions. A digital twin deployed in the clinic would calibrate models of disease progression with patient-specific data, predict patient outcomes, and inform treatment strategies, thereby tailoring care to the individual. Realizing such a digital twin will require scalable and efficient methods to integrate patient data with computational models. We develop an end-to-end framework that combines longitudinal magnetic resonance imaging (MRI) with mechanistic models of disease progression.
Paper
Energy Efficiency
Performance Measurement, Modeling, & Tools
Power Use Monitoring & Optimization
State of the Practice
Livestreamed
Recorded
TP
DescriptionAs advances in energy-efficiency become the primary limiter to increases in power-constrained supercomputing and machine learning performance, it is imperative that developers, architects, and practitioners understand how modern GPUs consume energy when running HPC and ML applications.
Rather than relying on opaque, coarse-grained metrics, in this paper we develop an extensible, microbenchmark-parameterized energy model that is capable not only of attributing application energy by functional unit (FPU, tensor core, integer ALU) and memory level (L1, L2, HBM), but also of differentiating control energy from datapath energy.
We examine trends in energy per operation among four generations of GPUs and validate our results using supercomputing and ML/AI procurement workloads. Our insights and extrapolations can be used to drive the future of CMOS and memory technologies, computer architecture research, algorithmic innovation, optimizations for power-constrained and mobile environments, and data center operations.
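A schematic of the energy-attribution style described here: total energy as a sum of per-operation energies by functional unit and memory level, plus a control share. The per-op values below are made-up placeholders, not the paper's measurements.

```python
# Schematic energy-attribution model: energy = sum over op classes of
# (count * energy-per-op), plus control energy as a ratio of datapath energy.
# The per-op figures are illustrative placeholders, not measured values.

ENERGY_PJ_PER_OP = {          # picojoules per operation (illustrative numbers only)
    "fp64_fma": 20.0,
    "tensor_core_mma": 5.0,
    "int_alu": 2.0,
    "l1_access": 10.0,
    "l2_access": 40.0,
    "hbm_access": 400.0,
}

def attribute_energy(op_counts, control_per_datapath=0.4):
    """Datapath energy per op class, plus control energy as a ratio of datapath."""
    datapath = {k: n * ENERGY_PJ_PER_OP[k] for k, n in op_counts.items()}
    control = control_per_datapath * sum(datapath.values())
    return datapath, control

dp, ctrl = attribute_energy({"fp64_fma": 1e9, "hbm_access": 1e7})
print(sum(dp.values()) + ctrl, "pJ total")
```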
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe Cerebras CS-2 system is attracting attention for its use in scientific applications. The Cerebras CS-2 comes with the world's largest chip, the Wafer-Scale Engine 2 (WSE-2). The WSE-2 has new characteristics that distinguish it from other computers, such as a massive number of small processing elements, a low-latency 2D mesh topology, and a unique distributed memory architecture. By understanding this unique architecture, scientific applications can be accelerated significantly. However, its sustained performance and characteristics have not yet been fully understood. In this study, we present a benchmark study focusing on the WSE-2. The objective is to examine the various performance characteristics of the WSE-2 in detail, including inter-PE communication. In this paper, benchmarks of the effective computational performance and memory bandwidth are conducted, the Byte/Flop value is calculated, and a roofline model for the WSE-2 is built. Additionally, the effective communication latency between two distinct PEs and the bisection bandwidth are measured.
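For reference, the roofline relationship used in such studies is simply the minimum of peak compute and bandwidth-limited performance; the sketch below uses placeholder numbers, not the paper's measured WSE-2 figures.

```python
# Minimal roofline evaluation: attainable performance is the minimum of peak
# compute and (arithmetic intensity x memory bandwidth). Placeholder numbers.

def roofline(peak_flops, mem_bw_bytes, arithmetic_intensity):
    """Attainable FLOP/s for a kernel with the given FLOP/byte intensity."""
    return min(peak_flops, arithmetic_intensity * mem_bw_bytes)

peak = 1.0e15            # FLOP/s (illustrative)
bw = 2.0e13              # B/s   (illustrative)
for ai in (0.1, 1.0, 10.0, 100.0):       # FLOP per byte
    print(f"AI={ai:6.1f} -> {roofline(peak, bw, ai):.2e} FLOP/s")
```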
Workshop
Livestreamed
Recorded
TP
W
DescriptionScientific error-bounded lossy compressors are widely used to reduce storage and I/O costs in large-scale scientific computing tasks. It is critical to benchmark those compressors to help users understand their performance. Nevertheless, when evaluating the decompressed data quality, existing benchmarks mainly focus on error-value-based data quality metrics such as PSNR, while correlational metrics such as SSIM, Error Autocorrelation, and the Pearson Coefficient are also important. We benchmark seven compressors on six representative scientific datasets, evaluating diverse data quality metrics such as PSNR, SSIM, Error Autocorrelation, and Pearson Coefficient. Our results show that each existing compressor exhibits divergent performance across different metrics, and no single compressor is advantageous on all metrics for one dataset or across all datasets for one metric. Comparing the performance of compressors on different quality metrics, we deliver important takeaways and suggestions on how to select scientific error-bounded lossy compressors based on user requirements.
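The error-based and correlational metrics named here are straightforward to compute; the NumPy sketch below shows PSNR, the Pearson coefficient, and a lag-1 error autocorrelation (SSIM is usually taken from a library such as scikit-image and is omitted).

```python
# NumPy sketches of the quality metrics named above: PSNR, Pearson correlation,
# and lag-1 error autocorrelation. SSIM is typically computed with a library
# such as skimage.metrics and is not reimplemented here.
import numpy as np

def psnr(orig, recon):
    mse = np.mean((orig - recon) ** 2)
    return np.inf if mse == 0 else 10 * np.log10(np.ptp(orig) ** 2 / mse)

def pearson(orig, recon):
    return np.corrcoef(orig.ravel(), recon.ravel())[0, 1]

def error_autocorr_lag1(orig, recon):
    e = (orig - recon).ravel()
    e = e - e.mean()
    return np.dot(e[:-1], e[1:]) / np.dot(e, e)

a = np.random.rand(64, 64)
b = a + np.random.normal(0, 0.01, a.shape)   # stand-in for decompressed data
print(psnr(a, b), pearson(a, b), error_autocorr_lag1(a, b))
```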
Workshop
Partially Livestreamed
Partially Recorded
TP
W
Tutorial
Livestreamed
Recorded
TUT
DescriptionProducing scientific software is a challenge. The high-performance modeling and simulation community, in particular, faces the confluence of disruptive changes in computing architectures and new opportunities (and demands) for greatly improved simulation capabilities, especially through coupling physics and scales. Simultaneously, computational science and engineering (CSE), as well as other areas of science, are experiencing an increasing focus on scientific reproducibility and software quality. Large language models (LLMs) can significantly increase developer productivity through judicious offloading of tasks. However, models can hallucinate; it is therefore important to have a sound methodology for getting the most benefit out of this approach.
In this tutorial, attendees will learn about practices, processes, and tools to improve the productivity of those who develop CSE software, increase the sustainability of software artifacts, and enhance trustworthiness in their use. We will focus on aspects of scientific software development that are not adequately addressed by resources developed for industrial software engineering, offering a strategy for the responsible use of LLMs to enhance developer productivity in the context of scientific software development, incorporating testing strategies for the generated code, and discussing reproducibility considerations in the development and use of scientific software.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionThe readiness of new 400Gbps Ethernet hardware was evaluated for potential production use in high performance computing (HPC) environments over a local area network (LAN) and a wide area network (WAN). The approach explored a range of data movement strategies, including parallelized transfer tools, in which warp-speed data transfer (WDT) yielded optimal results. Furthermore, communication protocols were tested, such as Transmission Control Protocol (TCP) and remote direct memory access (RDMA) over converged Ethernet (RoCE). Performance tests were conducted on bandwidth and latency to understand potential bottlenecks. Stress tests were run in Message Passing Interface (MPI) and other HPC-relevant environments. The research examined whether a 400Gbps pipeline can be saturated using current tools and methods, both locally and across geographically distributed environments. The findings provided recommendations for enhancing high-throughput data workflows in HPC settings.
Workshop
Livestreamed
Recorded
TP
W
DescriptionMultimodal large language models (MLLMs) are now widely used across many applications, including scientific question answering that requires combining visual and textual inputs. However, existing benchmarks in this area are mostly end-to-end, making it difficult to pinpoint where models fail. To address this gap, we design an evaluation framework that decomposes scientific question answering into subtasks for fine-grained assessment. We evaluate two MLLMs, Gemini 2.5 Pro and Qwen2.5-VL-32B-Instruct, on questions involving high-resolution visual data. Results show that accurate answers are unattainable without scripting or tool use. Although both models can solve individual subtasks, such as mapping cities to coordinates or computing pixel positions, they often fail to integrate these abilities in end-to-end reasoning, producing large deviations. Our findings highlight the importance of benchmarks that expose reasoning bottlenecks and suggest that agent-based or multi-model approaches may be required to achieve reliable performance on complex scientific tasks.
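One of the named subtasks, computing pixel positions from geographic coordinates, reduces to a small projection helper; the equirectangular mapping below is an assumption for illustration, since the benchmark's imagery and projection are not specified here.

```python
# Small helper of the kind such subtasks require: map latitude/longitude to a
# pixel position in an equirectangular (plate carree) image. The projection
# choice is an assumption; the benchmark's actual imagery may differ.

def latlon_to_pixel(lat, lon, width, height):
    """lat in [-90, 90], lon in [-180, 180] -> (x, y) with (0, 0) at top-left."""
    x = (lon + 180.0) / 360.0 * width
    y = (90.0 - lat) / 180.0 * height
    return x, y

print(latlon_to_pixel(48.8566, 2.3522, width=3600, height=1800))  # Paris
```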
Workshop
Livestreamed
Recorded
TP
W
DescriptionProxy applications are targeted submodels of larger parent applications, designed to represent key characteristics such as the programming model, memory usage, or communication behaviors.
Proxy applications are valuable for system design and optimization, offering a more manageable and privacy-preserving alternative to analyzing the parent application, as long as they accurately capture the core characteristics.
Creating and maintaining this proxy-parent relationship has been challenging without a consistent means of quantifying proxy fidelity. We address this challenge with Calder, a robust algorithmic toolkit to characterize similarities in HPC application behavior.
Calder leverages similarity algorithms to compare application behaviors and uses Laplacian scores with a correlation filter to identify the most important application features.
We validate Calder's similarity results using proxy applications for kernel behavior, network counters, and cross-platform performance. Using the selected features, we show that 75% of the proxies in our suite demonstrate highly convergent behavior with their parents.
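As a rough illustration of the Laplacian-score criterion mentioned above, the sketch below ranks features by how well they respect a k-nearest-neighbor similarity graph over sample runs; the graph construction and toy data are stand-ins, not Calder's implementation.

```python
# Sketch of the classic Laplacian-score feature ranking: lower score means the
# feature better respects the similarity graph over samples. Toy data only.
import numpy as np

def laplacian_scores(X, k=5, sigma=1.0):
    """X: (n_samples, n_features). Returns one score per feature (lower = better)."""
    n = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise squared distances
    S = np.exp(-d2 / sigma)
    idx = np.argsort(d2, axis=1)[:, 1:k + 1]                     # k nearest neighbors per sample
    mask = np.zeros_like(S, dtype=bool)
    mask[np.repeat(np.arange(n), k), idx.ravel()] = True
    S = np.where(mask | mask.T, S, 0.0)                          # sparse symmetric similarity graph
    D = np.diag(S.sum(axis=1))
    L = D - S
    d = S.sum(axis=1)
    scores = []
    for f in X.T:                                                # one score per feature column
        f_t = f - (f @ d) / d.sum()                              # remove degree-weighted mean
        scores.append(float((f_t @ L @ f_t) / (f_t @ D @ f_t)))
    return np.array(scores)

X = np.random.rand(30, 6)                                        # e.g., 30 runs x 6 counters
print(np.argsort(laplacian_scores(X)))                           # features ranked best-first
```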
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionThis talk highlights ACE-Mali’s journey to establish HPC infrastructure for infectious disease research and training in West Africa, with lessons learned for sustainable, inclusive, global HPC capacity building.
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionAI is driving today’s power electronics to be faster, smarter, and more connected than ever. With that computational intensity comes higher thermal loads, so advanced cooling systems are needed to ensure optimal performance and reliability. Adding to the challenge is edge computing, which brings servers closer to people. So, how do you balance the demand for powerful and efficient cooling with the need to minimize noise?
In this session, we’ll explore the basics of fan noise, show how the decibel level alone is an incomplete measure of the overall noise profile, and reveal the biggest factor when it comes to how we perceive sound. We’ll also share how ebm-papst is addressing this issue with their latest noise mitigation strategies.
Exhibitor Forum
Data Analytics
Livestreamed
Recorded
TP
XO/EX
DescriptionTraditional HPC and AI accelerators are limited by rigid architectures and closed ecosystems. NextSilicon's Maverick-2 Intelligent Compute Architecture (ICA) offers a new path forward with its software-defined architecture that dynamically optimizes execution without requiring code rewrites. Maverick-2 is a drop-in replacement for existing architectures, supporting C/C++, FORTRAN, OpenMP, Kokkos, and more. This session will detail how NextSilicon’s self-optimizing, software-defined data flow architecture will address the bottlenecks of traditional approaches. We will share key results and benchmarks, and discuss our ongoing collaborations with U.S. DOE lab and industry customers. A brief Q&A session will follow the presentation.
Paper
Architectures & Networks
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionCommunication locality plays a key role in the performance of collective operations on large HPC systems, especially on oversubscribed networks where groups of nodes are fully connected internally but sparsely linked through global connections. We present Bine (binomial negabinary) trees, a family of collective algorithms that improve communication locality.
Bine trees maintain the generality of binomial trees and butterflies while cutting global-link traffic by up to 33%. We implement eight Bine-based collectives and evaluate them on four large-scale supercomputers with Dragonfly, Dragonfly+, oversubscribed fat-tree, and torus topologies, achieving up to 5x speedups and consistent reductions in global-link traffic across different vector sizes and node counts.
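For context, the classic binomial-tree reduction schedule that such trees generalize can be written in a few lines; this is the textbook baseline, not the paper's locality-aware Bine variant.

```python
# Textbook binomial-tree reduction schedule: in round k, rank r sends to
# r - 2^k when r is an odd multiple of 2^k. Baseline illustration only.

def binomial_reduce_schedule(nranks):
    """Return a list of rounds; each round is a list of (src, dst) messages toward rank 0."""
    rounds, k = [], 0
    while (1 << k) < nranks:
        step = 1 << k
        rounds.append([(r, r - step) for r in range(step, nranks, 2 * step)])
        k += 1
    return rounds

print(binomial_reduce_schedule(8))
# [[(1, 0), (3, 2), (5, 4), (7, 6)], [(2, 0), (6, 4)], [(4, 0)]]
```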
Workshop
Livestreamed
Recorded
TP
W
DescriptionBiological research requires diverse reasoning modes, from phylogenetic analysis to mechanistic understanding, each demanding specific methods and data types. Current AI systems typically employ single methodologies, limiting effectiveness in complex biological domains. We present BioR5, a three-layer architecture implementing eleven distinct biological reasoning modes with intelligent triage and specialized tool integration. Layer A provides parametric memory via large language models, Layer B incorporates specialized foundation models for multimodal data, and Layer C connects external databases and computational tools. Our system features an intelligent reasoning mode selection combining keyword matching with LLM analysis to choose appropriate strategies automatically. We demonstrate the framework through toxicology specialization, integrating TX-Gemma predictions with PubChem, ToxCast, and ChEMBL data. The open-source implementation supports dynamic registration of new reasoning modes and tools, enabling collaborative development and community-driven expansion. BioR5 represents an architecture-first approach to developing reasoning-mode-aware AI systems that scale easily with new biological use cases.
Paper
Applications
Livestreamed
Recorded
TP
DescriptionThe Basic Local Alignment Search Tool (BLAST), often referred to as the Google of biological research, is widely used to query a large database to find homologous sequences. Though there have been attempts to accelerate protein BLAST on GPUs, they remain slower than multi-threaded implementations. In this paper, we introduce BLAZE, a GPU-accelerated drop-in replacement for protein BLAST that produces identical results while achieving speedups over multi-threaded and GPU-accelerated implementations.
BLAZE's three key innovations include: (1) the use of hybrid (fine-grained + coarse-grained) parallelism, (2) the use of size-customized kernels, unlike previous "one-size-fits-all" approaches, and (3) the use of common-case GPU optimizations that are difficult to support in the general case. On an 8-core system with an NVIDIA RTX 3080 GPU on the 266 GB nr database, BLAZE achieves 18.2x speedup over single-threaded BLASTP, 4.8x speedup over previous GPU-accelerated baselines, and 1.9x speedup over a 16-way multithreaded BLASTP, on average.
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionThe video shows blood flow in the vasculature of zebrafish. Nanoparticles (blue), red blood cells (red), and the vasculature are visualized.
Paper
Architectures & Networks
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionMany inference systems leverage spatial multiplexing technologies, such as Multi-Process Service (MPS) and Multi-Instance GPU (MIG), to serve deep learning models concurrently on a single GPU. However, existing solutions suffer from interference under MPS and rigid partition sizes in MIG. To address these limitations, we propose BOER, a system that combines MPS atop MIG partitions to reduce interference and enhance GPU utilization.
BOER identifies key challenges in integrating MPS with MIG and introduces a hierarchical scheduling framework that jointly determines model colocation, workload distribution, MIG partitioning, and MPS configurations, while minimizing resource fragmentation and MIG reconfiguration overhead. Since MPS interference is difficult to predict accurately, BOER avoids performance models and instead employs a Bayesian optimization with tailored acceleration strategies to efficiently explore the MPS configuration space. Evaluation on a real testbed demonstrates that BOER outperforms state-of-the-art spatial multiplexing solutions, improving inference throughput by up to 46.04%–77.19% while preserving quality-of-service requirements.
Paper
Data Analytics, Visualization & Storage
Livestreamed
Recorded
TP
DescriptionAs high-performance computing architectures evolve, more scientific computing workflows are being deployed on advanced computing platforms such as GPUs. These workflows can produce raw data at extremely high throughputs, requiring urgent high-ratio and low-latency error-bounded data compression solutions. In this paper, we propose cuSZ-Hi, an optimized high-ratio GPU-based scientific error-bounded lossy compressor with a flexible, domain-irrelevant, and fully open-source framework design. Our novel contributions are: 1) we maximally optimize the parallelized interpolation-based data prediction scheme on GPUs to make it adaptive to diverse data characteristics; 2) we thoroughly explore and investigate lossless data encoding techniques, then craft and incorporate the best-fit lossless encoding pipelines for maximizing the compression ratio of cuSZ-Hi; 3) we systematically evaluate cuSZ-Hi on benchmarking datasets together with representative baselines. cuSZ-Hi can achieve up to 249% compression ratio improvement under the same error bound and up to 215% compression ratio improvement under the same decompression data PSNR.
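The interpolation-based prediction plus error-bounded quantization at the heart of this compressor family can be sketched in one dimension as below; this shows the general mechanism only, not the cuSZ-Hi GPU kernels.

```python
# Sketch of interpolation-based prediction with error-bounded quantization:
# predict each odd-index point from its even-index neighbors, then quantize the
# prediction error into integer bins of width 2*eb. Illustrative only.
import numpy as np

def predict_quantize(data, eb):
    """Return integer quantization codes for odd-index points (1D, odd length)."""
    pred = 0.5 * (data[:-2:2] + data[2::2])      # linear interpolation of neighbors
    err = data[1:-1:2] - pred
    return np.round(err / (2 * eb)).astype(int)  # decoded value stays within eb of the original

x = np.cumsum(np.random.rand(17))
print(predict_quantize(x, eb=1e-3)[:5])
```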
Paper
State of the Practice
Livestreamed
Recorded
TP
DescriptionTo meet the increasing demands of parallel scientific applications, supercomputers continue to grow in both scale and complexity. The fastest supercomputer in the world, El Capitan, features over a million CPU cores and tens of thousands of GPUs. Applications running on such large-scale systems are particularly susceptible to system noise or interference caused by the operating system (OS) and other services running on the same compute nodes as the application.
In this paper, we address this critical performance and scalability challenge on El Capitan, enabling scientific applications to better leverage the benefits of the world's fastest supercomputer. Our strategy comprises two key components: (1) isolating system services from applications and (2) applying OS-level tuning to maintain minimal application interference. As part of this effort, we provide a distribution-independent tuning guide applicable to any Linux system, and we propose and evaluate general strategies for isolating system processes.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe introduce a communication mechanism bridging accelerators such as GPUs and PCIe-based FPGA devices using Programmed I/O as an alternative to Direct Memory Access data transmissions: a one-way latency of less than 2 microseconds for small message transfers is achieved when the FPGA operates as a Network Interface Card (NIC).
Our prototype employs APEnetX, a custom FPGA-based NIC, and a CPU engine that atomically writes descriptors and payloads directly into the PCIe device Memory Mapped region using AVX-512 instructions. Additionally, a GPU peer-to-peer remapping technique enables the injection of data packets from GPU memory into the NIC Memory Mapped aperture with no DMA-orchestrated data movement by the CPU. Microbenchmarks show lower latency than traditional RDMA for small packets with a simpler software stack. This method is not limited to APEnetX: it applies to any FPGA-based NIC or accelerator exposing a PCIe-mapped control aperture, provided the device can read and transmit data from memory.
Workshop
Performance Evaluation, Scalability, & Portability
Livestreamed
Recorded
TP
W
DescriptionBridging portability and scalability is essential for HPC applications. The Vector Particle-In-Cell (VPIC) code, widely used in plasma physics simulations, historically required extensive platform-specific optimizations to achieve high performance. VPIC 2.0 addresses this challenge by adopting Kokkos for performance portability, enabling it to scale effectively across diverse architectures, including CPUs and GPUs. However, the abstractions introduced by Kokkos can obscure hardware-specific capabilities and introduce performance overhead. In this work, we mitigate these overheads by enhancing vectorization and optimizing memory access patterns through platform-targeted particle sorting in VPIC 2.0. These optimizations enable VPIC 2.0 to match the performance of the highly tuned, hardware-specific VPIC 1.2 on CPUs and to achieve superlinear scaling on GPUs.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe integration of quantum processing units (QPUs) with high-performance computing (HPC) infrastructures represents one of the most significant challenges in quantum computing today. As quantum systems scale beyond the NISQ era toward utility-scale applications, the need for robust, performance-optimized compilation toolchains becomes critical for realizing quantum advantage in real-world scientific computing workflows.
Quantum compilation fundamentally differs from classical compilation in ways that challenge traditional compiler design principles. While classical compilers optimize for performance metrics like instruction throughput and cache locality, quantum compilers must navigate the fragile nature of quantum superposition, where measurement destroys quantum states and decoherence imposes strict timing constraints. Quantum programs exhibit unique characteristics: some of their pieces (the quantum circuits) are inherently reversible, they operate on exponentially large Hilbert spaces, and they require hardware-specific gate decompositions that vary dramatically across QPU architectures.
This talk presents some of the challenges in quantum compilation and CUDA-Q's approach to quantum-classical hybrid compilation through its sophisticated MLIR-based compiler infrastructure that seamlessly integrates quantum kernels with HPC environments. We will talk about how CUDA-Q leverages the Multi-Level Intermediate Representation (MLIR) framework to enable progressive lowering from high-level C++ and Python quantum programs through multiple abstraction layers.
Workshop
Livestreamed
Recorded
TP
W
DescriptionRISC-V ISA-based processors have emerged as powerful, energy-efficient computing platforms, with the Milk-V Pioneer marking the first desktop-grade RISC-V system. Growing interest from academia and industry highlights their potential in high-performance computing (HPC). The open-source, FPGA-accelerated FireSim framework enables flexible architectural exploration using RISC-V cores, but systematic evaluation of its accuracy against real hardware remains limited.
This study models a commercially available single-board computer and a desktop-grade RISC-V CPU in FireSim. Benchmarks under single-core and four-core configurations were used to align simulation parameters with hardware behavior. Using the best-matching configuration, performance was assessed with a representative mini-application and the LAMMPS molecular dynamics code.
Results show that FireSim offers useful insights into architectural performance trends, but runtime discrepancies persist due to simulation limitations and incomplete CPU performance specifications, which constrain precise configuration matching.
Workshop
Livestreamed
Recorded
TP
W
DescriptionEfficient job scheduling in distributed systems faces exponential complexity growth as systems scale. While queue-based methods (e.g., FIFO) generate schedules rapidly but suboptimally, optimization tools achieve higher quality at significant computational cost. We propose a hybrid ant colony optimization (HACO) algorithm bridging this gap. HACO uses queue-based warm-start initialization for pheromone levels, constructs disjunctive graphs modeling precedence and resource constraints, and applies parallel local search on selected subgraphs to escape local optima. Our approach combines the speed of heuristics with optimization quality through strategic pheromone updates and OR-Tools integration. Experimental evaluation on job shop scheduling (JSSP), flexible job shop (FJSP), and synthetic large-scale problems demonstrates 3–5% deviation from optimality with 5–10x speedup over state-of-the-art solvers. Results show consistent performance across varying problem scales, making HACO compelling for large-scale distributed scheduling where computational efficiency is critical.
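The queue-based warm start of pheromone levels can be illustrated generically: edges that appear consecutively in a quick FIFO schedule start with extra pheromone, biasing ants toward a known-feasible solution. This is a sketch of the idea, not the HACO implementation.

```python
# Generic illustration of warm-starting ant-colony pheromones from a quick
# FIFO schedule: ordered job pairs used by the heuristic get boosted pheromone.
import itertools

def warm_start_pheromones(jobs, fifo_order, base=1.0, boost=5.0):
    """Pheromone on ordered pairs (i, j): boosted if j follows i in the FIFO schedule."""
    tau = {pair: base for pair in itertools.permutations(jobs, 2)}
    for i, j in zip(fifo_order, fifo_order[1:]):
        tau[(i, j)] += boost
    return tau

jobs = ["j0", "j1", "j2", "j3"]
print(warm_start_pheromones(jobs, fifo_order=["j2", "j0", "j3", "j1"]))
```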
Paper
State of the Practice
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionBinary package managers install software quickly but limit configurability due to rigid ABI requirements that ensure compatibility between binaries. Source package managers provide flexibility in building software, but compilation can be slow. For example, installing an HPC code with a new MPI implementation typically results in a full rebuild. Spack, a widely deployed, HPC-focused package manager, can use source and pre-compiled binaries, but without a binary compatibility model, it is unable to install binaries not built together. We present "splicing", an extension to Spack that models binary compatibility between packages and allows seamless mixing of source and binary distributions. Splicing augments Spack's packaging language and dependency resolution engine to reuse compatible binaries while maintaining the flexibility of source builds. This extension incurs minimal installation-time overhead, and it allows rapid installation from binaries, even for ABI-sensitive dependencies like MPI that would otherwise require many rebuilds.
Paper
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionThe acceleration of Sparse-dense Matrix Multiplication (SpMM) using Tensor Cores (TCs) in GPUs has recently garnered significant attention. TCs are designed for block-wise matrix multiplication; however, block partitioning of general unstructured sparse matrices often results in low block-level density, causing a substantial waste of computational resources. Sparse Tensor Cores (SpTCs) can mitigate this issue by skipping 50% of zero values; however, SpTCs are limited to strict 2:4 or 1:2 structured sparsity. To bridge this gap, we propose MP-SpMM, a novel matching and padding approach that transforms general sparse matrices into structured sparsity, drawing inspiration from the maximum matching problem in graph theory. Moreover, we introduce a novel storage format and a highly optimized GPU kernel that fully exploits the capabilities of SpTCs. Extensive experiments on modern GPUs demonstrate that MP-SpMM outperforms state-of-the-art SpMM libraries, DTC-SpMM and RoDe, with an average speedup of 2.42x (up to 7.65x) and 1.92x (up to 8.60x).
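What "padding into structured sparsity" means in practice can be shown with a small helper that converts a dense row into the 2:4 format SpTCs expect; groups with more than two nonzeros would first need the matching step the abstract describes.

```python
# Sketch of padding to 2:4 structured sparsity: each group of four columns
# stores exactly two values; groups with fewer nonzeros are padded with
# explicit zeros. Groups with more than two nonzeros need matching/splitting first.
import numpy as np

def to_2_4(row):
    """Compress a dense row (len % 4 == 0, <=2 nonzeros per group of 4) to 2:4 form."""
    values, indices = [], []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        nz = list(np.flatnonzero(group))
        assert len(nz) <= 2, "group needs the matching/splitting step first"
        pad = [c for c in range(4) if c not in nz][: 2 - len(nz)]   # explicit zero slots
        for c in nz + pad:
            values.append(group[c])
            indices.append(g + c)
    return np.array(values), np.array(indices)

row = np.array([0.0, 3.0, 0.0, 0.0,   1.0, 0.0, 2.0, 0.0])
print(to_2_4(row))    # values [3., 0., 1., 2.], column indices [1, 0, 4, 6]
```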
Birds of a Feather
Quantum & Other Post Moore Computing Technologies
Livestreamed
Recorded
TP
XO/EX
DescriptionQuantum-classical hybrid computing is moving from theory to reality, yet no clear roadmap exists for how best to integrate quantum processing units (QPUs) into established HPC environments. In this BoF, we hope to bring together a global community of HPC practitioners, system architects, quantum computing specialists, and workflow researchers, including participants in the Workflow Community Initiative, to assess the state of hybrid integration and identify practical steps toward scalable, impactful deployment.
This BoF will be highly interactive, drawing on the experience and expertise of all participants via a series of parallel breakout sessions focused on four anchor topics:
• Hybrid Applications, Workflows and Use-Cases
• Middleware for Dataflow and Workflow Orchestration
• Software Integration and Performance Engineering
• State of the Industry: Practice and People
Co-led by a diverse and global organizing team representing BCS, DOE, EPCC, Inria, ORNL, NVIDIA, Quantinuum, and RIKEN, our goals are to:
• Share early experiences from hybrid HPC+QC deployments
• Identify best practices across software, hardware, and workflow integration
• Accelerate knowledge-sharing to ensure the community progresses efficiently
• Encourage collaboration and convergence around architectural, workflow and programming models
• Provide a space for peer-to-peer discussion, discovery, and problem-solving across institutions and disciplines
Format:
1) 20-minute introduction to the BoF and the four anchor topics
2) 45-minute breakout group discussions
3) 25-minute regroup to share challenges, learnings and open questions to the full group with time for final thoughts and contributions and commitments for ongoing collaboration and further engagement at future industry gatherings
Workshop
Livestreamed
Recorded
TP
W
DescriptionApplication energy optimization in HPC data centers faces two critical gaps: systematic methodologies that connect data center policies to application decisions, and accessible monitoring tools that enable data-driven optimization. We address both gaps through two complementary pillars. First, we present a methodology based on an extended weighted Energy Delay Product (EDP) to translate data center operational priorities and integrate energy considerations into an energy optimization workflow that spans continuous monitoring through targeted optimization. Second, we present a user-space monitoring tool, Omnistat, that enables this methodology by providing developers with direct access to actionable energy telemetry. Through deployment on the Frontier supercomputer and case studies exploring performance-energy trade-offs, we show how these pillars establish energy as an integral optimization target and make developers active participants in data center efficiency.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionLarge language models (LLMs) have advanced code generation ability across many domains, but often struggle with quantum code due to limited domain-specific data and inherent domain complexity. To address this issue, we focus on the Qiskit framework and fine-tune pretrained LLMs using quantum code from GitHub and datasets including OASST1 and COMMITPACKFT. More importantly, we construct instruction-style prompt/completion pairs based on real-world Qiskit code to improve alignment during fine-tuning. Experiments show that our fine-tuned models significantly improve quantum code generation ability, validating the effectiveness of our approach.
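A minimal example of turning a real Qiskit snippet into an instruction-style prompt/completion pair; the instruction wording and JSON layout are assumptions for illustration, not the authors' exact recipe.

```python
# Sketch of building an instruction-style prompt/completion pair from a Qiskit
# snippet for fine-tuning. The prompt text and JSONL layout are illustrative.
import json

example_code = """from qiskit import QuantumCircuit
qc = QuantumCircuit(2)
qc.h(0)
qc.cx(0, 1)
qc.measure_all()"""

pair = {
    "prompt": "Write Qiskit code that prepares a Bell state on two qubits and measures both.",
    "completion": example_code,
}
print(json.dumps(pair))
```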
Workshop
Livestreamed
Recorded
TP
W
DescriptionBring Your Own Digital Twin is a fast-paced, lightning-talk-style part of the Digital Twins for HPC Workshop. Presenters will discuss their digital twins in a short span of time, covering their intended use cases, functional highlights, and lessons learned from building, deploying, and maintaining a real-world digital twin.
Workshop
Livestreamed
Recorded
TP
W
DescriptionPart 2 of the "Bring Your Own Digital Twin" session. Bring Your Own Digital Twin is a fast-paced, lightning-talk-style part of the Digital Twins for HPC Workshop. Presenters will discuss their digital twins in a short span of time, covering their intended use cases, functional highlights, and lessons learned from building, deploying, and maintaining a real-world digital twin.
Paper
Algorithms
Applications
Data Analytics
Livestreamed
Recorded
TP
DescriptionEvolving graph processing has become a critical component in various applications and is gaining increasing attention. However, existing evolving graph systems suffer from cache contention and workload imbalance between threads, which leads to poor scalability and performance degradation on modern multi-core computers.
In this paper, we introduce Bubble, a high-performance evolving graph processing engine designed with high scalability. By employing a novel graph format based on mini-batch sorting, Bubble utilizes the private caches of modern processor cores, achieving near-linear scalability in graph ingestion, while maintaining high performance for graph analytics. Compared with state-of-the-art systems, including LSGraph, GraphOne, and XPGraph, Bubble achieves 2.46×-8.86× higher throughput in graph ingestion and 0.77×-3.29× speedups when running common graph algorithms.
Workshop
Partially Livestreamed
Partially Recorded
TP
W
DescriptionHigh-performance computing (HPC) is critical to advancing STEM research, yet undergraduate curriculum exposure to HPC remains limited, often only offered in graduate-level courses, if offered at all. The Argonne Introduction to High Performance Computing Bootcamp addresses this need through a one-week, immersive program, introducing participants to core HPC concepts and computational tools. As part of the bootcamp organizing team, the Argonne professional career intern focused on creating a sustainable framework for the bootcamp’s annual delivery by documenting processes, developing reusable templates, and refining evaluation methods. Additional efforts included onboarding peer mentors, ensuring data quality in evaluations, and capturing both quantitative and qualitative measures of the program’s impact. Through these strategies, the bootcamp aims to expand accessibility, improve participant preparedness for HPC enabled research, and cultivate a more diverse and skilled HPC community.
Birds of a Feather
Clouds & Distributed Computing
Livestreamed
Recorded
TP
XO/EX
DescriptionThis BoF is a collaborative discussion on architecting and deploying AI data commons and AI data meshes to support scalable, responsible, and federated AI. Focusing on minimal, interoperable architectures, it aims to empower approaches to building small- to mid-scale AI models, highlight challenges and opportunities in federating public and private data commons, and accelerate community adoption of best practices. Key topics include core services, embedding architectures, secure federation, and agentic orchestration. The session seeks to foster a roadmap for the community, exchange best practices, and explore the potential for establishing a working group to advance AI data infrastructure standards.
Workshop
Education & Workforce Development
Livestreamed
Recorded
TP
W
DescriptionGiven the rapidly changing computing landscape propelled by innovations and the convergence of new cutting-edge technologies such as HPC, AI, cybersecurity, quantum computing, and more, the need to upskill and reskill the workforce to mitigate skills gaps is becoming increasingly important. Furthermore, a triumvirate of user expertise, connections, and communities is required to enable efficient integration of HPC and AI ecosystems. To address the challenges involved in leveraging AI, the National Artificial Intelligence Research Resource (NAIRR) Pilot was launched in 2024. As part of this effort, the NAIRR Pilot User Experience Working Group (UEWG) has conducted various engagement initiatives, such as researcher showcases, pilot industry partner showcases, webinar series, and regional and national workshops. This paper presents a reproducible instructional roadmap based on the observations and results of these training and education efforts that can be used to efficiently train the next-generation workforce in AI and HPC at all levels.
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionThis presentation highlights Mozambique’s HPC ecosystem, managed by MoRENet, showing how high-performance computing supports climate modeling, computational physics, Big Data analysis, and capacity building, empowering scientific research across the country.
Workshop
Livestreamed
Recorded
TP
W
DescriptionFloating-point data is typically compressed at strict error bounds to reduce storage cost while facilitating scientific analyses. Unfortunately, this tends to yield large compressed files. In some cases, however, a user might not need the data at a high fidelity. Progressive compression addresses this issue by refactoring the data into a hierarchical series of increasing fidelity, allowing users to download the data at an initial fidelity and subsequently retrieve higher fidelities. This paper studies a resolution-based progressive compression approach that achieves competitive compression ratios against traditional compression methods. Furthermore, it studies how the progression of resolution affects the quality of the data.
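The layering idea behind resolution-based progression can be sketched in 1D: store a coarse level plus per-level residuals so a reader can stop at any fidelity. Real refactoring schemes are more sophisticated; this only illustrates the hierarchy.

```python
# Toy 1D sketch of resolution-based progressive refactoring: store a coarsest
# level plus per-level residuals; reconstruction may stop after any level.
import numpy as np

def refactor(data, levels=3):
    """Return [coarsest, residual_1, ..., residual_levels] for a 1D array."""
    pieces, current = [], data.astype(float)
    for _ in range(levels):
        coarse = current[::2]                      # drop every other sample
        upsampled = np.repeat(coarse, 2)[: len(current)]
        pieces.append(current - upsampled)         # residual needed for this level
        current = coarse
    pieces.append(current)
    return pieces[::-1]                            # coarsest first

def reconstruct(pieces, upto):
    """Rebuild using the coarsest level plus the first `upto` residuals."""
    x = pieces[0]
    for res in pieces[1:upto + 1]:
        x = np.repeat(x, 2)[: len(res)] + res
    return x

data = np.sin(np.linspace(0, 4, 64))
parts = refactor(data)
print(np.abs(reconstruct(parts, 3) - data).max())  # ~0 when all levels are used
```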
Birds of a Feather
Community Meetings
Livestreamed
Recorded
TP
XO/EX
DescriptionThis BoF convenes global leaders of high performance computing (HPC) communities to identify common challenges and share strategies for building and sustaining regional and national HPC ecosystems. Co-hosted by the UK HPC-SIG and the U.S. CASC, the session builds on a successful ISC2025 BoF and CASC’s 2025 position paper on RCD regional collaboration. Participants will exchange funding and governance models, explore cross-border partnerships, and co-develop ideas for an international HPC Communities Network and shared resource hub. The session fosters peer-to-peer learning and lays the groundwork for a collaborative publication and ongoing global engagement.
Tutorial
Livestreamed
Recorded
TUT
DescriptionAgentic systems, in which autonomous agents collaborate to solve complex problems, are emerging as a transformative methodology in AI. However, adapting agentic architectures to scientific cyberinfrastructure—spanning HPC systems, experimental facilities, and federated data repositories—introduces new technical challenges. In this half-day tutorial, we introduce participants to the design, deployment, and management of scalable agentic systems for scientific discovery. We will present Academy, a Python-based middleware platform built to support agentic workflows across heterogeneous research environments. Participants will learn core agentic system concepts, including asynchronous execution models, stateful agent orchestration, and dynamic resource management. We will explore the design of real-world agentic applications and discuss common patterns for integrating with widely used scientific tools and infrastructure. A guided hands-on session will then help attendees build and launch their own agentic systems. This tutorial is designed for researchers, developers, and cyberinfrastructure professionals interested in advancing AI-driven science with next-generation autonomous systems.
Workshop
Education & Workforce Development
Livestreamed
Recorded
TP
W
DescriptionHigh-performance computing (HPC) is increasingly vital across diverse disciplines, including those historically underrepresented in computational research, such as sociology, psychology, and the arts. To lower barriers to entry, the University of California, Merced (UC Merced) created a 90-minute introductory HPC workshop requiring no prior technical background. The workshop combines a theoretical overview of campus clusters, Linux basics, and HPC concepts with a hands-on session where participants connect via SSH and browser-based tools, load software modules, and submit jobs using Slurm. Offered in a hybrid format with synchronous and asynchronous materials, the program has been delivered over 20 times since 2021, serving primarily students (75.7%), faculty (16.2%), and staff. Post-workshop surveys show 83% of attendees are more likely to use HPC, contributing to a doubling of active campus users. This scalable, inclusive model effectively broadens HPC adoption and fosters computational engagement across disciplines.
Birds of a Feather
Community Meetings
Livestreamed
Recorded
TP
XO/EX
DescriptionEffective outreach is key to growing and diversifying the HPC community by engaging everyone from students and researchers to policymakers and the public. Yet many organizations face challenges developing or sustaining their outreach efforts, especially without dedicated staff or resources. This session aims to ease already heavy workloads by showcasing practical ways to reinvent, reuse, and repurpose existing HPC outreach materials to meet diverse needs. Participants will leave with a ready-to-use strategy for adapting outreach, lowering the barrier to participation and contributing to a growing global knowledge base that supports more inclusive and impactful HPC outreach efforts into the future.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionMars is a leading target for human exploration, yet its weather remains difficult to predict due to phenomena such as global dust storms. While Earth forecasting has advanced through machine learning (ML), Mars lacks comparable systems. This work investigates whether Microsoft’s Aurora, a state-of-the-art Earth climate foundation model, can be adapted for Martian data. Using the EMARS reanalysis, we regridded variables to Aurora’s expected layout and executed inference on the University of Michigan’s Great Lakes supercomputing cluster. We verified Aurora’s Earth pipeline by reproducing ERA5 benchmarks and established a functional pathway for applying Aurora to Mars; however, full predictive accuracy requires Mars-specific surface/static variables and fine-tuning. The poster will present the adaptation pipeline, validation on ERA5, preliminary EMARS runs, and a roadmap to reliable ML-based Mars weather forecasting, emphasizing the role of HPC in data preparation and model execution.
Paper
HPC for Machine Learning
Livestreamed
Recorded
TP
DescriptionExisting methods for training large language models (LLMs) on long-sequence data, such as tensor parallelism and context parallelism, exhibit low model FLOPs utilization (MFU) as sequence lengths and number of GPUs increase, especially when sequence lengths exceed 1M tokens. To address these challenges, we propose BurstEngine, an efficient framework designed to train LLMs on long-sequence data. BurstEngine introduces BurstAttention, an optimized distributed attention with lower communication cost than RingAttention. BurstAttention leverages topology-aware ring communication to fully utilize network bandwidth and incorporates fine-grained communication-computation overlap to minimize communication cost. Furthermore, BurstEngine introduces sequence-level selective checkpointing and fuses the language modeling head with the loss function to reduce memory cost. Additionally, BurstEngine introduces workload balance optimization for various types of attention masking. By integrating these optimizations, BurstEngine achieves a $1.2\times$ speedup with much lower memory overhead than the state-of-the-art baselines when training LLMs on extremely long sequences of over 1M tokens.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionPerformance portability remains a major challenge in high performance computing as applications increasingly target diverse GPU architectures. The C++17 standard introduced stdpar, a high-level parallelism model to simplify parallel programming. NVIDIA extended this model for GPU execution within heterogeneous architectures, followed by an AMD implementation.
We evaluate stdpar for a classical Particle-In-Cell (PIC) method on recent NVIDIA and AMD GPUs, comparing it to Thrust, Kokkos, and SYCL in runtime performance and programming productivity. The PIC implementation is dominated by a projection operator that heavily uses atomic operations. Our analysis covers both overall loop performance and the projection kernel. On NVIDIA GPUs, stdpar processes 1.7× fewer particles than Kokkos and 1.1× fewer than Thrust under equivalent conditions, despite productivity benefits.
This work is ongoing, with further tuning planned. At the poster session, results will be presented with performance charts, kernel breakdowns, and code snippets to illustrate these trade-offs.
Paper
Performance Measurement, Modeling, & Tools
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionPerformance engineering often involves localized, bottleneck-based optimization, supported by a plethora of tools. When no apparent bottlenecks exist, engineers resort to coarser whole-program optimizations such as data layout, sparsity, allocation strategy, and algorithmic modifications. In this work, we aim to codify whole-program optimization by providing three global views based on a single tracing format.
The format, called C.A.T.S., captures information necessary for static and runtime analysis of large applications. Instead of call stacks and function annotations, C.A.T.S. uses control flow stacks and memory events to identify common performance anti-patterns and potential optimizations.
We develop interactive timeline, dataflow, and access visualizations, and implement compiler analysis passes to extract C.A.T.S. traces statically and in seconds on consumer hardware. The visualizations and analyses are demonstrated on case studies including sparse computations, hydrodynamics, and climate modeling, yielding 3x memory footprint reduction, improvements in communication-computation overlap, code fusion, and data layouts.
Workshop
Livestreamed
Recorded
TP
W
DescriptionComplex energy system co-design optimization requires sophisticated computational workflows that orchestrate multiple interdependent components and scale across high-performance computing environments. Traditional approaches rely on specialized, monolithic solutions that limit reusability and scalability when addressing heterogeneous co-design problems. We introduce CAMEO (Co-design Architecture for Multi-objective Energy System Optimization), a modular workflow management framework that abstracts co-design problems as directed acyclic graphs with standardized input-output interfaces. The framework employs JSON-based specifications enabling systematic decomposition of optimization problems into reusable components like data loaders, scenario generators, and optimization solvers. CAMEO's architecture supports multiple optimization paradigms through containerized execution and seamlessly integrates with high-performance computing via Nextflow orchestration. We demonstrate CAMEO's versatility through three use cases: power grid expansion for data center integration (27 problems), optimal battery design for variable generation (3,200 problems), and distribution network generation for Virginia counties (133 problems), showcasing scalable execution across diverse computing environments.
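As a rough illustration of the workflow abstraction described above (a co-design problem expressed as a directed acyclic graph of components driven by a JSON specification), the Python sketch below executes hypothetical components in topological order. The component names, spec fields, and dict-in/dict-out interface are illustrative assumptions, not CAMEO's actual schema or API.

```python
"""Minimal sketch of a DAG-based co-design workflow driven by a JSON spec.
Component names and fields are hypothetical, not CAMEO's actual schema."""
import json
from graphlib import TopologicalSorter

# Hypothetical workflow spec: each component lists its upstream dependencies.
spec = json.loads("""
{
  "components": {
    "data_loader":        {"depends_on": []},
    "scenario_generator": {"depends_on": ["data_loader"]},
    "optimizer":          {"depends_on": ["scenario_generator"]},
    "report":             {"depends_on": ["optimizer"]}
  }
}
""")

def run_component(name, inputs):
    # Placeholder with a standardized dict-in / dict-out interface.
    print(f"running {name} with inputs from {sorted(inputs)}")
    return {f"{name}_output": True}

# Build the dependency graph and execute components in topological order.
graph = {name: set(node["depends_on"])
         for name, node in spec["components"].items()}
results = {}
for name in TopologicalSorter(graph).static_order():
    upstream = {dep: results[dep] for dep in graph[name]}
    results[name] = run_component(name, upstream)
```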
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionFederated learning (FL) has emerged as a promising paradigm for privacy-preserving distributed training. However, its performance is often hindered by communication bottlenecks, especially over long-distance networks. In this work, we investigate the effectiveness of long-haul remote direct memory access (RDMA) as a high-performance communication substrate for FL. We develop a simulation framework that incorporates rate-limiting techniques to emulate wide-area RDMA deployments, enabling accurate comparisons with traditional TCP/IP networks. Through evaluations we demonstrate that long-haul RDMA can reduce communication time by up to 90.79% under WAN-like conditions and decrease total runtime by as much as 85.83%. These results underscore RDMA's promise in accelerating FL across distributed geographic settings.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionLossy compression is widely used to reduce storage costs and I/O demands, especially on SCSI-HDDs. However, its benefits diminish on NVMe-SSDs, where compression and decompression runtimes often exceed raw I/O speed. To address this, we conduct a detailed study of compression runtimes, control methods, and NVMe parameters. We find that existing error-bounded methods fail to optimize compression ratios, and serial pipelines remain inefficient. With the low-level NVMe driver SPDK, we take a systematic approach to evaluating the limitations of implementing lossy-compressed I/O for NVMe-SSDs and present numerous observations that motivate our design.
Panel
Architectures
HPC Education
HPC Software & Runtime Systems
Livestreamed
Recorded
TP
Canceled
DescriptionThe National Supercomputing Mission (NSM) is a visionary initiative of the Government of India, aimed at enhancing the country’s high performance computing (HPC) ecosystem. Led by MeitY and DST, and executed by C-DAC and IISc, NSM focuses on deploying indigenous supercomputers, developing HPC software solutions, and fostering skilled manpower. With the successful deployment of PARAM supercomputers across academic and research institutions, NSM is accelerating advancements in AI, scientific research, and industrial applications. This session will showcase NSM’s impact, future roadmap, and India’s strategic position in the global HPC landscape.
C-DAC has been at the forefront of India’s supercomputing revolution, driving innovations in HPC through the National Supercomputing Mission. From pioneering the PARAM series to developing indigenous HPC software, AI-HPC frameworks, and RISC-V-based computing, C-DAC has significantly contributed to India’s self-reliance in supercomputing. Its advanced HPC infrastructure powers scientific research, industry applications, and national development initiatives.
Panel
AI, Machine Learning, & Deep Learning
Cloud, Data Center, & Distributed Computing
Livestreamed
Recorded
TP
Canceled
DescriptionWhat if your data is at the extreme edge and has an HPC requirement—possibly on the space station, on a lunar colony, or a Mars mission? Panelists from NASA will discuss current and future solutions to this challenging problem, present visually captivating imagery, and provide insight developed over many years while including fresh approaches from new voices. Examples include predicting and navigating Mars atmospheric conditions, ensuring successful entry, descent, and landing, and dodging asteroids along the way. Hear from and engage with experts familiar with how recent and planned missions have expanded our concept of computing at the edge for both space-based and terrestrial challenges.
Paper
Algorithms
Livestreamed
Recorded
TP
DescriptionWe address inefficiencies in task scheduling, memory management, and scalability in GPU-resident sparse LU factorization with a two-level approach of sequentially scheduled coarse-grained blocks containing multiple fine-grained blocks managed with a lightweight static scheduler enabling multi-stream parallelism. Additionally, we design an intelligent memory caching mechanism for the fine-grained scheduler, which retains frequently accessed data in GPU memory. To further enhance scalability, we introduce a distributed memory design that partitions the input matrix using a 1D block-cyclic distribution and optimizes inter-GPU communication via NVLink. The multi-GPU design reaches a computational throughput of 6.46 TFLOP/s on four A100 GPUs, demonstrating promising scalability. This is up to a 7x speedup over the latest SuperLU_DIST with 3D communication, a 94x speedup over PanguLU, a 16x speedup over PasTiX, and a 10x speedup over our own coarse-grained dynamic scheduling implementation, while reaching up to 21% of the A100’s theoretical peak performance.
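As a minimal sketch of the 1D block-cyclic partitioning mentioned above, the Python snippet below assigns block columns of a matrix to GPUs in round-robin fashion. The block count and GPU count are illustrative placeholders, not the paper's settings or implementation.

```python
"""Sketch of a 1D block-cyclic assignment of block columns to GPUs.
Block count and GPU count are illustrative, not the paper's settings."""

def block_cyclic_owner(block_col, num_gpus):
    # In a 1D block-cyclic layout, block column j is owned by GPU j mod P.
    return block_col % num_gpus

def partition(num_block_cols, num_gpus):
    owners = {g: [] for g in range(num_gpus)}
    for j in range(num_block_cols):
        owners[block_cyclic_owner(j, num_gpus)].append(j)
    return owners

if __name__ == "__main__":
    # 12 block columns distributed over 4 GPUs:
    # GPU 0 owns columns 0, 4, 8; GPU 1 owns 1, 5, 9; and so on.
    print(partition(12, 4))
```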
Workshop
Livestreamed
Recorded
TP
W
DescriptionFar memory tiers improve memory utilization by enabling memory-intensive applications to use idle memory from other machines over the network. Recently, compiler approaches to far memory have demonstrated how static analysis can be leveraged to automatically transform applications to make efficient use of remote memory tiers. However, policies in these compilers, e.g., the determination of whether objects should be remoted, prefetched, or evacuated, are made conservatively at compile time or require profiling. While profiling can alleviate conservative policies, profile-guided systems can be expensive and may not work well for applications that have variation in their inputs. We propose CaRDS, a system that combines both runtime and static analysis to determine far memory policies dynamically, at data structure granularity, and without profiling. CaRDS remoting policies can outperform prior automatic approaches by up to 2× and are within 25% of profile-guided systems when the local memory is highly constrained.
Workshop
Livestreamed
Recorded
TP
W
DescriptionData management libraries (DMLs) such as HDF5, Zarr, and NetCDF are used heavily across domains. Despite their heavy usage, security of DMLs has been sparsely explored. Threat modeling is a method for analyzing the security of complex software systems; STRIDE is the most popular model used for evaluating security of software. In this study, we evaluate the application and effectiveness of STRIDE for DMLs. We identified three key shortcomings of STRIDE when applied to DMLs: the attack categorizations are often inapplicable, the attack categories provide little context, and current approaches do not analyze file structures used by DMLs. We propose CASSE as a novel threat modeling approach targeting DMLs to focus on these problems with a new attack taxonomy and including file structure diagrams. We evaluated CASSE by using it to model threats on three popular DMLs, HDF5, NetCDF, and Zarr. The application of CASSE to other DMLs is similar.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionHPC workloads are increasingly data-intensive, with contention on shared storage emerging as a primary bottleneck. Existing I/O-aware job schedulers rely on static bandwidth assumptions that overlook time-varying I/O behavior, leading to inefficient utilization and unpredictability.
This work introduces the Contention-Avoiding Temporal I/O-aware job Scheduling (CATIOS) framework, which considers temporal I/O behavior in scheduling decisions to address these issues. CATIOS first matches incoming jobs with historical jobs that have similar semantic and resource profiles; these matched jobs serve as proxies and, after being reprioritized by configurable scheduling policies, are evaluated sequentially in a time-resolved, contention-aware simulation. This enables CATIOS to avoid overlapping I/O bursts according to the chosen objectives.
Evaluations with Blue Waters workloads on a SimGrid-based platform show that CATIOS reduces makespan while maintaining controlled average wait times, achieving a balanced trade-off of approximately 1:1.3 between makespan reduction and average wait time increase. These results demonstrate CATIOS’s capability to improve data-intensive HPC systems with mixed workloads.
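The following toy Python sketch conveys the general idea of temporal, contention-aware scheduling: given predicted I/O burst profiles, job start times are chosen greedily so that aggregate burst bandwidth stays under a cap. The jobs, burst profiles, and policy are hypothetical placeholders, not CATIOS's matching logic or simulator.

```python
"""Toy contention-aware temporal scheduling: delay job start times so that
predicted I/O bursts never exceed a shared bandwidth cap. All values and
the greedy policy are hypothetical placeholders, not CATIOS's design."""

# Each job predicts one I/O burst: offset after job start, duration, bandwidth.
jobs = {
    "jobA": {"burst_offset": 0, "burst_len": 4, "bw": 60},
    "jobB": {"burst_offset": 2, "burst_len": 4, "bw": 60},
    "jobC": {"burst_offset": 1, "burst_len": 3, "bw": 50},
}
BW_CAP = 100    # shared storage bandwidth (arbitrary units)
HORIZON = 40    # scheduling horizon in time steps

def burst_window(name, start):
    job = jobs[name]
    first = start + job["burst_offset"]
    return range(first, first + job["burst_len"])

usage = [0] * HORIZON   # predicted aggregate bandwidth per time step
starts = {}
for name in jobs:       # greedy, in submission order
    for start in range(HORIZON):
        window = list(burst_window(name, start))
        if window[-1] >= HORIZON:
            break       # no feasible slot within the horizon
        if all(usage[t] + jobs[name]["bw"] <= BW_CAP for t in window):
            for t in window:
                usage[t] += jobs[name]["bw"]
            starts[name] = start
            break

print(starts)   # later jobs are delayed until their bursts fit under the cap
```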
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionThis animation shows a supercomputer simulation of the large coronal mass ejection (CME) that occurred on July 14, 2000 (nicknamed the "Bastille Day Event"). The CME ejected over a billion tons of million-degree plasma at over 1,600 km/s from the solar corona, which is the magnetized outer atmosphere of the Sun. Extreme CMEs like this well-studied event can lead to costly impacts on our technological infrastructure, including power grids, satellites, and GPS communication. Magnetic field lines of the erupting flux system and a volumetric rendering of the density of the entrained plasma are shown erupting from the source active region on the Sun.
The HPC simulation was developed by Predictive Science Inc. using the Magnetohydrodynamic Algorithm outside a Sphere (MAS) Modern Fortran code (github.com/predsci/mas) at a resolution of over 60 million cells, running on several thousand processors.
Workshop
Livestreamed
Recorded
TP
W
DescriptionLarge-scale distributed computing infrastructures like the Worldwide LHC Computing Grid (WLCG) require comprehensive simulation tools for performance evaluation and resource optimization. Existing simulators suffer from limited scalability, hardwired algorithms, lack of real-time monitoring, and inability to generate machine learning-suitable datasets. We present CGSim, a simulation framework addressing these limitations. Built on the validated SimGrid framework, CGSim provides high-level abstractions for modeling heterogeneous grid environments while maintaining accuracy and scalability. Key features include a modular plugin mechanism for testing custom workflow policies, interactive real-time visualization dashboards, and automatic generation of event-level datasets for AI-assisted performance modeling. Comprehensive evaluation using production ATLAS PanDA workloads demonstrates significant calibration accuracy improvements across WLCG sites. Scalability experiments show near-linear scaling for multi-site simulations, with distributed workloads achieving 6× better performance than single-site execution. CGSim enables researchers to simulate WLCG-scale infrastructures with hundreds of sites and thousands of concurrent jobs on commodity hardware within practical time budgets.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionResearchers in high performance computing (HPC) and cloud environments encounter disparate sources of documentation and difficulties finding accurate information. This can cause inefficiency, increase reliance on support teams, and shift the researcher's focus away from the main experiment. To address these challenges, we developed an AI-powered search system leveraging large language models (LLMs) with retrieval-augmented generation (RAG) to unify various documentation sources and provide accurate, context-aware answers with cited references to relevant sources. We evaluated our RAG system with Chameleon Cloud testbed documentation as a case study, finding that our RAG system outperforms other generic LLMs in answering a variety of user questions and performs comparably to proprietary LLMs when properly tuned and optimized.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionIn this work, we evaluate the range of performance-energy tradeoffs achievable by the independent application of node-level power capping and DVFS controls on the Aurora supercomputer. We then analyze the default uncore frequency behavior under constrained node power, revealing inefficiencies—particularly in memory-bound workloads—where the uncore is under-provisioned despite high demand. To address this, we propose U-PullUp, a lightweight uncore DVFS control strategy designed to operate alongside a node-level power cap. U-PullUp complements the default policy by increasing the frequency of the uncore when its utilization is high. Our approach yields up to 34.2% energy savings under a power limit.
Paper
Energy Efficiency
Performance Measurement, Modeling, & Tools
Power Use Monitoring & Optimization
State of the Practice
Livestreamed
Recorded
TP
DescriptionWhile historically used for graphics applications, graphics processing units (GPUs) have become the most prominent hardware for accelerating parallel workloads, including HPC and AI/ML. As demand for GPUs has skyrocketed, AMD released the CDNA3 architecture to accelerate HPC and generative AI. This paper serves as a comprehensive third-party evaluation of the AMD CDNA3 GPU, specifically the MI300X and MI325X, by characterizing their performance, power, and energy efficiency using microbenchmarks and real-world applications.
First, we develop a microbenchmark to investigate the computing capability of the compute unit and measure device-wide scaling. Secondly, we measure both on-chip and off-chip memory access latency and bandwidth, and communication link bandwidth between devices. Thirdly, we subject both GPUs to real-world applications. Although MI325X gives the highest performance, the best energy efficiency is often obtained by capping the MI325X at the same power level as MI300X, with the higher HBM3E bandwidth solely contributing to performance improvements.
Workshop
Livestreamed
Recorded
TP
W
DescriptionAs GPU-accelerated high-performance computing (HPC) systems approach exascale performance, controlling energy consumption without compromising throughput is essential. Architectures such as the AMD MI250X-based Frontier supercomputer provide runtime mechanisms like frequency and power capping, enabling energy tuning without modifying application code. Although both target energy reduction, they operate via distinct hardware control paths and influence workloads differently. We present a comprehensive evaluation of these strategies on a leadership-class system using diverse HPC proxy applications representative of production workloads. Our study analyzes performance–energy trade-offs across multiple capping levels, node counts (1 and 32), and application profiles. Results show that frequency capping generally achieves higher energy efficiency and scalability, with gains of up to 13.2% without performance loss, while power capping is more effective for single-node runs or bursty GPU utilization. We also provide practical guidelines to help system administrators and users balance energy efficiency and performance in large-scale scientific workloads.
Workshop
Livestreamed
Recorded
TP
W
DescriptionDifferent compilers can generate code with notably different performance characteristics—even on the same system. Today, GPU developers have three popular options for compiling CUDA or HIP code for GPUs. First, CUDA code can be compiled by either NVCC or Clang for NVIDIA GPUs. Alternatively, AMD’s recently introduced HIP platform makes porting from CUDA to HIP relatively simple, enabling compilation for AMD and NVIDIA GPUs. This study compares the performance of 107,632 data-compression algorithms when compiling them with different compilers and running them on different GPUs from NVIDIA and AMD. We find that the relative performance of some of these codes changes significantly depending on the compiler and hardware used. For example, Clang tends to produce relatively slow compressors but relatively fast decompressors compared to NVCC and HIPCC.
Workshop
Livestreamed
Recorded
TP
W
DescriptionKubernetes scales and automates container orchestration, deployment, and management across environments, supporting the increasing demand for novel HPC workflows, especially in AI. Kubernetes' declarative approach allows users to schedule, scale, and monitor containers while supporting multiple container runtimes. Charliecloud is a container runtime that enhances HPC workloads by allowing fully unprivileged, lightweight container management. However, Kubernetes is only compatible with container runtimes that implement the Container Runtime Interface (CRI). To address this, we developed a prototype CRI-compatible server for Charliecloud, allowing Kubernetes to manage pods and create, start, and track Charliecloud containers. Despite Kubernetes expecting certain features that Charliecloud does not use, such as network namespaces, we show that the two systems can still communicate effectively. Our implementation requires minimal modifications to Charliecloud without changes to Kubernetes. This demonstrates that Kubernetes and Charliecloud are compatible tools, advancing workflows that require large compute power. LA-UR-24-28252
Workshop
Livestreamed
Recorded
TP
W
DescriptionWith the increase in GPU workflows, GPU containers are needed for these complex workflows, combining the flexibility of containers with the performance benefits of GPUs. For popular NVIDIA containers, there are extra requirements, such as access to GPU drivers and an installed nvidia-container-toolkit.
There are three new features that Charliecloud is working on specifically to help GPU workflows: NVIDIA CDI, ENTRYPOINT and CMD Dockerfile instructions, and a tool that would work similarly to docker-compose. While CDI was implemented to bring Charliecloud up to standard, ENTRYPOINT and docker-compose are features that users have been explicitly asking for.
This lightning talk will discuss how Charliecloud is currently working on adding new tools and features to better support our GPU users, and will give a status update on NVIDIA CDI, ENTRYPOINT, and docker-compose. LA-UR-25-28140
Birds of a Feather
System Software
Livestreamed
Recorded
TP
XO/EX
DescriptionThe PMIx infrastructure project supports the integration of applications, tools, middleware, and system runtimes. Since its introduction in 2014, adoption has spread across the HPC community from clusters to cloud environments, embracing uses spanning application launch to dynamic application management, fault tolerance, and cross-library coordination.
We invite SC attendees to an overview of the latest activities surrounding PMIx. We will cover new, highly challenging use-cases of particular interest to dynamic workflow management and malleable applications where PMIx may play an important role in the solution, and solicit input on prioritizing the roadmap for the upcoming year.
Workshop
Livestreamed
Recorded
TP
W
DescriptionScientists and operators at SLAC National Accelerator Laboratory rely on electronic logbooks (ELOGs) to record and share information surrounding accelerator operations. However, since creating log entries is time-consuming and complex, they are often brief, incomplete, jargonized, and inconsistent. With thousands of records spanning decades, this makes it difficult for operators to search for and interpret information. Through interviews with operators, we identified two critical gaps: the lack of automated shift summarization and the difficulty of real-time ELOG information retrieval. We introduce ChatEED, a novel agentic retrieval-augmented generation (RAG) system that addresses these two needs while also prioritizing security, modularity, efficiency, and transparency. In this paper, we analyze the operator needs and workflow that guide the system design, detail the system architecture and deployment, and outline future directions for expansion and evaluation. This ongoing work demonstrates the potential for AI systems to improve continuity, communication, and efficiency in high-performance science facilities.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionChatHPC democratizes large language models for the high performance computing (HPC) community by providing the infrastructure, ecosystem, and knowledge needed to apply modern generative AI technologies to rapidly create specific capabilities for critical HPC components while using relatively modest computational resources. We target major components of the HPC software stack, including programming models, runtimes, I/O, tooling, and math libraries. Thanks to AI, ChatHPC provides a more productive HPC ecosystem by boosting important tasks related to portability, parallelization, optimization, scalability, and instrumentation, among others. With relatively small datasets (on the order of KB), the AI assistants, which are created in a few minutes by using one node with two NVIDIA H100 GPUs and the ChatHPC library, can create new capabilities with the Meta Code Llama base model to produce high-quality software with a level of trustworthiness up to 90% higher than the OpenAI ChatGPT-4o model for critical programming tasks in HPC.
Paper
State of the Practice
Livestreamed
Recorded
TP
DescriptionChatHPC democratizes LLMs for HPC by providing the ecosystem and state of the practice for the HPC community to rapidly create specific capabilities for critical HPC components using AI on reasonable computational resources. Our divide-and-conquer approach creates a collection of reliable and optimized AI assistants, which can be merged together, and is based on cost-effective and fast Code Llama fine-tuning supervised by experts. We target major components of the HPC software stack, including programming models, runtime, I/O, tooling, and math libraries. ChatHPC provides a productive HPC ecosystem by boosting tasks related to portability, parallelization, optimization, scalability, and instrumentation through the assistance of AI. With small data sets, ChatX assistants are capable of creating previously nonexistent capabilities in the 7B-parameter CodeLlama-Base model and producing high-quality software with a level of trustworthiness up to 90% higher than the 1.8T-parameter ChatGPT-4o model for critical programming tasks in the HPC software stack.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe present CiRE, a tool that computes floating-point rounding error for basic blocks of LLVM codes via static analysis. Using CiRE, programmers can explore different mixed-precision settings and compiler optimizations while improving performance and guarding against excessive error. Our studies using CiRE have yielded the following insights: (1) often, performance as well as accuracy can be improved; (2) compilers for different languages produce code with widely varying error, even for the same expression; (3) the choice of subexpressions to target for low precision allocation has a huge impact on error and performance. CiRE can analyze expressions with 10^5 or more operators, thus making it capable of analyzing basic blocks generated by unrolling loops.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionNumerical programmers often adjust precision settings and compiler optimizations to maximize performance, but these changes can unpredictably affect floating-point rounding errors. We present CIRE, a tool that statically estimates tight bounds on floating-point rounding error by analyzing LLVM code which reflects precision and optimization choices. This enables simultaneous optimization of both error and performance. CIRE uses symbolic automatic differentiation with interval-based optimization to calculate maximum error across input intervals of interest. Our findings reveal that (1) optimizations sometimes improve both performance and accuracy by reducing the number of operations; (2) results vary significantly between different source languages, such as C++ and Rust; and (3) error and performance depend heavily on which subexpressions are subject to lower-precision allocation. We evaluate various combinations of optimizations and precision across popular benchmarks, providing insights into factors affecting numerical error and computational performance.
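As a simplified illustration of interval-based rounding error estimation, the Python sketch below propagates value intervals and accumulates a first-order bound of one unit roundoff per operation. It conveys the flavor of the approach only; it is not CIRE's symbolic automatic differentiation or its LLVM-level analysis.

```python
"""Simplified interval-based rounding error estimation: propagate value
intervals and accumulate a first-order bound of eps * max|result| per
floating-point operation. Illustrative only, not CIRE's actual analysis."""

EPS = 2.0 ** -53  # unit roundoff for IEEE double precision

class IntervalErr:
    def __init__(self, lo, hi, err=0.0):
        self.lo, self.hi, self.err = lo, hi, err

    def _mag(self):
        return max(abs(self.lo), abs(self.hi))

    def __add__(self, other):
        lo, hi = self.lo + other.lo, self.hi + other.hi
        err = self.err + other.err + EPS * max(abs(lo), abs(hi))
        return IntervalErr(lo, hi, err)

    def __mul__(self, other):
        prods = [self.lo * other.lo, self.lo * other.hi,
                 self.hi * other.lo, self.hi * other.hi]
        lo, hi = min(prods), max(prods)
        # Propagated error |x|*dy + |y|*dx, plus the new rounding error.
        err = (self._mag() * other.err + other._mag() * self.err
               + EPS * max(abs(lo), abs(hi)))
        return IntervalErr(lo, hi, err)

# Bound the rounding error of x*y + z for x, y in [1, 2] and z in [-1, 1].
x = IntervalErr(1.0, 2.0)
y = IntervalErr(1.0, 2.0)
z = IntervalErr(-1.0, 1.0)
result = x * y + z
print(f"value in [{result.lo}, {result.hi}], error bound ~ {result.err:.2e}")
```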
Workshop
Livestreamed
Recorded
TP
W
DescriptionThree-dimensional electron diffraction (3D ED) has become an essential technique for determining high-resolution molecular structures. In typical 3D ED experiments, hundreds to thousands of molecular structures are reconstructed from a single sample. However, many of these structures are inaccurate. Traditionally, researchers have had to manually inspect each structure to identify valid ones, which is a time-consuming and labor-intensive process.
We propose an LLM-centric automatic screening method to efficiently identify correct molecular structures from 3D ED outputs. The method proceeds through three stages: (1) rule-based filtering to eliminate clearly impossible candidates, (2) classification by a fine-tuned LLM trained on both correct and artificially-generated corrupted molecules, and (3) grouping to merge identical topologies. This combination allows diverse 3D ED datasets to be classified quickly and accurately.
This method substantially reduces the manual burden and enables efficient large-scale classification of 3D ED data.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionTraditional performance analysis tools, such as the roofline model, require visual interpretation to determine performance bounds. For CPUs which have complex cache hierarchies and front-end out-of-order capabilities—that is, the CPUs we use for high performance computing—accurately identifying the true performance bound is challenging. This work is the first step towards a data-driven approach to performance modeling, leveraging machine learning techniques. We build and evaluate a number of supervised and unsupervised models using a new curated data set of performance counters collected from well-understood (i.e., easily labeled) benchmark applications. We further analyze the data set and highlight potential "performance fingerprints" obtainable using this methodology.
Workshop
Livestreamed
Recorded
TP
W
DescriptionAs large language models move into production at unprecedented scale, the requirements for efficient, reliable, and cost-effective inference have diverged from those of training. Modern deployments must meet diverse SLAs, support rapidly growing GPU fleets, and include workloads with different performance characteristics. NVIDIA Dynamo is a production-grade framework for distributed inference at scale that addresses these challenges through modular disaggregation, topology-aware scheduling, and intelligent memory and KV-cache management. This presentation covers Dynamo’s design for high-performance inference at scale, detailing how disaggregating inference across prefill and decode phases increases utilization. We highlight advancements such as KV-cache-aware routing and offloading strategies that leverage the full memory hierarchy, from HBM to networked storage. Together, these strategies form a cohesive platform that enables efficient and scalable LLM inference in real-world production environments.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe rapid advancement of new HPC technologies has facilitated the convergence of artificial intelligence (AI), big data analytics, and HPC platforms to solve complex, large-scale, real-time analytics and applications for scientific and non-scientific fields. Given the dynamism of today’s computational environments, the traditional classroom approach to HPC pedagogy does not fit all needs at various levels of education. Traditional computer science education, which typically only briefly mentions the concept of threading, is no longer apt for preparing the future HPC workforce. Additionally, many K-12 and post-college personnel are encountering problems or are involved in projects where high performance computing can make a useful contribution. In recent years, several pedagogical and andragogical approaches have been initiated by the HPC community to increase instructional effectiveness in bridging the gaps in HPC knowledge and skills. This discussion aims to share experiences with educational challenges and opportunities that stimulate the acquisition of high performance computing skills.
Paper
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionMessage Passing Interface (MPI) is a foundational programming model for high-performance computing. MPI libraries traditionally employ network interconnects (e.g., Ethernet and InfiniBand) and network protocols (e.g., TCP and RoCE) with complex software stacks for cross-node communication. We present cMPI, the first work to optimize MPI point-to-point communication (both one-sided and two-sided) using CXL memory sharing on a real CXL platform, transforming cross-node communication into memory transactions and data copies within CXL memory, bypassing traditional network protocols. We analyze performance across various interconnects and find that CXL memory sharing achieves 7.2×-8.1× lower latency than TCP-based interconnects deployed in small- and medium-scale clusters. We address challenges of CXL memory sharing for MPI communication, including data object management over the dax representation [50], cache coherence, and atomic operations. Overall, cMPI outperforms TCP over standard Ethernet NIC and high-end SmartNIC by up to 49× and 72× in latency and bandwidth, respectively, for small messages.
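As a single-node analogy of the core idea, that a point-to-point transfer becomes a copy into shared memory which the receiver reads directly rather than a trip through a network protocol, the Python sketch below passes a length-prefixed message through a shared buffer. It deliberately ignores the CXL-specific coherence, atomics, and dax-management issues the paper addresses; the buffer name and layout are hypothetical.

```python
"""Single-node analogy of shared-memory message passing: a "send" is a copy
into a shared buffer that the "receiver" reads directly. This ignores the
CXL-specific coherence and atomicity issues cMPI handles; names are made up."""
import struct
from multiprocessing import shared_memory

# "Sender": write a length-prefixed payload into a shared buffer.
shm = shared_memory.SharedMemory(create=True, size=4096, name="cxl_like_buf")
payload = b"halo exchange data"
struct.pack_into("I", shm.buf, 0, len(payload))
shm.buf[4:4 + len(payload)] = payload

# "Receiver" (normally a separate process attaching by name): read it back.
peer = shared_memory.SharedMemory(name="cxl_like_buf")
n, = struct.unpack_from("I", peer.buf, 0)
message = bytes(peer.buf[4:4 + n])
print(message)

peer.close()
shm.close()
shm.unlink()
```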
Workshop
Livestreamed
Recorded
TP
W
DescriptionFuture supercomputers must be designed not only for raw speed but also for efficiency, scalability, and fitness to scientific and AI-driven workloads. Quantitative codesign provides a systematic way to achieve this by linking workload characterization, performance modeling, and hardware–software trade-offs.
This talk will outline the principles of quantitative codesign and show how metrics such as time-to-solution, energy, and numerical accuracy guide choices in processors, memory, interconnects, and software. Unfortunately, we have not had a real co-design baked into our supercomputers.
Birds of a Feather
Architectures & Networks
Livestreamed
Recorded
TP
XO/EX
DescriptionThe world's largest supercomputers for scientific discovery are also premier systems for artificial intelligence model training and inference. While traditional HPC compute has predominantly leveraged the MPI standard, AI workloads have increasingly focused on collective communication libraries, such as NVIDIA's NCCL and AMD's RCCL, which are optimized for high-bandwidth throughput. This BoF session at SC25 aims to delve into the intricacies of collective communication libraries, focusing on the comparison between the widely adopted Message Passing Interface (MPI) and NCCL/RCCL, as well as other key messaging libraries such as SHMEM.
Early Career
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionThis session will cover the art of storytelling, with topics like theme, structure, purpose, and resolution, including what makes some stories good while others leave a bad aftertaste.
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionJoin Elvis Leka as he addresses the escalating thermal management challenges in high-density AI data centers. Parker Hannifin leverages decades of expertise in fluid transfer and thermal management to introduce innovative liquid cooling solutions designed specifically for space-constrained, high-performance environments.
This session covers the latest advancements in single-phase and two-phase liquid cooling technologies, focusing on fluid transfer products that achieve higher flow rates while minimizing pressure drops, crucial for maintaining efficiency and reliability in modern data centers. Attendees will gain practical insights into integrating these solutions within existing infrastructures, optimizing cooling performance without sacrificing valuable rack space.
With industry leaders already deploying thousands of liquid-cooled AI racks each month, this presentation delivers timely, actionable insights for engineers, IT administrators, and decision-makers aiming to future-proof their data centers, equipping them to confidently address both today’s and tomorrow’s most demanding AI cooling challenges.
Workshop
Performance Evaluation, Scalability, & Portability
Livestreamed
Recorded
TP
W
DescriptionDue to the increasing diversity of high-performance computing architectures, researchers and practitioners are increasingly interested in comparing a code’s performance and scalability across different platforms. However, there is a lack of available guidance on how to actually set up and analyze such cross-platform studies. In this talk, we contend that the natural base unit of computing for such studies is a single compute node on each platform and offer guidance in setting up, running, and analyzing node-to-node scaling studies. We propose templates for presenting scaling results of these studies and provide several case studies highlighting the benefits of this approach.
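As a small example of the quantities such node-to-node scaling studies typically report, the Python helper below computes speedup and parallel efficiency relative to a single-node run on the same platform; the runtimes are made-up placeholders, not results from the talk.

```python
"""Sketch of basic node-to-node scaling metrics: speedup and parallel
efficiency relative to a single-node run. Runtimes are placeholders."""

def scaling_table(runtimes):
    """runtimes: {node_count: seconds}; must include a 1-node entry."""
    t1 = runtimes[1]
    rows = []
    for nodes in sorted(runtimes):
        speedup = t1 / runtimes[nodes]
        efficiency = speedup / nodes
        rows.append((nodes, runtimes[nodes], speedup, efficiency))
    return rows

measured = {1: 1000.0, 2: 520.0, 4: 270.0, 8: 150.0}   # placeholder data
print(f"{'nodes':>5} {'time(s)':>8} {'speedup':>8} {'eff':>6}")
for nodes, t, s, e in scaling_table(measured):
    print(f"{nodes:>5} {t:>8.1f} {s:>8.2f} {e:>6.2f}")
```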
Workshop
Livestreamed
Recorded
TP
W
DescriptionDistributed-memory parallel processing addresses computational problems requiring significantly more memory or computational resources than can be found on one node. Software written for distributed-memory parallel processing typically uses a distributed-memory parallel programming framework to enhance productivity, scalability, and portability across supercomputers and cluster systems. These frameworks vary in their capabilities and support for managing communication and synchronization overhead to achieve scalability. This paper employs a communication-intensive distributed radix sort algorithm to examine and compare the performance, scalability, usability, and productivity differences between five distributed-memory parallel programming frameworks: Chapel, MPI, OpenSHMEM, Conveyors, and Lamellar. The Chapel implementation has the fewest source lines of code (113) and is the most performant on 128 nodes of an HPE Cray Supercomputing EX (achieving about 17 billion elements sorted per second). The source code is available at https://github.com/mppf/distributed-lsb, and we welcome contributions, including optimizations to the implementations and results from runs on different systems.
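For readers unfamiliar with the benchmark, the sketch below is a plain sequential least-significant-digit (LSD) radix sort in Python, the algorithm that the paper distributes across nodes; it is not any of the five framework implementations being compared.

```python
"""Sequential least-significant-digit (LSD) radix sort on non-negative
integers, illustrating the algorithm the paper distributes; this is not
any of the five framework implementations."""

def lsd_radix_sort(values, bits_per_digit=8):
    if not values:
        return values
    radix = 1 << bits_per_digit
    mask = radix - 1
    max_bits = max(values).bit_length()
    shift = 0
    while shift < max_bits:
        # Counting sort by the current digit; stability preserves prior passes.
        buckets = [[] for _ in range(radix)]
        for v in values:
            buckets[(v >> shift) & mask].append(v)
        values = [v for bucket in buckets for v in bucket]
        shift += bits_per_digit
    return values

print(lsd_radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```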
Workshop
Livestreamed
Recorded
TP
W
DescriptionGraph algorithms are important for many domains, and GPUs can be used to accelerate them. Unfortunately, CUDA code can only be run on NVIDIA GPUs. In this study, we port the CUDA graph codes from the Indigo3 benchmark suite to HIP. This enables them to run on both AMD and NVIDIA GPUs. In addition, it allows the study of performance differences between compiled CUDA and HIP codes that are otherwise identical. Since the Indigo3 codes are written in a variety of different implementation and parallelization styles, this also allows us to study the performance of AMD GPUs on these styles and compare the results with the NVIDIA-based style trends.
Paper
HPC for Machine Learning
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionWith the proliferation of deep learning technologies across various service domains, the sharing of accelerators such as GPUs, TPUs, and NPUs for inference processing has become increasingly common. These accelerators must efficiently handle multiple deep learning services operating concurrently. However, inference requests, characterized by sequences of short-duration kernels, create significant challenges for online schedulers attempting to maintain quality of service (QoS) guarantees.
This paper presents QoSlicer, a novel compile-time QoS management framework that employs kernel slicing to relieve the burden on schedulers. By generating multiple pre-determined slicing plans, QoSlicer enables more efficient, lightweight QoS scheduling while ensuring target latency requirements are met. Our approach incorporates a heuristic search algorithm to identify optimal slicing plans and implements robust performance estimation models to validate these plans. Our experimental evaluation across 75 diverse workload combinations demonstrates that QoSlicer improves throughput by an average of 20.2% compared to state-of-the-art scheduling techniques.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis paper describes a novel application and evaluation of programmable networking in High-Energy Physics (HEP): a complete parser for the custom packet format used by Fermilab’s DUNE experiment. Notably, this parser is implemented on a Tofino programmable network switch and evaluated on the FABRIC testbed by using network traffic generated by the ICEBERG DUNE prototype. The parsed network traffic consists of Jumbo Ethernet frames that contain digitizations of sensor readings from ICEBERG’s detector.
This work is an early investigation into providing in-network processing support for HEP experiments. The paper describes DUNE’s custom packet format, the challenges encountered when implementing a parser for that format, and an exploration of the techniques that are needed to overcome those challenges. We identify performance bottlenecks and discuss directions for future research.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWith the widespread application of Mixture of Experts (MoE) reasoning models in the LLM field, efficiently serving MoE models under limited GPU memory constraints has emerged as a significant challenge. Offloading non-activated experts to main memory is an effective way to address this problem, but it introduces the challenge of transferring experts between GPU memory and main memory. We therefore need an efficient approach to compressing experts, along with an analysis of how compression error affects inference performance.
To bridge this gap, we propose employing error-bounded lossy compression algorithms (such as SZ3 and CuSZp) to compress non-activated experts, thereby reducing data transfer overhead during MoE inference. We conduct extensive experiments across various benchmarks and present a comprehensive analysis of how compression-induced errors in different experts affect overall inference accuracy.
Tutorial
Livestreamed
Recorded
TUT
DescriptionLarge-scale numerical simulations, observations, experiments, and AI computations generate or consume very large datasets. Data compression is an efficient technique to reduce scientific datasets and make them easier to analyze, store, and transfer. The first part of this half-day tutorial reviews the motivations, principles, techniques, and error analysis methods for lossy compression of scientific datasets. It details the main compression stages (decorrelation, approximation, coding) and their variations in state-of-the-art generic lossy compressors: SZ, ZFP, MGARD, and SPERR. The second part of the tutorial focuses on lossy compression trustworthiness, hands-on sessions, and customization of lossy compression to respond to user-specific lossy compression constraints. In the third part of the tutorial, we discuss different ways of composing and testing specialized lossy compressors. The tutorial uses examples of real-world scientific datasets to illustrate the different compression techniques and their performance. The tutorial features one hour of hands-on sessions on generic compressors and how to compose specialized compressors. Participants are encouraged to bring their data to make the tutorial productive. The tutorial, given by the leading teams in this domain and primarily targeting beginners interested in learning about lossy compression for scientific data, is improved from the highly rated tutorials given at SC17-SC24.
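As a toy illustration of the decorrelation and approximation stages discussed in the tutorial, the Python sketch below uses a one-step predictor and error-bounded quantization of the residuals; real compressors such as SZ, ZFP, MGARD, and SPERR add a coding stage and far more sophisticated prediction, so this is only a conceptual example.

```python
"""Toy prediction-based, error-bounded lossy compression: a one-step
predictor (decorrelation) plus uniform quantization of residuals
(approximation). A coding stage would follow in a real compressor."""

def compress(data, abs_err):
    codes, prev = [], 0.0
    for x in data:
        residual = x - prev                     # predict from last decoded value
        q = round(residual / (2.0 * abs_err))   # quantize the residual
        codes.append(q)
        prev = prev + q * 2.0 * abs_err         # track what the decoder will see
    return codes

def decompress(codes, abs_err):
    out, prev = [], 0.0
    for q in codes:
        prev = prev + q * 2.0 * abs_err
        out.append(prev)
    return out

data = [1.00, 1.02, 1.05, 1.30, 1.28, 1.27]
codes = compress(data, abs_err=0.01)
recon = decompress(codes, abs_err=0.01)
print("max abs error:", max(abs(a - b) for a, b in zip(data, recon)))
```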
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionWe have developed a comprehensive simulation tool to model the launching, progression, and completion of virtual machines and corresponding workloads within a cloud cluster of arbitrary size. The simulator employs various policies to allocate computational resources for these virtual machines, simulates hardware failures and workload interruptions, and reallocates new resources as needed. The primary goal of this work is to test the interaction of allocation policy design with various types of hardware failures, analyzing the expected resource utilization and workload delay in these scenarios. The modular design of the simulator provides the framework for implementing and analyzing cutting-edge allocation policies as they emerge. Through a series of experiments, the simulator demonstrates the effectiveness of different policies in managing resource allocation amidst failing hardware, providing valuable insights into the optimization of cloud infrastructure and the development of resilient resource management strategies.
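A minimal sketch of the kind of scenario the simulator models is shown below: first-fit VM placement, a host failure, and reallocation of the displaced VMs. The hosts, capacities, policy, and failure model are placeholders, not the simulator's actual design.

```python
"""Tiny sketch of the simulated scenario: first-fit VM placement, a host
failure, and reallocation of displaced VMs. Policies, capacities, and the
failure model are placeholders, not the simulator's actual design."""
hosts = {"h0": 16, "h1": 16, "h2": 16}      # free cores per host
placement = {}                               # vm -> host

def first_fit(vm, cores):
    for host, free in hosts.items():
        if free >= cores:
            hosts[host] -= cores
            placement[vm] = host
            return True
    return False                             # VM must wait (delay recorded)

vms = {"vm0": 8, "vm1": 8, "vm2": 8, "vm3": 8, "vm4": 8}
for vm, cores in vms.items():
    first_fit(vm, cores)

# Simulate a hardware failure on h0: its VMs are interrupted and reallocated.
failed = "h0"
displaced = [vm for vm, host in placement.items() if host == failed]
del hosts[failed]
for vm in displaced:
    del placement[vm]
    first_fit(vm, vms[vm])

print(placement)   # displaced VMs land on surviving hosts if capacity allows
```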
Workshop
Livestreamed
Recorded
TP
W
DescriptionFoundation models are driving a paradigm shift across the life sciences, yet their transformative potential is fundamentally coupled to high-performance computing (HPC). The computational workloads from genomics, transcriptomics, proteomics, chemistry, and biomedical literature are remarkably diverse, creating distinct challenges for HPC infrastructure. This paper presents the first systematic, cross-domain analysis of these HPC needs. We characterize and compare the specific bottlenecks inherent to each domain—from the massive I/O of genomics to the intense memory pressure of proteomics and the unique compute kernels of molecular modeling. Analyzing these diverse workloads allows us to identify key trade-offs in hardware utilization and software design. We conclude by outlining a unified set of best practices and co-design principles for building next-generation HPC systems capable of accelerating discovery across the full spectrum of AI-driven science.
ACM Gordon Bell Climate Modeling Finalist
Awards and Award Talks
Applications
GBC
Livestreamed
Recorded
TP
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionMSK's HPC team shares how they balance competing requirements from genomics to AI research and across labs, core facilities, and clinical practice. They present practical solutions and highlight breakthroughs made possible by their infrastructure.
Paper
State of the Practice
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionThe increasing interconnectivity of HPC systems has highlighted the need for efficient application migration across different environments. Containers, widely adopted for this purpose, simplify deployment but often fail to deliver optimal performance due to the separated build and execution container workflow. This leads to generic container images that miss out on system-specific software stack advantages, a challenge we define as the adaptability issue.
We propose coMtainer, a compilation-assisted image transformation framework that embeds build-time information into images. This enables remote HPC systems to specialize and rebuild the container using native toolchains and libraries. coMtainer preserves image neutrality while resolving the adaptability issue, allowing optimized execution without user involvement. Moreover, the embedded metadata unlocks advanced compiler optimizations such as LTO and PGO. We implement and evaluate coMtainer across a variety of real-world HPC applications, demonstrating coMtainer's practicability, applicability, and effectiveness.
Early Career
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionProgram review and feedback (Webinars as well)
Conference recommendations
SC26 volunteering
Workshop
Livestreamed
Recorded
TP
W
DescriptionWhile increases in available hardware concurrency have been the primary area of performance improvement over the last decade or so, parallel/concurrent programming is still a challenge. Most mainstream programming approaches, languages, and systems are designed for sequential programming first, with concurrency an afterthought. This poses a challenge for modern workloads, especially in areas such as artificial intelligence, machine learning, and data analytics, where there is an abundance of irregular concurrency due to unbalanced workloads and I/O patterns. Additionally, concurrency bugs tend to be nondeterministic, difficult to trace/reproduce, and consequently under-reported.
In this position paper, we describe the state-of-the-art in workflow-level concurrency, the challenges and opportunities in emerging application areas, and outline a solution in the form of a novel Python-based programming model.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionRecent work at NSF NCAR has developed Python packages and documentation for instantiating regional ocean models in the Community Earth System Model, but how can we guide a community of users through the subsequent tuning and development of purpose-built models? Here, we leverage recent advances in natural language processing and large language models (LLMs) to explore novel tools for guiding regional model development. We demonstrate that a curated, high-quality dataset based on a small number of interviews with experts can be used to fine-tune or context-prompt LLMs for use in regional modeling. This style of training data—regional modeling narratives—emphasizes the importance of high-quality, disciplinary data in LLM development, and it has the potential to provide access to previously siloed, institutional experience. In the future, we aim to grow this dataset and incorporate more technical documentation that the LLM can dynamically retrieve to inform more concrete guidance.
Workshop
Livestreamed
Recorded
TP
W
DescriptionMany instruments used in high-energy particle physics observations, e.g., gamma-ray telescopes, use FPGAs for front-end signal processing of raw sensor data. The use of high-level synthesis (HLS) to express the signal processing algorithms has the potential to significantly reduce development time for new instruments of this type. We describe our experience with one of the computational stages in the signal processing pipeline, island detection, exploring its implementation across multiple configurations: 1D versus 2D islands, and 4-way versus 8-way connected-component labeling (CCL) in the 2D configuration. We report resource usage and performance for each, including the optimizations necessary for HLS to be effective. Results indicate that our implementation can perform 4-way CCL on 15k images per second even for 43 × 43 pixel arrays.
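As a point of reference, the following is a minimal software sketch of 4-way island detection (connected-component labeling) on a thresholded pixel array. It only illustrates the computation being synthesized; it is not the HLS/FPGA implementation described above, and the threshold and image size are assumptions for the example.

```python
import numpy as np
from collections import deque

def label_islands_4way(mask: np.ndarray) -> np.ndarray:
    """Label each 4-connected island of True pixels with a distinct positive integer."""
    labels = np.zeros(mask.shape, dtype=np.int32)
    next_label = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue                       # pixel already belongs to an island
        next_label += 1
        labels[start] = next_label
        queue = deque([start])
        while queue:                       # breadth-first flood fill over 4-neighbors
            r, c = queue.popleft()
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
                        and mask[nr, nc] and not labels[nr, nc]):
                    labels[nr, nc] = next_label
                    queue.append((nr, nc))
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((43, 43)) > 0.8     # 43 x 43 pixel array, as in the abstract
    print("islands found:", int(label_islands_4way(image).max()))
```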
Tutorial
Livestreamed
Recorded
TUT
DescriptionQuantum computing has the potential to revolutionize many fields in the 21st century. Over the past decade, numerous quantum computers have been made publicly available. However, the effectiveness of the hardware is heavily reliant on the software ecosystem—a lesson drawn from classical computing's evolution. Unlike classical systems, which benefit from mature electronic design automation and high performance computing (HPC) tools for handling complexity and optimizing performance, quantum software is still in its infancy. One of the goals of this tutorial is to educate the HPC community on quantum computing and to bring these two communities closer together. To this end, the tutorial intends to cover topics such as high-level support for users in realizing applications as well as efficient methods for the classical simulation, compilation, and verification of quantum circuits. Furthermore, the tutorial showcases how expertise in classical HPC can address key challenges in the quantum software stack, enhancing efficiency, scalability, and reliability. All of the above is accompanied by hands-on demonstrations based on the Munich Quantum Toolkit (MQT), an open-source collection of high-performance software tools for quantum computing developed by the Chair for Design Automation at the Technical University of Munich and the Munich Quantum Software Company.
Paper
HPC for Machine Learning
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionOptimizing deep learning (DL) operators, especially GEMM-like operations, on heterogeneous many-core processors such as MT-3000 is difficult due to large search spaces and hardware-specific constraints. Existing methods, including hand-tuned libraries and auto-tuners, are either costly to develop or deliver limited performance. We propose DynaChain, an operator-level optimization framework for MT-3000. DynaChain separates computation and data movement, enabling independent optimization and maximizing data reuse across schedules. To shrink the search space, it employs constraint dependency chains that dynamically prune invalid scheduling choices. For irregular matrix dimensions, DynaChain uses an integer linear programming (ILP) based decomposition to avoid padding and enhance hardware utilization. At the low level, it generates optimized micro-kernels tailored to MT-3000’s VLIW+SIMD architecture, improving register allocation and pipelining for irregular operations. Experiments on representative DL operators show that DynaChain eases kernel development for heterogeneous architectures while achieving performance comparable to expert-tuned libraries.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe ongoing revolution enabled via containerization, virtualization, and new orchestration models has dramatically changed how applications and services are delivered and managed across the computing industry. This revolution has established a new ecosystem of tools and techniques with new, flexible and agile approaches, and continues to gain traction in the HPC community. In addition to HPC-optimized container runtimes, emerging technologies like Kubernetes create a new set of opportunities and challenges. While adoption is growing, questions regarding best practices, foundational concepts, tools, and standards remain. Our goal is to promote the adoption of these tools and introspect the impact of this new ecosystem on HPC use cases. This workshop serves as a key venue for presenting late-breaking research, sharing experiences and best practices, and fostering collaboration in this field. Our seventh workshop iteration will continue to emphasize real-world experiences and challenges in adopting and optimizing these new approaches for HPC.
Birds of a Feather
Architectures & Networks
Livestreamed
Recorded
TP
XO/EX
DescriptionContinuum computing is a distributed, multi-layered ecosystem that spans sensors at the edge, interconnected instruments, data centers, supercomputers, and recently quantum computers. These interconnected systems form a digital continuum wherein computation is orchestrated in various stages. The rising complexity of the continuum is accompanied by a corresponding increase in the vulnerability of its environment. The convergence of AI, data-intensive applications, and mobile workloads necessitates a reevaluation of strategies for securing these systems. We will explore resilience in continuum computing, emphasizing technologies and architectures that protect data, models, and computation across a federated landscape, including advances in quantum networks for secure communication.
Birds of a Feather
Architectures & Networks
Livestreamed
Recorded
TP
XO/EX
DescriptionNational laboratories and supercomputing centers are increasingly deploying heterogeneous systems integrating multi-core CPUs with GPUs, AI accelerators, FPGAs, IPUs, and DPUs. While these converged HPC-AI platforms promise unified infrastructure for large-scale simulation and AI workloads, they introduce complexities in programming, scalability, portability, and optimization. This BoF session examines the novel co-design strategies required to address heterogeneous architectures, focusing on scalable application frameworks, unified programming models, portable workflows, and software-hardware integration. Attendees will share experiences, tools, and emerging architectures to facilitate the convergence of HPC and AI, aiming to equip the community with actionable insights for developing high-performance applications on post-exascale systems.
Paper
Energy Efficiency
Performance Measurement, Modeling, & Tools
Power Use Monitoring & Optimization
State of the Practice
Livestreamed
Recorded
TP
DescriptionEfforts to reduce the environmental impact of HPC often focus on resource providers, but choices made by users (e.g., concerning where to run) can be equally consequential. Here we present evidence that new accounting methods that charge users for energy used can incentivize significantly more efficient behavior. We first survey 300 HPC users and find that fewer than 30% are aware of their energy consumption, and that energy efficiency is a low-priority concern. We then propose two new multi-resource accounting methods that charge for computations based on their energy consumption or carbon footprint, respectively. Finally, we conduct both simulation studies and a user study to evaluate the impact of these two methods on user behavior. We find that while merely providing users with feedback on their energy use had no impact on their behavior, associating energy use with cost incentivized users to select more efficient resources and to use 40% less energy.
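As a back-of-the-envelope illustration of why the accounting method matters, the sketch below compares a node-hour charge with an energy-based charge for the same job on two hypothetical resources. All prices, runtimes, and power draws are made-up numbers, not figures from the study.

```python
# Hypothetical comparison of node-hour vs energy-based charging for one job
# executed on two different resources. All numbers are illustrative.
resources = {
    "older CPU partition": {"node_hours": 10.0, "avg_power_kw": 0.9, "price_per_node_hour": 0.50},
    "GPU partition":       {"node_hours": 1.5,  "avg_power_kw": 2.4, "price_per_node_hour": 4.00},
}
price_per_kwh = 0.30  # hypothetical energy tariff

for name, r in resources.items():
    energy_kwh = r["node_hours"] * r["avg_power_kw"]
    node_hour_charge = r["node_hours"] * r["price_per_node_hour"]
    energy_charge = energy_kwh * price_per_kwh
    print(f"{name:22s} node-hour charge: {node_hour_charge:5.2f}   "
          f"energy charge: {energy_charge:5.2f}   ({energy_kwh:.1f} kWh)")

# Under node-hour pricing the CPU run looks cheaper; under energy pricing the
# faster, more efficient GPU run does, which is the kind of behavioral shift
# that energy-aware accounting is intended to encourage.
```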
Tutorial
Livestreamed
Recorded
TUT
DescriptionWhile many developers put a lot of effort into optimizing large-scale parallelism, they often neglect the importance of an efficient serial code. Even worse, slow serial code tends to scale very well, hiding the fact that resources are wasted because no definite hardware performance limit (“bottleneck”) is exhausted. This tutorial conveys the required knowledge to develop a thorough understanding of the interactions between software and hardware on the level of a single CPU core and the lowest memory hierarchy level (the L1 cache). We introduce general out-of-order core architectures and their typical performance bottlenecks using modern x86-64 (Intel Sapphire Rapids) and ARM (Fujitsu A64FX) processors as examples. We then go into detail about x86 and AArch64 assembly code, specifically including vectorization (SIMD), pipeline utilization, critical paths, and loop-carried dependencies. We also demonstrate performance analysis and performance engineering using the Open Source Architecture Code Analyzer (OSACA) in combination with a dedicated instance of the well-known Compiler Explorer. Various hands-on exercises will allow attendees to make their own experiments and measurements and identify in-core performance bottlenecks. Furthermore, we show real-life use cases from computational science (sparse solvers, lattice QCD) to emphasize how profitable in-core performance engineering can be.
Workshop
Debugging & Correctness Tools
Software Tools
Livestreamed
Recorded
TP
W
ACM Gordon Bell Finalist
Awards and Award Talks
Applications
GB
Livestreamed
Recorded
TP
DescriptionResolving the most fundamental questions in cosmology requires simulations that match the scale, fidelity, and physical complexity demanded by next-generation sky surveys. To achieve the realism needed for this critical scientific partnership, detailed gas dynamics must be treated self-consistently with gravity for end-to-end modeling of structure formation. Exascale computing enables simulations that span survey-scale volumes while incorporating key astrophysical processes that shape complex cosmic structures. We present results from CRK-HACC, a cosmological hydrodynamics code built for extreme scalability. Using separation-of-scale techniques, GPU-resident tree solvers, in situ analysis pipelines, and multi-tiered I/O, CRK-HACC executed Frontier-E: a four-trillion-particle full-sky simulation, over an order of magnitude larger than previous efforts. The run achieved 513.1 PFLOPs peak performance, processing 46.6 billion particles per second and writing more than 100 PB of data in just over one week of runtime. Frontier-E marks a significant advance in predictive modeling for next-generation cosmological science.
Paper
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionGraph pattern matching (GPM) is essential in fields like circuit logic synthesis, anomaly detection, social network analysis, cheminformatics, recommendation systems, and classification systems. Its NP-completeness and the irregular nature of graph data make scaling to distributed systems challenging. By utilizing architecture-specific communication techniques and topology-aware data partitioning, the scalability of GPM on large-scale data can be improved. However, the lack of performance portability complicates the parallel evolution of GPM software with hardware architectures, burdening developers.
This paper proposes a vertex-addressing scheme based on a distributed shared memory model (DSM) that relaxes strict DSM constraints, achieving both performance portability and scalability. This approach enables seamless code extension to thousands of nodes across different supercomputing architectures while maintaining performance comparable to manually optimized versions.
Workshop
Debugging & Correctness Tools
HPC Software & Runtime Systems
Livestreamed
Recorded
TP
W
DescriptionThe complexity of MPI programming has led to the development of various correctness tools. Static tools may examine the entire source but yield false positives because they lack runtime context. Dynamic tools, which detect errors at runtime, offer higher precision but suffer from false negatives due to limited coverage and incur substantial overhead due to instrumentation requirements. This paper presents a coupled workflow that combines these approaches. Forwarding static reports to dynamic tools improves the overall accuracy and reduces the instrumentation overhead, and a proposed generic exchange format enables interoperability between static and dynamic tools. The coupled workflow is implemented for the example of MPI local data race detection through integration of the tools CoVer, SPMD IR, and MUST. It eliminates static-tool false positives, enables detection of races previously missed by dynamic tools, and significantly reduces the runtime overhead, by up to an order of magnitude, through targeted instrumentation.
Paper
CPU- and GPU-Initiated Communication Strategies for Conjugate Gradient Methods on Large GPU Clusters
11:15am - 11:37am CST Tuesday, 18 November 2025 263-264BP
Performance Measurement, Modeling, & Tools
Livestreamed
Recorded
TP
DescriptionStrong scaling of conjugate gradient (CG) algorithms on GPU-based supercomputers is notoriously challenging. These linear system solvers have low computational intensity, making inter-GPU communication and synchronization primary bottlenecks. In light of recent developments in multi-GPU communication, we revisit CG parallelization for large-scale GPU clusters. We implement standard and pipelined CG solvers using three flavors of multi-GPU communication: GPU-aware MPI, NVIDIA's NCCL/AMD's RCCL, and NVIDIA's NVSHMEM.
Our monolithic NVSHMEM-based implementation with GPU-initiated communication enables CPU-free execution and thus lower overhead. However, the lack of vendor-supported device-side computational kernels means that CPU-controlled CG implementations based on GPU-aware MPI or NCCL/RCCL are still favored for small GPU counts. Compared with state-of-the-art CG implementations, we have also eliminated unnecessary CPU-GPU data transfers and synchronization points. Our CG implementations are benchmarked on NVIDIA- and AMD-based supercomputers using SuiteSparse matrices and real-world finite element applications, achieving strong scaling on over 1,000 GPUs, and outperforming existing approaches.
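For readers unfamiliar with the algorithmic structure, here is a minimal single-process NumPy sketch of the standard CG iteration. The sparse matrix-vector product and the two dot products per iteration are precisely the operations that become communication and synchronization points once the solver is distributed across GPUs; the test matrix below is an illustrative 1D Laplacian, not one of the paper's benchmarks.

```python
import numpy as np

def cg(A, b, tol=1e-8, max_iter=1000):
    """Plain (unpreconditioned) conjugate gradient for a symmetric positive definite A."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r                       # global reduction #1 (dot product)
    for _ in range(max_iter):
        Ap = A @ p                       # mat-vec: halo exchange in a distributed setting
        alpha = rs_old / (p @ Ap)        # global reduction #2 (dot product)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

if __name__ == "__main__":
    n = 200                              # 1D Laplacian test problem (illustrative only)
    A = (np.diag(2.0 * np.ones(n))
         + np.diag(-np.ones(n - 1), 1)
         + np.diag(-np.ones(n - 1), -1))
    b = np.ones(n)
    x = cg(A, b)
    print("residual norm:", np.linalg.norm(A @ x - b))
```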
Panel
Algorithms, Numerical Methods, & Libraries
Architectures
Livestreamed
Recorded
TP
DescriptionMany HPC applications are, to some degree or another, memory bandwidth-bound. However, for applications that run on CPUs, it has been difficult to balance the tradeoffs of bandwidth, latency, capacity, and power to achieve a big win in improved memory bandwidth. Getting more bandwidth often requires an imbalance elsewhere, such as excess capacity, worse latency, or increased power to the CPU and/or the memory. In this session, our panel of experts—including silicon designers, system architects, scientific computing experts, and hyperscaler system providers—will examine the problems we have seen, past and present, in designing platforms around CPUs and memory technologies that are not designed to work optimally together, and what options and opportunities the community has to improve the situation through both hardware and software choices.
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionIn this 3D visualization, we delve into the atomic-scale structure of an Iron (Fe)-Chromium (Cr) alloy post-irradiation, a vital material for nuclear energy applications. The image contrasts Fe atoms in black with Cr atoms represented as vibrant spheres, each hue indicating the specific cluster to which the Cr atom belongs.
Captured through atom probe tomography (APT), this snapshot zooms into a 3-nanometer-thick slice from a sample measuring roughly 80 × 80 × 200 nanometers. The slice encompasses 447,418 atoms, among which approximately 54,000 are Cr atoms. The Fe-Cr alloy, irradiated to 1.8 displacements per atom (dpa) at 290 degrees Celsius at Idaho National Laboratory’s Advanced Test Reactor, reveals Cr atoms clustering due to irradiation effects, a phenomenon that significantly influences the alloy's physical and mechanical properties.
Leveraging the power of high performance computing (HPC) and advanced deep learning, this image-based workflow identifies and characterizes Cr clusters with unprecedented precision. The HPC workflow surpasses manual methods, offering enhanced consistency, speed, scalability, and reproducibility in APT data analysis. Each cluster is distinctly color-coded, while Fe atoms create the black backdrop, uncovering the intricate, hidden structure through the marriage of data, computation, and design.
Visualizing real experimental data is essential to understanding complex atomic-scale phenomena that are otherwise invisible. This image is not a simulation; it is reconstructed from actual APT measurements, revealing the spatial distribution of atoms. By rendering these data visually, we can more effectively interpret clustering behavior, validate computational methods, and communicate scientific insights.
Exhibitor Forum
Hardware and Architecture
Livestreamed
Recorded
TP
XO/EX
DescriptionBy 2030, data center power demand is projected to rise by 165%, with a 50% increase anticipated by 2027 alone—fueled by AI workloads and exponential cloud growth. Meeting this demand isn’t just a matter of adding more power capacity; it requires a fundamental reimagining of how that power is delivered and distributed.
HARTING is working to develop connectors that support high-power computing and enable more space-efficient racks. In this presentation, Will Stewart, Senior Industry Segment Manager, Smart Infrastructure and Mobility at HARTING will discuss the power surge led by AI and data centers, the need for high-power computing solutions, and how connectors are at the center of energy and space efficiency.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionThis paper introduces CROSS BOAT (Cross HPC System Bayesian Optimization with Adaptive Transfer), a novel method for efficient parameter tuning in high performance computing (HPC) systems. Optimizing the many configurable parameters in HPC environments usually requires costly evaluations on each target system. To address this, we propose a transfer learning approach that leverages knowledge from a well-understood source system to accelerate optimization on new targets. CROSS BOAT uses an adaptive transfer mechanism that combines expected improvement from the target with a progressively weighted source knowledge term, balancing exploration and exploitation. Experiments on simulated HPC systems show that CROSS BOAT outperforms standard Bayesian optimization when target systems differ significantly from the source, achieving up to 24.5% better performance with fewer evaluations. For more similar systems, standard methods remain competitive, underscoring the context-dependent value of transfer learning for faster and more effective HPC system optimization.
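A minimal sketch of the adaptive-transfer idea follows, assuming a Gaussian-process-style surrogate that provides a predictive mean and standard deviation on the target system. The exponential decay schedule, the parameter tau, and the way the source prediction enters the acquisition are illustrative assumptions, not the exact CROSS BOAT formulation.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best):
    """EI for minimization, given a surrogate's predictive mean and standard deviation."""
    sigma = np.maximum(sigma, 1e-12)
    z = (best - mu) / sigma
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def transfer_acquisition(mu_t, sigma_t, best_t, mu_source, n_target_evals, tau=10.0):
    """Blend target EI with source-model predictions; the source weight decays
    (hypothetical schedule) as more target evaluations accumulate."""
    w = np.exp(-n_target_evals / tau)
    # Lower predicted cost on the source system raises the acquisition value.
    return expected_improvement(mu_t, sigma_t, best_t) - w * mu_source

if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 5)                 # candidate configurations (toy example)
    mu_t = np.full_like(x, 1.0)                  # target surrogate mean
    sigma_t = np.full_like(x, 0.3)               # target surrogate std
    mu_source = (x - 0.3) ** 2                   # source model's predicted cost
    print(transfer_acquisition(mu_t, sigma_t, best_t=0.9,
                               mu_source=mu_source, n_target_evals=3))
```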
SCinet Network Research Exhibition
Not Livestreamed
Not Recorded
DescriptionNo NRI.
Workshop
Livestreamed
Recorded
TP
W
DescriptionCryoSPARC has historically meshed poorly with shared compute resources. This presentation demonstrates a way to integrate CryoSPARC into shared compute systems without the need for containers or SSH tunnels.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionRecently, there have been attempts to utilize AI accelerators for scientific computing; however, these devices generally lack hardware support for double-precision floating-point arithmetic, which is essential for many scientific applications.
The Cerebras CS-2 system (CS-2) delivers extremely high single-precision performance of 1.06 PFlops/s but does not support native double-precision arithmetic. To overcome this limitation and enable scientific computations requiring double precision, a software-based approach is essential.
We propose csDF, a double-float (DF) arithmetic library for the CS-2 that provides DF numeric types and arithmetic operations. To demonstrate the capability of csDF, we implemented a naive pseudo-double-precision matrix multiplication using DF addition and multiplication, and measured its strong scaling performance. Our result shows 8.09 Tera DF-Flops/s, which shows the feasibility of software-based double-precision arithmetic and enables previously infeasible scientific computations.
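To make the idea of software double-float arithmetic concrete, here is a minimal sketch built on Knuth's error-free addition. On the CS-2 the pair would consist of two single-precision values, whereas Python's native float is already double precision, so the snippet only illustrates the technique; it does not show the csDF library or its interfaces.

```python
def two_sum(a: float, b: float):
    """Knuth's error-free addition: returns (s, e) with s + e == a + b exactly."""
    s = a + b
    v = s - a
    e = (a - (s - v)) + (b - v)
    return s, e

def df_add(x, y):
    """Add two double-float values, each stored as an unevaluated pair (hi, lo)."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)          # renormalize so |lo| stays tiny relative to |hi|

if __name__ == "__main__":
    acc = (0.0, 0.0)
    for _ in range(10):
        acc = df_add(acc, (0.1, 0.0))
    print(acc)                    # the (hi, lo) pair retains bits a plain accumulation loses
    print(sum([0.1] * 10))        # plain float accumulation, for comparison
```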
Birds of a Feather
Storage
Livestreamed
Recorded
TP
XO/EX
DescriptionExponentially growing data volumes present fundamental challenges to manage and access large quantities of data. With a new generation of more flexible hardware and software, computational storage re-emerges as a promising technology to reduce network contention and improve performance for key applications. As industry is converging on a first set of standards, it is up to the HPC community as well as developers and scientists from the different domains to find the use cases and tools necessary. This BoF strives to connect the stakeholders, from application to middleware and hardware developers to explore the potential for HPC and scientific computing.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionMixture-of-experts (MoE) architectures enable trillion-parameter models but face prohibitive memory scaling, limited compression interpretability, and vendor-specific implementations hindering heterogeneous HPC deployment.
We present the first Julia-based MoE framework introducing CUR decomposition for interpretable expert compression—a novel approach applying CUR matrix factorization to MoE architectures—with hardware-agnostic design. While SVD-based methods provide effective compression, CUR-MoE offers comparable performance with enhanced interpretability through preserved column/row structure, maintaining viability at high compression ratios (35.29 perplexity at 70% compression). Comprehensive gating evaluation reveals ExpertChoice achieves optimal load balancing. Julia's LLVM compilation enables consistent 5-6× GPU acceleration across NVIDIA, AMD, Intel, and Apple hardware.
Our core implementation is completed, validated on WikiText-2 across platforms. We are expanding comprehensive platform support for Apple Metal and Intel Arc while extending Transformers.jl and Flux.jl integrations. The poster will include visual comparisons, cross-vendor benchmarks, detailed oral explanations, and QR codes with live interactive GitHub examples demonstrating CUR structure preservation.
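As background for the CUR-based compression mentioned above, a minimal NumPy sketch of a norm-sampled CUR decomposition follows. The sampling scheme, matrix sizes, and rank are illustrative assumptions and are not taken from the CUR-MoE implementation (which is written in Julia); the snippet only shows why keeping actual columns and rows preserves interpretability.

```python
import numpy as np

def cur_decompose(A: np.ndarray, k: int, rng=np.random.default_rng(0)):
    """Approximate A ~= C @ U @ R by sampling k columns and k rows of A
    with probabilities proportional to their squared norms."""
    col_p = (A ** 2).sum(axis=0); col_p /= col_p.sum()
    row_p = (A ** 2).sum(axis=1); row_p /= row_p.sum()
    cols = rng.choice(A.shape[1], size=k, replace=False, p=col_p)
    rows = rng.choice(A.shape[0], size=k, replace=False, p=row_p)
    C, R = A[:, cols], A[rows, :]                 # actual columns/rows: interpretable factors
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R) # optimal middle factor for the chosen C, R
    return C, U, R, cols, rows

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((256, 32)) @ rng.standard_normal((32, 512))  # rank-32 test matrix
    C, U, R, cols, rows = cur_decompose(A, k=64)
    err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
    print(f"relative reconstruction error: {err:.2e}; first kept columns: {cols[:5]}")
```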
Birds of a Feather
Artificial Intelligence & Machine Learning
Education & Workforce Development
Livestreamed
Recorded
TP
XO/EX
DescriptionThe National Science Foundation's Office of Advanced Cyberinfrastructure (OAC) supports developing and providing state-of-the-art cyberinfrastructure (CI) resources, including HPC systems, tools, and services to advance science and engineering. The central vision of OAC is to support sustainable research workforce development by leveraging CI across domains. A particular focus now is on integrating artificial intelligence (AI) into CI while utilizing CI for AI. We continue to facilitate innovation and innovative usage of CI+AI, democratized access, and the development of sustainable CI ecosystems. We seek to engage the community and institutions to obtain feedback on the evolving needs of the CI+AI workforce.
Panel
Applications & Application Frameworks
High Performance I/O, Storage, Archive, & File Systems
Scalable Data Analytics & Management
Livestreamed
Recorded
TP
DescriptionExascale computing systems enable Earth system models and observation systems for environmental prediction to resolve finer and finer scales. For climate prediction, long time scales are necessary, requiring exascale computation, but producing petascale data volumes. The traditional model of storing, analyzing, and copying climate model output or observations from the largest computers and storage centers on the planet will fail for petascale data. Turning computation into usable predictions requires a new generation of tools and workflows to produce usable information for society. Petascale data requires new technologies including cloud computing and object storage, analysis servers, new software frameworks and artificial intelligence methods. This panel will focus on how exascale climate model output is being used with new frameworks to enable Earth system prediction for society, and how petascale data can be efficiently "democratized" to enable new classes of users to benefit from exascale computing.
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
DescriptionSpatial decision support systems (SDSS) are pivotal in resolving complex geospatial challenges but face critical limitations in harmonizing conflicting objectives, capturing behavioral heterogeneity, and enabling efficient large-scale data processing. A further central challenge is that current geospatial cyberinfrastructure (GeoCI) is inefficient in supporting real-time data access and struggles to deliver the scalable computation needed to address complex, multidimensional problems. This research addresses these limitations by presenting a unified GeoCI framework powered by geospatial artificial intelligence (GeoAI). Our framework provides a dual-capability platform, enabling both participatory, stakeholder-driven planning for scenarios like offshore wind siting, and high-fidelity, agent-based simulations for dynamic phenomena such as epidemic transmission. These applications are underpinned by an intelligent service tier in which a machine learning model reduces data retrieval times by over 80%, making the entire system more scalable and responsive. This study provides a validated template for building next-generation spatial decision support systems that can balance the needs of all stakeholders, simulate complex real-world dynamics, and process massive geospatial datasets in real time, thereby achieving greater efficiency, inclusiveness, and adaptability to the complexities of the real world.
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionThe image shows a visualization of one timestep of a model of the physics of the Greenland ice-sheet. The inset is an artist-generated image representing the location of the sites where cores of the bedrock beneath the ice were drilled, as well as contours of the ice thickness and degree of coverage relative to the area of the island.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe computing continuum has emerged as a promising paradigm for decentralized data processing. This approach brings computation closer to data sources, reducing latency and enabling faster insights. However, managing such distributed systems introduces new challenges, particularly in ensuring the availability and reliability of data across heterogeneous and failure-prone environments. In this paper, we focus on addressing these challenges by introducing DagOnStore as a novel component of the DAGonStar workflow engine, integrating it with the DynoStore wide-area storage system to provide resilient and location-transparent data access. DagOnStore implements reliability and availability schemes based on erasure codes and utilization-aware load-balancing to guarantee that input and output data remain accessible and consistent, even in the presence of storage node failures or disconnections. We validate our approach through different tests, demonstrating that DagOnStore enables scalable and fault-tolerant workflow execution across the computing continuum with minimal user intervention.
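To illustrate the reliability mechanism in spirit, the sketch below shows the simplest possible erasure code: a single XOR parity shard over k data shards, which tolerates the loss of any one shard. Production systems (including, presumably, the codes used by DagOnStore/DynoStore) use more general Reed-Solomon-style schemes, so this is illustrative only.

```python
def xor_bytes(chunks):
    """Byte-wise XOR of equal-length byte strings."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)

def encode(data: bytes, k: int):
    """Split data into k equal shards (zero-padded) plus one XOR parity shard."""
    shard_len = -(-len(data) // k)                         # ceiling division
    shards = [data[i * shard_len:(i + 1) * shard_len].ljust(shard_len, b"\0")
              for i in range(k)]
    return shards + [xor_bytes(shards)]                    # last shard is the parity

def recover(all_shards, missing_index):
    """Rebuild one missing shard as the XOR of all remaining shards."""
    present = [s for i, s in enumerate(all_shards) if i != missing_index]
    return xor_bytes(present)

if __name__ == "__main__":
    shards = encode(b"workflow intermediate data", k=4)
    lost = 2
    assert recover(shards, lost) == shards[lost]
    print("lost shard recovered:", recover(shards, lost))
```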
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
DescriptionThe Distributed Asynchronous Object Storage (DAOS) is an open source software-defined high performance scalable storage system that has redefined performance for a wide spectrum of AI and HPC workloads. This PDSW25 WiP session presents first performance results of DAOS running on NVIDIA NDR InfiniBand, HPE Slingshot 400, Cornelis Omni-Path 400, and 400GbE Ethernet fabrics.
Paper
Algorithms
Livestreamed
Recorded
TP
DescriptionThis paper presents DAS-ILU, a distributed asynchronous parallel incomplete LU factorization method based on domain decomposition. DAS-ILU partitions the computational domain into independently processed interior nodes and asynchronously updated separator nodes, thereby reducing cross-processor dependencies and halving the separator size compared to conventional methods. To further improve performance, it employs optimized data exchange patterns to minimize communication overhead and extends support to block-structured sparse matrices via exact block inversions. Comprehensive evaluations on a range of problem types—including structural mechanics, computational fluid dynamics, and reservoir simulation—demonstrate the superior performance of DAS-ILU. Compared to state-of-the-art ILU implementations, DAS-ILU achieves solve time speedups of up to 2.07× over Chow-Patel's fine-grained parallel ILU and up to 4.11× over HYPRE's ILU. Moreover, DAS-ILU exhibits strong robustness when applied to challenging non-symmetric and indefinite systems.
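For context on the baseline operation, the snippet below shows a single-node incomplete LU factorization used as a preconditioner via SciPy. It is only a serial reference point for the kind of computation DAS-ILU parallelizes and distributes, not the paper's algorithm; the test matrix, drop tolerance, and fill factor are illustrative assumptions.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spilu, LinearOperator, gmres

# Slightly shifted 1D Laplacian as an illustrative sparse test system.
n = 2000
A = sp.diags([-1.0, 2.05, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

ilu = spilu(A, drop_tol=1e-4, fill_factor=10)      # incomplete LU factors of A
M = LinearOperator((n, n), matvec=ilu.solve)       # apply them as a preconditioner
x, info = gmres(A, b, M=M)
print("converged" if info == 0 else f"info={info}",
      "| residual norm:", np.linalg.norm(A @ x - b))
```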
Workshop
Livestreamed
Recorded
TP
W
DescriptionLarge-scale international collaborations such as ATLAS rely on globally distributed workflows and data management to process, move, and store vast volumes of data. ATLAS’s Production and Distributed Analysis (PanDA) workflow system and the Rucio data management system are each highly optimized for their respective design goals. However, operating them together at global scale exposes systemic inefficiencies, including underutilized resources, redundant or unnecessary transfers, and altered error distributions. Moreover, PanDA and Rucio currently lack shared performance awareness and coordinated, adaptive strategies.
This work charts a path toward co-optimizing the two systems by diagnosing data-management pitfalls and prioritizing end-to-end improvements. With the observation of spatially and temporally imbalanced transfer activities, we develop a metadata-matching algorithm that links PanDA jobs and Rucio datasets at the file level, yielding a complete, fine-grained view of data access and movement. Using this linkage, we identify anomalous transfer patterns that violate PanDA’s data-centric job-allocation principle.
Workshop
Debugging & Correctness Tools
HPC Software & Runtime Systems
Livestreamed
Recorded
TP
W
DescriptionWe propose a data race detection approach for code written in a source programming language, by means of AI-agent translation to a target language, followed by conventional tool-based detection in the target language. We evaluate this translate-then-check approach by translating the C/Fortran+OpenMP programs in DataRaceBench to the Go programming language, and using the Go data race detector to check for races. The translation is controlled through natural language prompts, similar to approaches popularized as vibe coding. Translate-then-check achieves 92.8% accuracy and 9 false negatives for the C programs in DataRaceBench, compared to 89.9% accuracy and 17 false negatives for Clang+ThreadSanitizer applied directly to the original C programs. We discuss the approach and overall accuracy, as well as individual programs for which translate-then-check leads to false negatives or positives, in part due to limitations of the Go data race checker, and limitations of the translation.
Birds of a Feather
Clouds & Distributed Computing
Livestreamed
Recorded
TP
XO/EX
DescriptionThis BoF session will examine the potential of decentralized data center architectures to advance high performance computing (HPC) and artificial intelligence (AI). Discussions will focus on the technical, operational, and policy-driven dimensions of distributed systems, including edge computing, federated learning, and energy-efficient design strategies. The session aims to engage researchers, system architects, and data center professionals in exploring scalable and resilient alternatives to traditional centralized infrastructures. Participants will have an opportunity to exchange insights, share implementation experiences, and foster collaborations that inform the future of computing infrastructure.
Tutorial
Livestreamed
Recorded
TUT
DescriptionDeep learning is rapidly and fundamentally transforming the way science and industry use data to solve problems. Deep neural network models have been shown to be powerful tools for extracting insights from data across a large number of domains, from large language models (LLMs) to protein folding. As these models grow in complexity to solve increasingly challenging problems with larger and larger datasets, the need for scalable methods and software to train them grows accordingly. The Deep Learning at Scale tutorial aims to provide attendees with a working knowledge of deep learning on HPC-class systems, including core concepts, scientific applications, performance optimization, tips, and techniques for scaling. We will provide access to large GPU HPC systems, example code, and datasets to allow attendees to experiment hands-on with optimized, scalable distributed training of deep neural network machine learning models from real scientific computing applications.
Paper
Applications
GBC
Livestreamed
Recorded
TP
DescriptionFor decades, supercritical flame simulations incorporating detailed chemistry and real-fluid transport have been limited to millions of cells, constraining the resolved spatial and temporal scales of the physical system.
We optimize the supercritical flame simulation software DeepFlame—which incorporates deep neural networks while retaining real-fluid mechanical and chemical accuracy—from three perspectives: parallel computing, computational efficiency, and I/O performance. Our highly optimized DeepFlame achieves a supercritical liquid oxygen/methane (LOX/CH4) turbulent combustion simulation of up to 618 and 154 billion cells with unprecedented time-to-solution, attaining 439/1186 and 187/316 PFlop/s (32.3%/21.8% and 37.4%/31.8% of the peak) in FP32/mixed-FP16 precision on Sunway (98,304 nodes) and Fugaku (73,728 nodes) supercomputers, respectively. This computational capability surpasses existing capacities by three orders of magnitude, enabling the first practical simulation of rocket engine combustion with >100 LOX/CH4 injectors. This breakthrough establishes high-fidelity supercritical flame modeling as a critical design tool for next-generation rocket propulsion and ultra-high energy density systems.
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionConvection-permitting-scale climate projections for WA have been delivered through partnerships between state government, universities, and the Pawsey Supercomputing Research Centre. These partnerships halve data production time and significantly cut computational walltime via optimized workflows. The dataset represents the highest-resolution regional climate projections for WA and supports robust climate risk assessment and resource planning.
Tutorial
Livestreamed
Recorded
TUT
DescriptionHPC leadership and management skills are essential to the success of HPC. This includes securing funding, procuring the right technology, building effective support teams, ensuring value for money, and delivering a high-quality service to users. This tutorial will provide practical, experience-based training on delivering HPC. This includes stakeholder management, requirements capture, market engagement, hardware procurement, benchmarking, bid scoring, acceptance testing, total cost of ownership, cost recovery models, metrics, and value. The presenters have been involved in numerous major HPC procurements in several countries, over three decades, as HPC managers or advisors. The tutorial is applicable to HPC procurements and service delivery in most countries, public or private sector, and is based on experiences from a diversity of real-world cases. The lead author, Andrew Jones, has become the de facto international leader in delivering training on these topics, with a desire to improve the best practices of the community, and without a sales focus or product to favor. The SC tutorials by these authors have been consistently among the most strongly attended and highly rated by attendees for several years.
Birds of a Feather
Programming Frameworks
Livestreamed
Recorded
TP
XO/EX
DescriptionA recent surge in novel AI accelerator hardware (e.g., from Cerebras, Groq, SambaNova, and Tenstorrent) has sparked intense interest in running HPC applications and driven algorithmic research on these architectures. However, the maturity levels of associated tooling and software stacks vary significantly, with gaps in documentation and support. This BoF aims to explore trends in AI accelerators for HPC, ideal programming models, software stack support, code portability, training resources, and common challenges. We hope to foster a community of users interested or experienced in using AI accelerators for HPC applications.
Paper
Applications
Performance Measurement, Modeling, & Tools
State of the Practice
Livestreamed
Recorded
TP
DescriptionDeep neural networks are known to be resilient to random bitwise faults in their parameters. However, this resilience has primarily been established through evaluations on classification models. The extent to which this claim holds for large language models remains under-explored.
In this work, we conduct an extensive measurement study on the impact of random bitwise faults in commercial-scale language model inference. We first expose that these language models are not truly resilient to random bit-flips. While aggregate metrics such as accuracy may suggest resilience, an in-depth inspection of the generated outputs shows significant degradation in text quality. Our analysis also shows that tasks requiring more complex reasoning suffer more from performance and quality degradation. Moreover, we extend our resilience analysis to models with augmented reasoning capabilities, such as Chain of Thought or Mixture of Experts architectures.
Paper
Algorithms
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
State of the Practice
Livestreamed
Recorded
TP
DescriptionModern high performance computing (HPC) applications are increasingly vulnerable to silent data corruptions (SDCs) caused by transient hardware faults. While selective instruction duplication (SID) offers an efficient software-level protection strategy, existing SID methods rely on SDC vulnerability profiles derived from only the default reference input often found in application suites. However, they overlook the input-dependent nature of SDC propagation. This leads to significant SDC coverage loss when inputs vary. We present PROTEGO, a novel input-aware SID protection framework that efficiently adapts protection to runtime inputs. PROTEGO performs a one-time vulnerability-guided input exploration to identify a small number of input groups with distinct SID protection patterns. At runtime, PROTEGO uses lightweight features derived from input arguments to select and deploy the appropriate SID protection. Our evaluation across 10 HPC applications demonstrates the effectiveness and efficiency of PROTEGO in mitigating SDC coverage loss across diverse inputs, compared to existing SID techniques.
Workshop
Livestreamed
Recorded
TP
W
DescriptionData movement bottlenecks have become the dominant performance limiter in modern computing systems. At the same time, scientific detectors generate overwhelming data volumes; X-ray detectors may soon produce terabytes per second, and high-energy physics experiments demand bandwidth on the order of petabytes per second. Streaming compression can reduce data movement overheads, hardware accelerators can further enhance data flow, and the exploration of system-level hardware compressors represents an untapped opportunity. This paper presents a preliminary study on enabling hardware evaluation of streaming compressors. We designed and implemented a custom hardware accelerator for scientific data compression using modern hardware description languages, providing a complete end-to-end hardware acceleration system for CPU-based platforms. Our prototype features a multi-stage state machine, parallel element processing, and optimized data transfers, achieving a 1.45× speedup over a software baseline with comparable quality, with 31% fewer cycles per element and 45% faster compression throughput.
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
DescriptionGPU-accelerated HPC and deep learning workloads now operate at scales of tens to thousands of GPUs, making collective communication a dominant cost. Applications such as Amber, heFFTe, and distributed LLM training require frequent synchronization and exchange of large data partitions. At the same time, systems are increasingly heterogeneous: clusters combine NVIDIA, AMD, and Intel GPUs with interconnects such as NVLink, Infinity Fabric, InfiniBand, and Slingshot. Many MPI runtimes remain tuned for CPU-centric designs, performing unnecessary host staging, adding extra copies, and underutilizing high-bandwidth device paths or multi-rail topology. Support for newer stacks, particularly SYCL and Level Zero on Intel GPUs, is also uneven, hindering performance portability.
We present a unified, GPU-aware collective framework that targets portability and efficiency across vendors and networks. For Alltoall, we design IPC-based intra-node paths that avoid host staging and introduce push and pull variants that overlap intra- and inter-node transfers. For Allreduce, we implement on-device reduction kernels with native inter-node GPU support and computation-communication overlap; for medium messages at large scale, we add a direct sendrecv algorithm with throttling to balance bandwidth and latency. The framework extends to Intel GPUs via SYCL and Level Zero, alongside CUDA and ROCm back ends. To mitigate inter-node bandwidth limits for very large messages, we integrate a lightweight casting-based compression that downcasts in flight with negligible accuracy loss. Together, these designs provide efficient Alltoall and Allreduce across NVIDIA, AMD, and Intel platforms, improving end-to-end performance while reducing CPU involvement and data movement overhead.
ACM Gordon Bell Climate Modeling Finalist
Awards and Award Talks
Applications
GBC
Livestreamed
Recorded
TP
DescriptionWe present the first digital twin framework that operationalizes the production of multi-decadal, global climate projections at kilometer-scale resolution, developed within the European Union’s Destination Earth initiative. Using three coupled Earth system models and selected impact-sector applications, we have built end-to-end workflows for both regular and on-demand climate projections on two EuroHPC supercomputers, LUMI and MareNostrum5. These workflows produced the first-ever multi-decadal simulations at 5 km resolution across all major Earth system components, using the same output parameters and grid, and achieving a production throughput of 0.6 simulated years per day and a climate data portfolio of 6.6 petabytes. We demonstrate the scalability of two of these Earth system models across both CPU- and GPU-based systems at global resolutions up to 1 km, across atmosphere, ocean, land, and sea-ice, and report record-breaking full-machine performance on LUMI and MareNostrum5 of up to 97 simulated days per day at 1 km resolution.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionHigh performance computing (HPC) systems frequently execute large-scale sparse matrix computations in scientific and engineering domains. These workloads are susceptible to silent data corruptions (SDCs)—undetected faults that can alter results without triggering errors—posing a significant risk to computational integrity. In this work, we show how injected errors in sparse matrices propagate during repeated sparse matrix-vector multiplication (SpMV) executions and evaluate whether hardware performance counter (PMC) patterns can be used to detect such corruptions. We conduct controlled experiments with Gaussian noise injection at varying magnitudes and injection rates, record hardware counter values using the Linux perf tool, and train a decision tree classifier to distinguish corrupted runs from clean runs. Experiments on four real-world matrices from the SuiteSparse Matrix Collection yield detection accuracies around 90%–99% with under 2% runtime overhead. The results confirm that PMC-based classification is a viable approach for lightweight SDC detection.
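To illustrate the classification step in isolation, here is a minimal scikit-learn sketch that trains a decision tree on hardware-counter-style features. The feature names and the synthetic numbers are invented for the example and do not come from the poster's experiments or the SuiteSparse matrices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400
# Hypothetical perf-counter features per SpMV run: [cycles, cache_misses, branch_misses]
clean   = rng.normal([1.00e9, 2.0e6, 1.0e5], [5e7, 1e5, 5e3], size=(n, 3))
corrupt = rng.normal([1.05e9, 2.6e6, 1.3e5], [5e7, 1e5, 5e3], size=(n, 3))
X = np.vstack([clean, corrupt])
y = np.r_[np.zeros(n), np.ones(n)]           # 0 = clean run, 1 = corrupted run

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```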
Workshop
Livestreamed
Recorded
TP
W
DescriptionThere are two sources of inaccuracy when simulating parallel and distributed computing systems: (i) a simulator implemented at an insufficient level of detail; and (ii) incorrectly calibrated simulation parameter values. Increasing the simulator's level of detail can improve accuracy, but at the cost of higher space, time, and/or software complexity. Furthermore, evaluating the intrinsic accuracy of a simulator requires that its parameters be well-calibrated. Making decisions regarding the level of detail is thus challenging. We propose a methodology for instantiating the simulation calibration process and a framework for automating this process, which makes it possible to pick appropriate levels of detail for any simulator. We demonstrate the usefulness of our approach via two case studies for two different domains.
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionCurrently there is no recognized standard for determining the energy efficiency of a single-phase liquid-to-liquid coolant distribution unit (CDU). To compound the complexity of making meaningful decisions about energy use and efficiency, there is yet to be an agreed-upon efficiency metric for this type of cooling equipment. With no efficiency metric, how can engineers and owners determine what CDU is best for their application? Understanding the energy efficiency of the CDU is critical to determining the ROI, OPEX, and TCO of the cooling system.
In this presentation, Dave Meadows will address the current confusion surrounding liquid-to-liquid single-phase coolant distribution unit (CDU) energy efficiency. Standards organizations such as ASHRAE, AHRI, and ANSI are racing to catch up with liquid cooling technology innovations but still have much work ahead of them to develop universally accepted standards. Dave will discuss a reasonable approach to determining the energy efficiency of a single-phase liquid-to-liquid CDU utilizing current related test standards and proposed ASHRAE rating conditions, allowing CDUs from different manufacturers to be fairly compared side by side. He will discuss methods for evaluating the efficiency of both the TCS and FWS loops as it relates specifically to the CDU.
Workshop
Livestreamed
Recorded
TP
W
Workshop
Performance Evaluation, Scalability, & Portability
Livestreamed
Recorded
TP
W
DescriptionThis paper presents the development of a performance-portable distributed FFT implementation on top of the Kokkos ecosystem. Thanks to Kokkos and kokkos-fft, we largely simplify the implementation of distributed FFT while retaining performance portability. We develop new features such as batched distributed FFT and interfaces to vendor distributed FFT libraries. We demonstrate that our distributed FFT runs efficiently on NVIDIA A100 and AMD MI250X GPUs while maintaining reasonable performance on CPUs.
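For readers unfamiliar with the underlying pattern, the sketch below shows the usual slab-decomposed distributed FFT: transform the locally contiguous axis, transpose across ranks with an Alltoall, then transform the remaining axis. It assumes mpi4py and NumPy and mirrors the general pattern only, not the kokkos-fft implementation.

    # Sketch: slab-decomposed 2D FFT with an Alltoall transpose (assumes mpi4py, NumPy).
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    P, rank = comm.Get_size(), comm.Get_rank()
    N = 64                                    # global size, assumed divisible by P
    rows = N // P

    local = np.random.rand(rows, N) + 0j      # this rank's row slab of the global array
    step1 = np.fft.fft(local, axis=1)         # FFT along the locally contiguous axis

    # Pack P blocks of shape (rows, rows), exchange them, and rebuild the
    # transposed slab so the formerly distributed axis becomes local.
    send = np.ascontiguousarray(step1.reshape(rows, P, rows).transpose(1, 0, 2))
    recv = np.empty_like(send)
    comm.Alltoall(send, recv)
    transposed = recv.transpose(2, 0, 1).reshape(rows, N)

    step2 = np.fft.fft(transposed, axis=1)    # result: the 2D FFT, stored transposed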
Paper
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionDisaggregation of hardware resources and integration of heterogeneous accelerators are two emerging trends in datacenters. Existing data systems focus on either disaggregated systems with CPUs or incorporation of heterogeneous accelerators within traditional monolithic servers. None can adequately address the challenges posed by systems that are both disaggregated and heterogeneous.
We present DHAP, an end-to-end framework comprising a query compiler and a specialized runtime, designed to efficiently process online analytical queries in a disaggregated and heterogeneous environment. At higher levels the compiler, a planning module, automatically identifies efficient execution plans. At lower levels, optimizations are applied to generate executable code for heterogeneous back-ends. The runtime efficiently processes queries on disaggregated CPU/GPU compute nodes, facilitating inter-stage pipelined execution and minimizing communication costs. Experiments show that DHAP achieves near-optimal solutions, with latency speedups of up to 16.3x on SSB and TPC benchmarks. Furthermore, it attains significant speedups compared to existing query processing systems.
Workshop
Debugging & Correctness Tools
HPC Software & Runtime Systems
Livestreamed
Recorded
TP
W
DescriptionMany high-performance programs can benefit from parallelism, which can create orders-of-magnitude speedups in their performance. However, translating code into its parallel equivalent is challenging, time-consuming, and error-prone. In recent years there has been a move to automate this process, creating algorithms to perform translations. While automation removes the manual effort, it needs to be accompanied by strong validation. Incorrect translation can lead to data races, poor performance, rounding problems, or unexpected behavior. In this paper, we present a dynamic validation approach called Seq2ParDiff that uses differential testing to check conformance of the parallel program to the original sequential program version. We evaluate Seq2ParDiff on two sets of benchmarks for OpenMP programs. In the first, we find 20 new faults, outperforming state-of-the-art static techniques. In the second, we find many faults that other tools miss; however, we are less effective at finding some types of data races.
Paper
Architectures & Networks
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionThe Mixture-of-Experts (MoE) model reduces the computation of large LLMs by sparsely activating experts, but its massive parameter storage creates severe GPU memory bottlenecks. Existing solutions offload experts to host memory and prefetch them with sophisticated policies, yet they target single-batch inference and suffer from communication bottlenecks at larger batch sizes. We identify two forms of locality in expert activation: a small set of experts are frequently invoked across inference (global locality), while others recur within short decoding bursts (temporal locality). To exploit this, we propose DiffMoE, which introduces a differential cache hierarchy in GPU memory. Globally hot experts reside in per-layer high-priority caches, locally hot ones are dynamically managed in per-layer medium-priority caches under a priority-driven replacement policy, and cold experts are cached temporarily and evicted on demand. Moreover, a lightweight predictor overlaps expert migration with computation to reduce latency. Evaluation shows DiffMoE outperforms the state-of-the-art systems significantly.
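The differential cache hierarchy can be pictured with the small stand-in below: globally hot experts are pinned, locally hot experts share a small replaceable pool, and everything else is fetched on demand. This is plain Python illustrating the priority idea only; the identifiers and the LRU replacement policy are simplified assumptions, not DiffMoE's implementation.

    # Sketch: two-tier expert cache with a pinned tier and an LRU pool (plain Python).
    from collections import OrderedDict

    class ExpertCache:
        def __init__(self, pinned_ids, pool_size, fetch_fn):
            self.fetch = fetch_fn                               # loads weights from host memory
            self.pinned = {e: fetch_fn(e) for e in pinned_ids}  # globally hot: never evicted
            self.pool_size = pool_size
            self.pool = OrderedDict()                           # locally hot: LRU-managed

        def get(self, expert_id):
            if expert_id in self.pinned:
                return self.pinned[expert_id]
            if expert_id in self.pool:
                self.pool.move_to_end(expert_id)                # refresh recency
                return self.pool[expert_id]
            weights = self.fetch(expert_id)                     # cold expert: fetch on demand
            self.pool[expert_id] = weights
            if len(self.pool) > self.pool_size:
                self.pool.popitem(last=False)                   # evict least recently used
            return weights

    cache = ExpertCache(pinned_ids=[0, 3], pool_size=4, fetch_fn=lambda e: f"weights[{e}]")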
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionDiffPro is a simple framework to speed up and shrink diffusion models while preserving image quality. It combines layer-wise quantization, guided by a manifold-based sensitivity check with adaptive timestep selection. Compared with quantization-only or sampling-only baselines, this joint strategy yields better FID–memory trade-offs. We evaluate on MNIST, CIFAR-10, and CelebA using both PTQ and QAT, showing meaningful size reductions (e.g., ~34 MB on a CIFAR-10 setup) with competitive FID. Current limits include occasional mis-ranking of a few blocks at high noise and a focus on unconditioned models. Next, we will extend DiffPro to text/class-conditioned diffusion, replace hand-tuned thresholds with a budgeted optimizer that co-selects per-layer bit-widths and timesteps using per-timestep sensitivity, and incorporate hardware-aware costs.
Workshop
Livestreamed
Recorded
TP
W
DescriptionDan Isaacs' presentation will survey the state of the art of digital twins and their uses today, both across industry and, in particular, in HPC data center design, deployment, and maintenance. Dan's talk will feature a short interview with Dr. Michael Grieves, who originated the digital twin concept and wrote the seminal book on it. Dr. Grieves has over five decades of executive, board, and technical experience in both global and entrepreneurial technology and manufacturing companies.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionAs heterogeneous supercomputing becomes mainstream, traditional hybrid models such as MPI+OpenMP increasingly struggle to coordinate and manage GPU memory while maintaining portable performance.
This work introduces DiOMP-Offloading, a framework that unifies OpenMP target offloading with a PGAS model. Built atop LLVM/OpenMP and using GASNet-EX as the communication layer, DiOMP-Offloading centrally manages global memory regions, providing a globally addressable space for remote put/get operations. It integrates OMPCCL, a portable device-side collective layer that enables the use of vendor collective backends by reconciling allocation life-cycles and address translation. Instead of relying on separate MPI+OpenMP, DiOMP-Offloading improves scalability and programmability by abstracting away replicated device-memory and communication management logic. Demonstrations on large-scale platforms show that DiOMP-Offloading delivers better performance in micro-benchmarks and applications under a single PGAS+OpenMP offloading model. These results indicate that DiOMP-Offloading can contribute to a more portable, scalable, and efficient path forward for heterogeneous computing.
Workshop
Livestreamed
Recorded
TP
W
DescriptionHigh-performance computing faces rising core counts, increasing heterogeneity, and growing memory bandwidth demands. These trends complicate programmability, portability, and scalability, while traditional MPI + OpenMP struggles with distributed GPU memory and portable performance.
We present DiOMP-Offloading, a framework unifying OpenMP target offloading with a Partitioned Global Address Space (PGAS) model. Built on LLVM-OpenMP and GASNet-EX, it centrally manages global memory and supports symmetric/asymmetric GPU allocations, enabling remote put/get operations. DiOMP also integrates OMPCCL, a portable device-side collective layer that harmonizes allocation lifecycles and address translation across vendor backends.
By eliminating separate MPI + X stacks and abstracting replicated device memory and communication logic, DiOMP improves scalability and programmability. Experiments on large-scale NVIDIA A100, Grace Hopper, and AMD MI250X platforms show superior micro-benchmark and application performance, demonstrating that DiOMP-Offloading offers a more portable, scalable, and efficient path for heterogeneous supercomputing.
Workshop
Livestreamed
Recorded
TP
W
DescriptionOver the past two decades, the scientific workflow community has transformed how science is conducted at scale — moving from manual scripting and ad hoc data movement to automated, intelligent systems that enable reproducible, data-driven discovery. Since its founding in 2006, the WORKS workshop has chronicled this evolution, documenting the shift from early grid-enabled workflows to today’s AI-augmented, self-optimizing systems.
This talk reflects on that journey through the lens of the Pegasus Workflow Management System, one of the earliest and most enduring platforms for scientific automation. It traces how core computer science principles—abstraction, optimization, provenance, and adaptability—have guided Pegasus’s evolution from workflow planning for distributed systems to orchestrating AI-driven, agentic, and self-managing workflows. The talk will conclude with a look toward the future of science automation, where hybrid physical and cyber infrastructures work together to advance knowledge and discovery.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
Description3D Gaussian Splatting (3D-GS) has recently emerged as a powerful technique for real-time, photorealistic rendering by optimizing anisotropic Gaussian primitives from view-dependent images. While 3D-GS has been extended to scientific visualization, prior work remains limited to single-GPU settings, restricting scalability for large datasets on high performance computing (HPC) systems. We present a distributed 3D-GS pipeline tailored for HPC. Our approach partitions data across nodes, trains Gaussian splats in parallel using multi-nodes and multi-GPUs, and merges splats for global rendering. To eliminate artifacts, we add ghost cells at partition boundaries and apply background masks to remove irrelevant pixels. Benchmarks on the Richtmyer–Meshkov datasets (about 106.7M Gaussians) show up to 3X speedup across 8 nodes on Polaris while preserving image quality. These results demonstrate that distributed 3D-GS enables scalable visualization of large-scale scientific data and provide a foundation for future in situ applications.
Paper
Applications
Architectures & Networks
HPC for Machine Learning
Livestreamed
Recorded
TP
DescriptionVision-based scientific foundation models hold significant promise for advancing scientific discovery and innovation. This potential stems from their ability to aggregate images from diverse sources—such as varying physical groundings or data acquisition systems—and to learn spatio-temporal correlations using transformer architectures. However, tokenizing and aggregating images can be compute-intensive, a challenge not fully addressed by current distributed methods. In this work, we introduce the Distributed Cross-Channel Hierarchical Aggregation (D-CHAG) approach designed for datasets with a large number of channels across image modalities. Our method is compatible with any model-parallel strategy and any type of vision transformer architecture, significantly improving computational efficiency. We evaluated D-CHAG on hyperspectral imaging and weather forecasting tasks. When integrated with tensor parallelism and model sharding, our approach achieved up to a 75% reduction in memory usage and more than doubled sustained throughput on up to 1,024 AMD GPUs on the Frontier supercomputer.
Tutorial
Livestreamed
Recorded
TUT
DescriptionDeep learning (DL) is rapidly becoming pervasive in almost all areas of computer science, and is even being used to assist computational science simulations and data analysis. A key behavior of these deep neural networks (DNNs) is that they reliably scale, i.e., they continuously improve in performance when the number of model parameters and amount of data grow. As the demand for larger, more sophisticated, and more accurate DL models increases, the need for large-scale parallel model training, fine-tuning, and inference has become increasingly pressing. Subsequently, in the past few years, several parallel algorithms and frameworks have been developed to parallelize model training and inference on GPU-based platforms. This tutorial will introduce and provide basics of the state of the art in distributed deep learning. We will use large language models (LLMs) as a running example, and teach the audience the fundamentals involved in performing the three essential steps of working with LLMs: (1) training an LLM from scratch, (2) continued training/fine-tuning of an LLM from a checkpoint, and (3) inference on a trained LLM. We will cover algorithms and frameworks falling under the purview of data parallelism (PyTorch DDP and DeepSpeed), and tensor parallelism (AxoNN).
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionHigh performance computing (HPC) workloads are driving rack power densities beyond 100 kW, creating unprecedented stress on data center cooling and power systems. Conventional CFD-based digital twins provide high-fidelity design optimization but are too computationally intensive and rigid for operational use. We present the first physics-constrained Distributed Modular Digital Twin Network (DMDTN), designed for real-time performance evaluation, load prediction, and fault detection. Each subsystem (e.g., cooling, power, IT load) is represented by an AI-driven surrogate model, interconnected through conservation laws and coordinated via a distributed message bus. This modular design preserves physical consistency while enabling scalability and rapid adaptability. Using synthetic datasets, DMDTN achieved ~60% lower prediction error (RMSE 172 vs. 450) and more than 2× faster training (201 vs. 442 seconds) than a monolithic model, while maintaining robustness under stress. DMDTN complements CFD by enabling accurate, real-time operational management of HPC data centers.
Workshop
Livestreamed
Recorded
TP
W
DescriptionMLIR (Multi-Level Intermediate Representation) is a popular framework for implementing domain-specific compilers for optimizing matrix/tensor computations. However, no support currently exists in MLIR for distributed sparse tensor computations. In this paper, we describe the design and implementation of a new MLIR dialect for high-level specification of distributed sparse tensor computations. This specification is then lowered to MPI-based code for distributed execution. We illustrate the expressiveness of the new dialect by implementing a Graph Attention Network computation with multiple sparse tensor operators.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionComputational fluid dynamics (CFD) simulations are essential tools for analyzing complex flow phenomena in engineering and scientific research. These simulations are typically formulated based on the Navier-Stokes equations, which govern the motion of incompressible fluids, and the pressure field is obtained by solving the Poisson equation using iterative solvers. However, iterative convergence is not always guaranteed. In certain cases, the residuals diverge, leading to numerical instability and eventual simulation failure. When divergence occurs after tens of thousands of time steps, it results in substantial waste of computational resources and delays research progress. To address this problem, this study proposes an AI-based divergence prediction system. By training on data from prior simulations, the proposed method can predict divergence within about one hundred time steps. This early detection allows simulations to be interrupted before significant resources are consumed, thereby improving efficiency and supporting timely progress in computational research.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionDiffusion models create high-quality images but are slow because denoising steps run in sequence. We present a hybrid parallel diffusion framework speeding up generation on mixed-capacity GPUs while keeping images coherent. First, we split each image into patches sized by each GPU’s memory (i.e., memory-aware partitioning), so stronger devices handle more work and weaker ones are not overloaded. Second, we build a fast, low-resolution preview of the full image and use it to guide every patch, preventing seams and preserving global structure. Third, we parallelize time with a parareal strategy: a coarse pass provides guesses, fine solvers refine segments in parallel, and corrections align results. While GPUs compute, they share boundary pixels asynchronously to hide communication. Finally, cosine-weighted blending stitches patches into a seamless output. Early tests show lower idle time, better scaling, and consistent quality on images.
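The final stitching step can be illustrated with the NumPy sketch below, which blends overlapping patches with separable raised-cosine (Hann) weights; patch shapes and coordinates are illustrative assumptions rather than the framework's actual layout.

    # Sketch: cosine-weighted blending of overlapping patches (NumPy only).
    import numpy as np

    def hann2d(h, w):
        return np.hanning(h)[:, None] * np.hanning(w)[None, :]   # separable window

    def blend(patches, coords, out_shape):
        """patches: list of (h, w) arrays; coords: top-left (y, x) of each patch."""
        acc = np.zeros(out_shape)
        wsum = np.full(out_shape, 1e-8)          # avoid division by zero
        for p, (y, x) in zip(patches, coords):
            w = hann2d(*p.shape)
            acc[y:y + p.shape[0], x:x + p.shape[1]] += w * p
            wsum[y:y + p.shape[0], x:x + p.shape[1]] += w
        return acc / wsum                        # per-pixel weighted average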
Workshop
Livestreamed
Recorded
TP
W
DescriptionIn this work, we conduct an experimental study to explore applicability of LLMs for configuring, annotating, and translating scientific workflows. We use three different workflow-specific experiments and evaluate several open- and closed-source language models using state-of-the-art workflow systems. Our studies reveal that LLMs often struggle due to a lack of training data for scientific workflows. We further observe that the performance of LLMs varies across experiments and workflow systems. We discuss the implications of our findings and draw attention to several approaches extending LLM capabilities for scientific workflows. Our findings can help workflow developers and users in understanding LLM capabilities in scientific workflows, and motivate further research applying LLMs to workflows.
Workshop
Livestreamed
Recorded
TP
W
DescriptionCeph is a widely used distributed object store, but its messenger layer imposes substantial CPU overhead on the host. To address this limitation, we propose DoCeph, a DPU-offloaded storage architecture for Ceph that disaggregates the system by offloading the communication-intensive messaging component to the DPU while retaining the storage backend on the host. The DPU efficiently manages communication, using lightweight RPC for metadata operations and DMA for data transfer. Moreover, DoCeph introduces a pipelining technique that overlaps data transmission with buffer preparation, mitigating hardware-imposed transfer size limitations. We implemented DoCeph on a Ceph cluster with NVIDIA BlueField-3 DPUs. Evaluation results indicate that DoCeph cuts host CPU usage by up to 92% while sustaining stable throughput and providing larger performance benefits for object writes over 1 MB.
Birds of a Feather
Quantum & Other Post Moore Computing Technologies
Livestreamed
Recorded
TP
XO/EX
DescriptionAs AI and post-exascale demands reshape high performance computing, and neuromorphic systems push the boundaries of brain-inspired efficiency, a key question emerges: Does HPC need neuromorphic architectures to stay sustainable—or is neuromorphic computing reliant on HPC infrastructure to scale competitively?
This BoF session aims to unite researchers, practitioners, and industry leaders to explore overlapping interests between neuromorphic computing and HPC. The session will consist of a concise overview of the state of neuromorphic computing, insights from a diverse panel of experts, and an interactive discussion on mutual benefits and future collaboration.
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionThe TCS, or secondary cooling loop (also called the secondary fluid network), is often overlooked, yet it is just as critical as the coolant distribution unit (CDU). As the essential link enabling the liquid-to-chip solution, its reliability directly dictates the health and performance of your entire liquid-cooled system. This session will provide attendees with the dos, don'ts, and critical considerations necessary for implementing a high-performance, reliable, and scalable secondary fluid network. Learn how to ensure your infrastructure can truly support next-generation, high-density computing and AI.
• Reliability & Uptime: Must ensure 24/7 365-day operation without leaks or failure, as any fault can bring down the entire IT load.
• Thermal Performance: Must maintain the correct flow rate and temperature differential to effectively transfer heat from the cold plates to the CDU, and must ensure wetted-materials compatibility.
• Pressure Integrity: Must handle the system's required operating pressures, including transient spikes, with sufficient safety margins.
Workshop
Livestreamed
Recorded
TP
W
DescriptionDori has been the Joint Genome Institute's (JGI) primary high-performance computing cluster since 2022. It has been serving a wide array of research and production workloads, processing petabytes of data. This paper presents the challenges of running a mixed production/research cluster, a governance and communications model that enables addressing user needs, and a suite of software tools that have been developed to address the diverse set of requirements that result from a user base that manages and processes large amounts of data.
Paper
Architectures & Networks
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionSecure, efficient, and scalable AllReduce-based data aggregation is essential for artificial intelligence (AI) and scientific applications on modern high performance computing (HPC) and cloud infrastructures. As AllReduce is increasingly used across these distributed infrastructures, privacy has become a critical concern. State-of-the-art (SOTA) homomorphic encryption (HE)-based AllReduce solutions introduce high overhead, require secure key exchanges, and remain vulnerable to collusion.
We propose DPAR, the first differentially private, collusion-resistant AllReduce framework optimized for large-scale HPC and AI workloads. DPAR introduces three key innovations: integrating differential privacy (DP) to eliminate collusion risks without key exchanges, scalable noise growth to preserve accuracy, and performance optimizations using a noise pooling mechanism.
DPAR is a drop-in Message Passing Interface (MPI) AllReduce replacement, providing strong privacy with minimal performance cost. Evaluated on Delta and Frontier supercomputers with up to 8,192 cores, DPAR outperforms the SOTA HE solution by up to 34.7% in modern AI workloads.
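The core differential-privacy step can be pictured as below: each rank perturbs its local buffer with Gaussian noise before the reduction, so no participant ever observes another rank's exact contribution. The sketch assumes mpi4py and NumPy; the choice of sigma and DPAR's noise-pooling optimization are not modeled.

    # Sketch: differentially private AllReduce via local Gaussian noise (assumes mpi4py, NumPy).
    import numpy as np
    from mpi4py import MPI

    def dp_allreduce(local, sigma, comm=MPI.COMM_WORLD):
        noisy = local + np.random.normal(0.0, sigma, size=local.shape)
        out = np.empty_like(noisy)
        comm.Allreduce(noisy, out, op=MPI.SUM)
        return out          # sum of true contributions plus aggregate noise

    if __name__ == "__main__":
        grad = np.random.rand(1 << 16)
        reduced = dp_allreduce(grad, sigma=0.01)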
Paper
Applications
Architectures & Networks
Livestreamed
Recorded
TP
DescriptionApproximate nearest neighbor search (ANNS) is essential for applications like recommendation systems and retrieval-augmented generation (RAG), but is highly I/O-intensive and memory-demanding. CPUs face I/O bottlenecks, while GPUs are constrained by limited memory. DRAM-based Processing-in-Memory (DRAM-PIM) offers a promising alternative by providing high bandwidth, large memory capacity, and near-data computation. This work introduces DRIM-ANN, the first optimized ANNS engine leveraging UPMEM’s DRAM-PIM. While UPMEM scales memory bandwidth and capacity, it suffers from low computing power because of the limited processor embedded in each DRAM bank. To address this, we systematically optimize ANNS approximation configurations and replace expensive squaring operations with lookup tables to align the computing requirements with UPMEM’s architecture. Additionally, we propose load-balancing and I/O optimization strategies to maximize parallel processing efficiency. Experimental results show that DRIM-ANN achieves a 2.46× speedup over a 32-thread CPU and up to 2.67× over a GPU when deployed on computationally enhanced PIM platforms.
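The lookup-table substitution follows the standard asymmetric-distance pattern sketched below: squared distances to each codebook centroid are computed once per query, after which scanning the quantized database reduces to table lookups and additions. This is a NumPy illustration of the general idea only; the sizes and layout are assumptions, not the DRIM-ANN kernels.

    # Sketch: lookup-table (ADC-style) distance scan over product-quantized vectors (NumPy).
    import numpy as np

    M, K, d = 8, 256, 64                         # subspaces, centroids per subspace, dimension
    sub = d // M
    codebooks = np.random.rand(M, K, sub)        # per-subspace centroids
    codes = np.random.randint(0, K, size=(10000, M))   # database stored as centroid indices

    def scan(query):
        q = query.reshape(M, sub)
        # One table of squared distances per subspace, computed once per query...
        tables = ((codebooks - q[:, None, :]) ** 2).sum(axis=2)     # shape (M, K)
        # ...then each database distance is just M lookups and adds, with no squaring.
        return tables[np.arange(M), codes].sum(axis=1)

    dist = scan(np.random.rand(d))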
Workshop
Livestreamed
Recorded
TP
W
DescriptionPreempting attacks that target supercomputing systems before damage is done remains a top security priority. The main challenge is that noisy attack attempts and unreliable alerts often mask real attacks, causing permanent damage such as system integrity violations and data breaches. This paper describes a security testbed embedded in the live traffic of a supercomputer at the National Center for Supercomputing Applications (NCSA). Deployment of our testbed at NCSA enabled the following key contributions:
1) Insights from characterizing unique attack patterns found in real security logs of 228 security incidents curated in the past two decades at NCSA.
2) Deployment of an attack visualization tool to illustrate the challenges of identifying real attacks in high-performance computing (HPC) environments and to support security operators in interactive attack analyses.
3) Demonstration of the utility of the testbed by running dynamic models, such as factor-graph-based models, to preempt a real-world ransomware family.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThread coarsening is a well known optimization technique for GPUs.
It enables instruction-level parallelism, reduces redundant computation, and can provide better memory access patterns.
However, the presence of divergent control flow - cases where uniformity of branch conditions among threads cannot be proven at compile time - diminishes its effectiveness.
In this work, we implement multi-level thread coarsening for CPU and GPU OpenMP code by applying a generic thread coarsening transformation on LLVM IR.
We introduce dynamic convergence - a new technique that generates both coarsened and non-coarsened versions of divergent regions in the code and allows for the uniformity check to happen at runtime instead of compile time.
We evaluated on HeCBench for GPUs and LULESH for CPUs.
We found that the best-case speedup without dynamic convergence was 4.6% for GPUs and 2.9% for CPUs, while with dynamic convergence enabled we achieved 7.5% for GPUs and 4.3% for CPUs.
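Loosely, the runtime uniformity check behind dynamic convergence can be pictured with the Python stand-in below: each block of lanes takes the coarsened (vectorized) path only when its branch condition is uniform, and falls back to per-element execution otherwise. This is an analogy in plain NumPy/Python, not the generated LLVM IR.

    # Sketch: runtime uniformity check choosing between a coarsened and a scalar path.
    import numpy as np

    def kernel(x, cond, block=4):
        out = np.empty_like(x)
        for start in range(0, len(x), block):
            xs = x[start:start + block]
            cs = cond[start:start + block]
            if cs.all() or not cs.any():                  # branch is uniform across the block
                out[start:start + block] = 2 * xs if cs[0] else xs + 1   # coarsened path
            else:                                         # divergent: per-element fallback
                for i in range(len(xs)):
                    out[start + i] = 2 * xs[i] if cs[i] else xs[i] + 1
        return out

    x = np.arange(16, dtype=float)
    print(kernel(x, x % 3 == 0))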
Workshop
Livestreamed
Recorded
TP
W
DescriptionHPC workloads continue to grow in complexity and resource demands, requiring large-scale compute clusters.
To achieve maximal efficiency, multi-node workloads should be scheduled on network-adjacent nodes. SLURM supports topology-aware scheduling using a cluster topology configuration file. However, in large or dynamic environments, nodes may be added or removed at any time, making it crucial to maintain an accurate view of the cluster’s network topology.
Having up-to-date information about network structure is even more important in cloud environments, where users have less control over compute resources than in on-premises setups.
In this talk, we introduce Topograph, an open-source tool that automatically discovers and maintains cluster network topology. Topograph supports both CSPs and on-premises environments, and can be deployed in SLURM and Kubernetes clusters, including hybrid SLURM-on-Kubernetes systems.
By exposing detailed, real-time network topology, Topograph enables HPC workloads to run on nodes with optimal interconnectivity, improving performance and resource efficiency.
Workshop
Livestreamed
Recorded
TP
W
DescriptionBinary instrumentation provides the ability to instrument and modify a program after the compilation process has completed. Operating on the binary level allows instrumentation of the program as it was produced by the compiler. In addition, it can operate on programs or libraries for which you may not have the source code. Binary instrumentation is the foundation for a wide variety of tools, including those for performance profiling, debugging, tracing, architectural simulation, and digital forensics. Dyninst is a free and open-source suite of toolkits for building binary analysis and instrumentation tools for architectures including x86, ARM, and Power. It is used in tools produced by industry, academia, and research labs. This paper describes our efforts to port Dyninst to the RISC-V architecture. We discuss the challenges presented by RISC-V, our approaches to solving them, and the status of Dyninst on RISC-V.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe explosive growth of large-scale Deep Learning (DL) models has made energy consumption a first-order operational cost and constraint in modern High-Performance Computing (HPC) datacenters. Existing DL schedulers, however, are largely single-objective and energy oblivious, struggling to balance the competing demands of performance, fairness, and Quality of Service (QoS). To address this flaw, we propose a methodology for the co-design of multi-objective and energy-aware schedulers together with the associated simulation framework, the so-called EAS-Sim. Our methodology stands as a systematic approach to enhance State-of-the-Art (SOTA) scheduling heuristics with energy-efficiency objectives.
Using our framework, we design and evaluate four novel and malleable job schedulers. Our flagship energy-aware policy, Zeus, establishes a new Pareto-optimal frontier and reduces total energy consumption by ≈8-10% compared to the SOTA performance scheduler Pollux with no statistically significant loss in system throughput. EAS-Sim is available as open-source on GitHub.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionCurrent bioacoustic monitoring technologies cost $600-$1,000+ per device and require manual data retrieval and maintenance by experts, preventing real-time insights and limiting deployment scale. We develop a prototype autonomous monitoring and detection system that streams high-quality audio in real time, while drastically reducing costs and operational overhead.
Our approach combines Listener, a $375 solar-powered recording device built with ESP32 and AudioMoth, with Aggregator, a $210 Raspberry Pi 5-based hub that collects streams from multiple Listeners over WiFi HaLow while performing local inference using Cornell BirdNET. The system eliminates manual retrieval through continuous streaming, supports 25 simultaneous Listeners per Aggregator, and provides integration for visualization and storage. We successfully deployed the system at organic vineyards in Michigan, demonstrating its practical viability.
Our poster presents the system architecture, real-time metrics and analysis results, power consumption benchmarks, and cost comparisons to highlight how this solution enables biodiversity monitoring at unprecedented scale.
Workshop
Livestreamed
Recorded
TP
W
DescriptionScientific data acquisition (SciDAQ) systems are shifting from archive-based workflows to streaming paradigms, where real-time, fine-grained network monitoring becomes essential. While P4-enabled devices offer per-packet in-band observability, they require specialized switches and routers. Host-side tools like Prometheus exporters lack sufficient temporal granularity. To bridge this gap, we present eCounter, a lightweight, hardware-agnostic, inline telemetry agent built on extended Berkeley Packet Filter (eBPF). eCounter captures per-interface ingress and egress traffic, categorized by IP address and protocol, at millisecond to sub-millisecond resolution. In a 100 Gbps environment, it continuously exports up to 3,257 time-series bins per second with only 4% CPU utilization at a 35 KiB/s data rate. We evaluate eCounter across diverse NIC MTU settings, hook types, CPU architectures and operating systems, and observed negligible impact on concurrent high-throughput streaming applications. Complexity analysis confirms that it can be readily scaled to distributed SciDAQ deployments.
Paper
BP
State of the Practice
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionExisting cloud-oriented container deployment frameworks fail to address the unique challenges of edge environments, including geographic distribution, device heterogeneity, and resource constraints. This leads to suboptimal performance for latency-sensitive edge services like HPC/AI-powered autonomous driving, which demand rapid startup and immediate responsiveness.
Current on-demand image solutions require excessive client-registry communication, resulting in prolonged round-trip time (RTT)—a particularly severe limitation in geographically distributed edge platforms. Furthermore, the user-space file system (FUSE), typically employed to handle device heterogeneity, introduces substantial overhead to the native I/O stack. Our findings reveal that on-demand image solutions exacerbate storage pressure on resource-constrained edge devices. To overcome these challenges, we introduce EDDE, an edge-optimized container deployment framework that redesigns the on-demand image pipeline. When compared to state-of-the-art on-demand solutions, EDDE delivers containers 147% faster on average, reduces native I/O latency by up to 28%, and decreases storage usage by an average of 34%.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe EduHPC workshop welcomes manuscripts from academia, industry, and national laboratories on topics including HPC, PDC, data science, scalable artificial intelligence and machine learning, and the Internet of Things and Edge computing (IoT/Edge). These topics relate to education at both the undergraduate and graduate levels, as well as professional training and workforce development. Given the increasing relevance of AI workloads on HPC systems, this edition of the workshop will particularly emphasize ML pedagogy. Historically, the workshop has accepted contributions from fields such as CS, CSE, DS, and computational courses across STEM and non-STEM disciplines. The workshop aims to foster collaboration among stakeholders within the SC context. It is a platform for discussing pedagogical challenges, solutions, and opportunities for HPC/PDC/DS/AI/ML/IoT education. Activities at the workshop include paper presentations, invited keynotes, panels on topics like sustainability and reproducibility in technical education, and special sessions for sharing resources and opportunities for collaboration.
Paper
Applications
Performance Measurement, Modeling, & Tools
State of the Practice
Livestreamed
Recorded
TP
DescriptionHigh performance computing (HPC) systems are crucial for scientific advancement and engineering breakthroughs. Unexpected performance degradation or system failures can severely impact these endeavors. This paper introduces NodeSentry, a novel unsupervised anomaly detection framework tailored for compute nodes of large-scale HPC systems. NodeSentry leverages a combined approach of coarse-grained clustering and fine-grained model sharing to effectively address the challenges posed by the massive node scales, frequent job transitions, and complex patterns characteristic of modern HPC deployments. Evaluation on two real-world HPC datasets demonstrates NodeSentry's superior performance, achieving an F1 score exceeding 0.876. This represents a 0.560 average improvement over existing best baseline methods, while simultaneously reducing training overhead by an average of 45.69%. Furthermore, to promote reproducibility and contribute to the broader research community, we open-source NodeSentry's codebase and introduce a novel clustering adjustment and anomaly labeling tool specifically designed for HPC systems.
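As a rough stand-in for the coarse-grained clustering and model-sharing idea, the sketch below groups nodes by their telemetry profile and trains one anomaly detector per cluster instead of one per node. It uses scikit-learn's KMeans and IsolationForest purely for illustration; NodeSentry's actual models and features differ.

    # Sketch: cluster node profiles, then share one anomaly model per cluster (scikit-learn).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(1)
    profiles = rng.random((500, 12))      # per-node telemetry features (placeholder)

    clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(profiles)
    models = {c: IsolationForest(random_state=0).fit(profiles[clusters == c])
              for c in np.unique(clusters)}

    def is_anomalous(node_features, cluster_id):
        return models[cluster_id].predict(node_features.reshape(1, -1))[0] == -1

    print(is_anomalous(profiles[0], clusters[0]))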
Tutorial
Livestreamed
Recorded
TUT
DescriptionOver the past decade, GPUs became ubiquitous in HPC installations around the world, delivering the majority of performance of some of the largest supercomputers, steadily increasing the available compute capacity. Finally, four exascale systems are deployed (Frontier, Aurora, El Capitan, JUPITER), using GPUs as the core computing devices for this era of HPC. To take advantage of these GPU-accelerated systems with tens of thousands of devices, application developers need to have the proper skills and tools to understand, manage, and optimize distributed GPU applications. In this tutorial, participants will learn techniques to efficiently program large-scale multi-GPU systems. While programming multiple GPUs with MPI is explained in detail, advanced tuning techniques and complementing programming models like NCCL and NVSHMEM are also presented. Tools for analysis are shown and used to motivate and implement performance optimizations. The tutorial teaches fundamental concepts that apply to GPU-accelerated systems of any vendor in general, taking the NVIDIA platform as an example. This tutorial is a combination of lectures and hands-on exercises, using the JUPITER system for interactive learning and discovery.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe embedding layer is essential in deep learning, transforming high-dimensional data into compact representations. However, growing datasets and model sizes pose challenges in training time, memory, and generalization. We propose a scalable method for embedding initialization via spectral dimensionality reduction using dominant eigenvector projections.
The proposed approach leverages MIRAMns, a multiple implicitly restarted Arnoldi method with nested subspaces, to extract the most informative directions from large and potentially sparse data representations. Unlike traditional embeddings or autoencoders, the proposed approach requires few tunable parameters and is inherently parallel. We apply MIRAMns to matrix representations such as covariance and co-occurrence matrices to compute low-dimensional embeddings that preserve data structure and variance. Experiments across diverse datasets show that the proposed method achieves comparable or better accuracy with significantly reduced dimensionality, enabling smaller, faster deep networks. Additionally, our parallel implementation scales efficiently on HPC platforms, making it well-suited for large-scale scientific and AI workloads.
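A compact way to picture the projection step is the sketch below, which uses SciPy's ARPACK-based eigsh as a stand-in for MIRAMns to take the dominant eigenvectors of a covariance matrix and project the data onto them; matrix sizes and the target dimensionality are illustrative assumptions.

    # Sketch: embedding initialization by projection onto dominant eigenvectors (NumPy/SciPy).
    import numpy as np
    from scipy.sparse.linalg import eigsh

    X = np.random.rand(10000, 512)             # raw high-dimensional features
    Xc = X - X.mean(axis=0)
    C = (Xc.T @ Xc) / (len(X) - 1)             # covariance matrix (512 x 512)

    k = 32                                      # target embedding dimensionality
    vals, vecs = eigsh(C, k=k, which="LM")      # k dominant eigenpairs (ARPACK)
    embedding = Xc @ vecs                       # low-dimensional initialization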
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe influence maximization problem seeks to identify a subset of k vertices in a network that, when activated, maximizes the spread of influence under a given diffusion process. It is NP-hard to find the optimal set of influential vertices; thus, recent studies have focused on developing algorithms to find an approximate solution. The state-of-the-art parallel implementations leverage a sketch-based algorithm called influence maximization via martingales (IMM). However, IMM incurs significant memory overhead due to the storage requirements of graph traversal samples called random reverse reachable (RRR) sets. In this paper, we introduce efficient Influence Maximization (eIM), a novel GPU-accelerated IMM algorithm designed to improve the efficiency and scalability of IMM. Compared to two popular GPU implementations, eIM achieves similar accuracy with one to three orders of magnitude speedups while reducing the memory requirement to store network data and RRR sets up to 54%.
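For context, the IMM core that produces and consumes RRR sets can be sketched as below: sample reverse-reachable sets under an independent-cascade-style model, then greedily pick the k vertices covering the most sets. This is a plain-Python toy, not eIM's GPU implementation or its memory optimizations.

    # Sketch: RRR-set sampling plus greedy max-cover seed selection (plain Python).
    import random

    def rrr_set(radj, p, n):
        """Reverse traversal from a random root; each incoming edge kept with prob. p."""
        root = random.randrange(n)
        seen, frontier = {root}, [root]
        while frontier:
            v = frontier.pop()
            for u in radj.get(v, []):
                if u not in seen and random.random() < p:
                    seen.add(u)
                    frontier.append(u)
        return seen

    def imm_greedy(radj, n, k, num_sets=20000, p=0.1):
        sets = [rrr_set(radj, p, n) for _ in range(num_sets)]
        seeds, covered = [], set()
        for _ in range(k):
            best = max(range(n), key=lambda v: sum(1 for i, s in enumerate(sets)
                                                   if i not in covered and v in s))
            seeds.append(best)
            covered |= {i for i, s in enumerate(sets) if best in s}
        return seeds

    radj = {1: [0], 2: [0, 1], 3: [2]}           # toy reverse adjacency list
    print(imm_greedy(radj, n=4, k=2, num_sets=2000))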
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionThe gprMax simulation models the propagation of electromagnetic fields from an aboveground source into the earth and models the interactions with the materials the fields come into contact with. The video shows the propagation of the electric field (blue) through the ground and its interactions with a pair of perpendicular metallic pipes. These simulations enable the modeling of the return signature of subterranean structures, which has a variety of industry applications.
Early Career
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionSmall group practice time.
Early Career
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionNow that we have an idea of storytelling, we'll practice by crafting a short story that piques the listener's interest enough that they ask for more information or even schedule a follow-up meeting.
Workshop
Livestreamed
Recorded
TP
W
DescriptionAs in the previous 10 years, this workshop will bring together application experts, software developers, and hardware engineers, both from industry and academia, to share experiences and best practices to leverage the practical application of reconfigurable logic to scientific computing, AI/ML, and "big data" applications. In particular, the workshop will focus on sharing experiences and techniques for accelerating applications and/or improving energy efficiency with FPGAs using high-level design flows, which enable and improve cross-platform functional and performance portability while also improving productivity. Particular emphasis is given to cross-platform comparisons and combinations that foster a better understanding within the industry and research community on what are the best mappings of applications to a diverse range of hardware architectures that are available today (e.g., FPGA, GPU, many-cores and hybrid devices, ASICs), and on how to most effectively achieve cross-platform compatibility.
Birds of a Feather
Artificial Intelligence & Machine Learning
Livestreamed
Recorded
TP
XO/EX
DescriptionAs AI and ML technologies converge with computational science and engineering, practitioners face emerging challenges in developing and deploying effective workflows. This BoF session invites SC25 attendees to discuss critical issues including workflow composition and orchestration, containerization, robust data management, and AI integration with simulation workflows and infrastructure. We will discuss a variety of AI workflows that include LLMs, science models, and AI agents. Through a short keynote, lightning talks, small group discussion, and short audience polls, this BoF is designed to be interactive and focus on audience interests. Join us to share experiences and identify key challenges.
Workshop
Debugging & Correctness Tools
HPC Software & Runtime Systems
Livestreamed
Recorded
TP
W
DescriptionThis session will feature short presentations about emerging correctness tools.
Birds of a Feather
Storage
Livestreamed
Recorded
TP
XO/EX
DescriptionHistory has never seen applications induce changes in storage architectures and requirements so fast. Agentic workloads, KV caching for LLM inference, vector databases, relational graphs, and GNNs offer new challenges: bringing insight from unstructured data, improving IOPs/TCO for fine-grained access to unbounded data, permissions, and interoperability of GPU-initiated storage with traditional file/object systems. To spur the creativity of our community and frame emerging opportunities for new technologies, we’ve gathered experts on usage models, vendor technologists, and CSPs. The audience will gain new knowledge of emerging tech and new perspectives on topics that they may have only recently heard of.
Workshop
Livestreamed
Recorded
TP
W
DescriptionLarge-scale deep learning workloads increasingly face I/O bottlenecks as datasets exceed local storage and GPU compute outpaces network and disk speeds. While recent systems optimize data-loading time, they often ignore I/O energy costs—a critical factor at scale. We present EMLIO, an Efficient Machine Learning I/O service that minimizes both end-to-end data-loading latency (𝑇) and I/O energy consumption (𝐸) across variable-latency networked storage. EMLIO uses a lightweight data-serving daemon on storage nodes to serialize and batch raw samples, stream them over TCP with out-of-order prefetching, and integrate with GPU-accelerated (NVIDIA DALI) preprocessing on the client side. In evaluations over local disk, LAN (0.05 ms & 10 ms RTT), and WAN (30 ms RTT), EMLIO achieves up to 8.6× faster I/O and 10.9× lower energy use than state-of-the-art loaders, maintaining constant performance and energy profiles across distances. Its service-based architecture offers a scalable blueprint for energy-aware I/O in next-generation AI clouds.
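The client-side overlap of loading and training can be pictured with the standard-library sketch below, where background threads prefetch batches into a bounded queue and hand them to the consumer out of order. The function names are placeholders; EMLIO's daemon protocol, DALI preprocessing, and energy accounting are not modeled.

    # Sketch: out-of-order batch prefetching on background threads (standard library only).
    import queue, threading

    def prefetching_loader(fetch_batch, num_batches, depth=4, workers=2):
        q = queue.Queue(maxsize=depth)
        indices = iter(range(num_batches))
        lock = threading.Lock()

        def worker():
            while True:
                with lock:
                    i = next(indices, None)
                if i is None:
                    break
                q.put(fetch_batch(i))        # blocking network/disk read
            q.put(None)                      # one completion sentinel per worker

        for _ in range(workers):
            threading.Thread(target=worker, daemon=True).start()

        finished = 0
        while finished < workers:
            item = q.get()
            if item is None:
                finished += 1
            else:
                yield item                   # training proceeds while workers keep fetching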
Invited Talk
Life Sciences
Societal Impact
Livestreamed
Recorded
TP
DescriptionFrom the outset, human space flight has depended on real-time control, guidance, decision support, and problem solving from Earth. Extended human presence beyond low-Earth orbit will require greater autonomy than is currently possible. This talk will discuss the new challenges for human space flight missions and approaches to increasing Earth independence, including advances in communication technologies, sensors, onboard intelligent systems, and crew interaction.
Best Poster Presentations (Research, ACM SRC Grad/Undergrad)
Research & ACM SRC Posters
TP
DescriptionExascale simulations generate massive data volumes that strain I/O and post-hoc analysis. We integrate the Damaris in situ middleware into Coddex, a crystal deformation code, to offload data movement and analysis to dedicated processes, enabling runtime extraction of key diagnostics without writing intermediate files. We evaluate tin hysteresis cases on CEA’s INTI cluster (with 14 nodes, 1,728 cores) and compare against a ParaView-based post-hoc pipeline. In situ analysis eliminates per-iteration I/O stalls and reduces output time by up to 5x while preserving overall iteration time, with benefits increasing with the number of tracked variables. We describe the integration design, process pinning, and data exchange, and outline forthcoming support for additional analyses. This work is conducted within the Exa-DoST project of the PEPR NumPEx program, which aims to build the software infrastructure for the first exascale machine expected to be set up in France (Alice Recoque, Jules Verne project).
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionAdjoint-based, matrix-free Newton-Krylov methods have long been the gold standard for solving high-dimensional, ill-posed inverse problems. These methods require a pair of forward and adjoint PDE solves per iteration, usually making them intractable for real-time inference and prediction. We present FFTMatvec, an FFT-based GPU-accelerated algorithm that exploits intrinsic problem structure to enable real-time, high-fidelity, extreme-scale inference and prediction for linear autonomous dynamical systems. This algorithm was used to solve a Bayesian inverse problem for tsunami early warning with over one billion parameters in under 0.2 seconds. The application is performance-portable and open-source; scaling results are presented for up to 4,096 GPUs on OLCF's Frontier and NERSC's Perlmutter supercomputers. On 512 GPUs, FFTMatvec achieves more than a 200,000x speedup over state-of-the-art matrix-free adjoint-based methods. Communication-aware partitioning and dynamic mixed precision provide additional performance boosts. Other application areas include nuclear treaty verification and monitoring atmospheric CO2.
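The general trick behind FFT-based fast matrix-vector products is to exploit convolution-like structure: a Toeplitz matrix can be embedded in a circulant matrix, whose action is a circular convolution computable in O(n log n) with FFTs. The sketch below shows this classical technique only; it is not the FFTMatvec algorithm itself.

```python
import numpy as np

def toeplitz_matvec_fft(c, r, x):
    """Multiply a Toeplitz matrix by a vector in O(n log n) via circulant embedding.

    c: first column, r: first row (r[0] must equal c[0]), x: input vector.
    """
    n = len(c)
    v = np.concatenate([c, [0.0], r[-1:0:-1]])        # first column of a 2n x 2n circulant
    xp = np.concatenate([x, np.zeros(n)])             # zero-pad x to length 2n
    y = np.fft.ifft(np.fft.fft(v) * np.fft.fft(xp))   # circular convolution via FFT
    return y[:n].real

if __name__ == "__main__":
    # Sanity check against an explicit dense Toeplitz matrix (requires SciPy).
    from scipy.linalg import toeplitz
    rng = np.random.default_rng(0)
    c, r, x = rng.standard_normal(6), rng.standard_normal(6), rng.standard_normal(6)
    r[0] = c[0]
    assert np.allclose(toeplitz(c, r) @ x, toeplitz_matvec_fft(c, r, x))
```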
Workshop
Partially Livestreamed
Partially Recorded
TP
W
DescriptionAdapting foundation models via fine-tuning often negates the benefits of sparsity, as common sparse-to-dense training results in high inference costs measured in Floating-Point Operations (FLOPs). We propose PHOENIX, a framework designed for efficient sparse inference on the Cerebras CS-2 wafer-scale accelerator. PHOENIX employs an innovative strategy that merges sparse model weights with low-rank adapters, preserving high levels of sparsity throughout the adaptation process without sacrificing accuracy. It leverages the CS-2's native support for unstructured sparsity to accelerate inference computations.
Across multiple models and tasks, PHOENIX maintains accuracy comparable to dense baselines even at 50–60% sparsity. This high level of sparsity enables a near 2x reduction in FLOPs and a 1.7x improvement in inference throughput compared to a single NVIDIA A100 GPU, demonstrating a practical path to efficient, deployable sparse models.
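A minimal sketch of the merge-then-mask idea, under the assumption that the adapter update is folded into the base weights and re-masked with the existing zero pattern; the actual PHOENIX merging strategy and sparsity handling on the CS-2 may differ.

```python
import torch

def merge_lora_preserving_sparsity(w: torch.Tensor, a: torch.Tensor, b: torch.Tensor):
    """Fold a low-rank update (b @ a) into a sparse weight matrix w while keeping
    its zero pattern, so the merged weights stay sparse for inference.

    w: (out, in) weight tensor with unstructured zeros, a: (rank, in), b: (out, rank).
    """
    mask = (w != 0).to(w.dtype)     # existing unstructured sparsity pattern
    return (w + b @ a) * mask       # update only the retained (nonzero) weights

if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(128, 256) * (torch.rand(128, 256) > 0.6)   # ~60% sparse layer
    a, b = torch.randn(4, 256), torch.randn(128, 4)             # rank-4 adapter
    merged = merge_lora_preserving_sparsity(w, a, b)
    print(f"fraction of zeros after merge: {(merged == 0).float().mean():.2f}")
```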
Workshop
Livestreamed
Recorded
TP
W
DescriptionAs the increasing energy consumption of High-Performance Computing (HPC) systems places greater strain on electric grid infrastructure, operational strategies for load balancing become critically important. Energy-aware scheduling offers a promising solution by enabling HPC systems to function as actively managed loads within the energy grid. Despite extensive theoretical research on this strategy, practical implementations and real-system evaluations remain scarce. To bridge this gap, we introduce a systematic approach to developing, evaluating, and implementing energy-aware scheduling without modifications to Slurm's core scheduler. Our method includes a novel mechanism for per-job power prediction based on Large Language Model embeddings of enriched job scripts, coupled with a lightweight, deployable scheduling strategy. Our predictor reduces per-job power MAE by 15% compared to the current state-of-the-art, and our simulated scheduler shifts 4.0 MWh onto on-site solar without throughput loss. These results demonstrate a clear and practical pathway to production deployment of energy-aware scheduling in HPC.
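To make the prediction step concrete, the sketch below fits a regressor from job-script embeddings to per-job power and reports MAE. The embeddings and power labels here are synthetic stand-ins; in the described approach they would come from an LLM encoding of enriched job scripts and from measured power, and the model choice is illustrative only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for LLM embeddings of job scripts and measured average power.
rng = np.random.default_rng(42)
embeddings = rng.standard_normal((2000, 64))
power_watts = 200 + 50 * embeddings[:, 0] - 30 * embeddings[:, 1] + rng.normal(0, 10, 2000)

x_tr, x_te, y_tr, y_te = train_test_split(embeddings, power_watts, random_state=0)
model = GradientBoostingRegressor().fit(x_tr, y_tr)
print(f"per-job power MAE: {mean_absolute_error(y_te, model.predict(x_te)):.1f} W")
```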
Workshop
Performance Evaluation, Scalability, & Portability
Livestreamed
Recorded
TP
W
DescriptionPerformance portability in HPC and embedded systems is often limited by power and thermal constraints. The OpenMP programming model offers a compile-time mechanism known as variants, allowing different function specializations. Previous research extended this concept to the runtime level, enabling dynamic variant selection. We build on these foundations with an energy-aware runtime that augments variant selection with low-overhead power and temperature instrumentation and a multi-criteria policy balancing power caps, thermal headroom, and performance. Implemented in LLVM and publicly available, our mechanism profiles per-variant energy and thermal behavior, selecting specializations at runtime based on user-defined thresholds and live system state. Validation on HPC and embedded platforms shows the runtime enforces dynamic power caps with 98.5% compliance on a workstation (versus 67% unconstrained). On thermally constrained edge devices, proactive CPU/GPU migration beats hardware throttling, cutting execution time by 39% while maintaining stability. In a simulated battery-limited mission, energy-aware selection extends battery lifetime.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionMultimodal large language models (MLLMs) extend text-only LLMs with image and video encoders, enabling new capabilities but introducing high and poorly understood energy costs. This work characterizes the energy footprint of MLLM inference at the stage level, decomposing serving into vision encoding, prefill, and decoding for image–text models. Using NVML-based measurements on an NVIDIA A100 with realistic workloads, we demonstrate how encoder design and input complexity (resolution, image count) increase the number of visual tokens and shift energy toward prefill. Our novel contribution is linking token growth to serving inefficiency and demonstrating two practical controls: complexity-aware batching and stage-conditioned DVFS, which reduce energy while meeting latency SLOs. Current results highlight disproportionate energy growth from multimodal inputs, and the study outlines stage-wise breakdowns, token-driven scaling curves, and prototype controls that motivate future input-aware scheduling policies for energy-efficient multimodal inference.
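Stage-level energy attribution of the kind described here rests on sampling GPU power around each serving stage. A minimal sketch using NVML via pynvml is shown below; the sampling interval and the trapezoid-free energy estimate are simplifications, and the function names of the measured stages are placeholders.

```python
import threading
import time
import pynvml

def measure_energy(fn, gpu_index=0, interval_s=0.05):
    """Run fn() while sampling GPU power with NVML; return (result, joules).

    Energy is approximated as the sum of power samples times the sampling interval,
    so short stages need a small interval to be captured accurately.
    """
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    samples, stop = [], threading.Event()

    def sampler():
        while not stop.is_set():
            samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW -> W
            time.sleep(interval_s)

    t = threading.Thread(target=sampler, daemon=True)
    t.start()
    result = fn()                   # e.g., a vision-encoding, prefill, or decoding stage
    stop.set()
    t.join()
    pynvml.nvmlShutdown()
    return result, sum(samples) * interval_s
```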
Birds of a Feather
Democratization of HPC
Livestreamed
Recorded
TP
XO/EX
DescriptionThe DoD has invested significant time and funding to support a large base of users on a variety of HPC-backed projects. This BoF will use lightning talks about current research, technology acquisition plans, and software development needs and interests to illustrate DoD goals and opportunities for engagement. These lightning talks are intended to help external organizations and researchers connect with DoD users and sites to encourage partnerships and help solve problems. External engagement will help DoD users and HPC sites grow expertise and connect to the larger HPC community.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe widespread adoption of Large Language Models (LLMs) has led to increased demand for large-scale inference services, presenting a unique set of challenges for the HPC community. These services are characterized by moderate-scale models that require dedicating expensive GPUs to handle bursty inference requests, leading to high costs and resource underutilization. In this paper, we propose SwapServeLLM, a novel engine-agnostic hot-swapping method for cost-effective inference. This model hot-swapping approach is enabled by recent driver capabilities for transparent GPU checkpointing. SwapServeLLM optimizes resource utilization by dynamically allocating GPU resources with two key mechanisms: (1) demand-aware preemption that leverages information about concurrent requests, and (2) efficient request routing with memory reservation that minimizes inference latency. Our evaluation demonstrates that SwapServeLLM speeds up model loading for state-of-the-art inference engines by 31× compared to vLLM and by up to 29% compared to Ollama, enabling cost-effective inference.
Workshop
Partially Livestreamed
Partially Recorded
TP
W
DescriptionThis work enhances the capabilities of code LLMs in CUDA-to-SYCL kernel translation with parameter-efficient fine-tuning. The resulting fine-tuned LLM, called ChatPORT, is an effort to provide high-fidelity translations from one programming model to another. We describe the preparation of datasets from heterogeneous computing benchmarks for model fine-tuning and testing, the parameter-efficient fine-tuning of 19 open-source code models ranging in size from 0.5 to 34 billion parameters, and the evaluation of the correctness of the SYCL kernels produced by the fine-tuned models. The experimental results show that most code models fail to translate CUDA codes to SYCL correctly. However, fine-tuning these models using a small set of CUDA and SYCL kernels can enhance their kernel translation capabilities. Depending on the size of the model, the correctness rate ranges from 19.9% to 81.7% on a test dataset of 62 CUDA kernels.
Workshop
Education & Workforce Development
Livestreamed
Recorded
TP
W
DescriptionHigh Performance Computing (HPC) is a critical driver of progress in artificial intelligence (AI), data-intensive science, and engineering. At the National University of Singapore, concepts of parallelism are taught in courses such as Parallel Computing and Parallel and Concurrent Programming. These provide strong theoretical foundations, but gaps remain in systems-level competencies, particularly in deploying, optimizing, and scaling applications on real HPC platforms. To address this, we introduced initiatives such as participation in student cluster competitions to train students in resource management, profiling, monitoring, and containerized workflows. This experiential learning bridges theory with operational expertise. Challenges include the steep learning curve of complex systems, limited access to shared infrastructure, and the need for up-to-date instructional expertise. Sustainable HPC curriculum development requires gradual expansion of topics, integration of hands-on training, and competition-driven learning. Formal HPC courses will enhance readiness for careers in computational science and AI, and foster cross-disciplinary collaboration.
Workshop
Livestreamed
Recorded
TP
W
DescriptionFleCSI is a compile-time-configurable programming model designed to support performance-portable parallel application development. In the programming model provided by FleCSI, tasks execute in parallel according to data dependencies specified by a directed acyclic graph. FleCSI natively supports distributed data structures and data access patterns commonly used by computational-science methods.
Without any code modifications, an application built using FleCSI can target one of three communication backends: MPI, Legion, and most recently, HPX. This paper presents the design and implementation of the HPX backend. Specifically, it shows how FleCSI's Legion-like programming model can be mapped efficiently onto HPX's semantically different programming model. The paper explains how FleCSI's task graph can be implemented in terms of HPX futures and introduces a novel optimization for minimizing the number of (costly) communicators HPX needs to create for inter-task communication. An empirical performance study quantifies the benefit of this optimization on two physics applications.
Exhibitor Forum
Quantum & Other Post Moore Computing Technologies
Livestreamed
Recorded
TP
XO/EX
DescriptionExplore how Dell Technologies is advancing hybrid quantum-classical computing to enhance predictive machine learning (ML). This session delves into proof-of-concept (POC) middleware software designed to integrate quantum devices with classical systems, boosting ML accuracy and throughput. Attendees will gain practical insights into leveraging hybrid environments to tackle complex data processing and model training challenges. Discover how these innovations pave the way for future interoperability and redefine the boundaries of ML capabilities. Join us to uncover the potential of hybrid solutions in transforming the ML landscape.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionReproducibility is a challenge in HPC and research. HPC experiments are resource-intensive and depend on complex software environments. Snapshotting addresses this issue by capturing the complete state of a system in a single step, allowing researchers to automatically rebuild and restore identical environments. However, concerns remain about snapshot efficiency and usability. For snapshotting to be practical in HPC research, the tools need to be straightforward to use and perform quickly on large bare-metal environments. We therefore improved the usability and evaluated the performance of cc-snapshot, a snapshotting tool on the Chameleon Cloud testbed. Usability enhancements included new command line options, modular code, and automated tests. To optimize performance, we benchmarked alternative image formats and compression algorithms. The results show that zstd delivered up to 80% faster compression during snapshot creation compared to zlib. These findings demonstrate that snapshotting can be a practical and effective tool to support reproducibility in HPC experiments.
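The zstd-versus-zlib comparison can be reproduced in miniature with the sketch below, which times both compressors on a synthetic buffer. It is illustrative only: the compression levels, buffer contents, and sizes are assumptions, not the benchmark actually used for cc-snapshot.

```python
import os
import time
import zlib
import zstandard

# A synthetic stand-in for a chunk of a disk image (partly random, partly zeros).
data = os.urandom(16 << 20) + b"\x00" * (16 << 20)

for name, compress in [
    ("zlib (level 6)", lambda d: zlib.compress(d, 6)),
    ("zstd (level 3)", lambda d: zstandard.ZstdCompressor(level=3).compress(d)),
]:
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.2f}s, ratio {len(data) / len(out):.2f}x")
```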
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionFungi are everywhere, in the air, in the water, and in the soil, supporting and mediating between the living and non-living, beneath the floor of the forest.
Entanglement, inspired by the motif of the forest and its fungal networks, invites spectators into an environment where visible and invisible worlds are interconnected and symbiotic. Through the entanglement of microcosmic and simultaneous connections, it offers a sensory opportunity for contemplation and inspiration regarding ways of connecting with the world beyond ourselves, and a vision of a diverse living world in ecosystemic balance. To borrow a phrase from Ursula Le Guin: the word for world is forest.
The artwork combines procedural modeling, generative AI, and dynamic simulation of a mycorrhizal network of vast numbers of living organisms. It is grounded in an imperative of drawing attention to the importance of nonconscious cognition and interspecies communication in biological and machine senses, as a reminder of the essential broader world around us.
Exhibitor Forum
Quantum & Other Post Moore Computing Technologies
Livestreamed
Recorded
TP
XO/EX
DescriptionAluminum cold plates have several limitations, two of the most significant being the difficulty of welding and maintenance, and long-term reliability challenges. Envicool and Intel collaborated on research into the long-term reliability requirements and verification methods for aluminum cold plates, conducting two rounds of accelerated tests totaling over 200 days and using higher pressure, higher flow rates, and pure water to verify the performance and reliability of the cold plates. The results showed that the cold plate must be highly compatible with the coolant: pure water alone cannot guarantee that the cold plate will not corrode, and corrosion is prone to occur at welding joints and corners due to the influence of turbulence.
Workshop
Livestreamed
Recorded
TP
W
DescriptionLarge-scale earthquake simulations produce massive, high-fidelity datasets essential for seismic risk analysis; however, their volume and complexity create a barrier for researchers from various backgrounds who lack specialized knowledge and programming skills. To address this challenge, we leveraged Large Language Models (LLMs) to develop the EQSIM Agent, a conversational AI designed for the interactive exploration of large-scale earthquake simulation data. The agent allows users to query data using natural language, receiving results as text, images, videos, and maps. Beyond standard querying and visualization, it introduces novel features like a vision-based waveform similarity search and a Retrieval-Augmented Generation system that answers questions with facts from relevant publications. This paper details the agent’s implementation and evaluates the challenges of using LLMs in a scientific context. We also provide a practical analysis of various LLMs, evaluating their performance, tool-calling reliability, and cost, to guide the development of future scientific AI agents.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe management of data-intensive workflows in globally distributed computing systems, such as those used in high-energy physics, presents significant challenges in scalability, resource allocation, and fault tolerance. Workflow Management Systems (WMS) provide a critical framework for addressing these challenges by automating, monitoring, and optimizing the execution of complex computational tasks across heterogeneous resources. The Production and Distributed Analysis (PanDA) system is a sophisticated WMS engineered to handle the immense data processing and analysis demands of ATLAS, operating on the Worldwide LHC Computing Grid (WLCG), one of the largest distributed computing infrastructures in the world. However, errors frequently occur when distributing and managing workloads on such a globally distributed computing grid, and they take various forms across different sites. Understanding these errors through analysis is the first step toward mitigating them. In this work, we analyze the errors that occur across the globally distributed grid as a stepping stone toward designing effective mitigation strategies.
Birds of a Feather
Ethics & Societal Impact of HPC
Livestreamed
Recorded
TP
XO/EX
DescriptionThe implications of HPC technology for society and the environment prompt us, as a community, to discuss and understand our direct and indirect impacts. This BoF is highly interactive and aims to facilitate discussion within the community about relating ethical behavior and societal norms to the design of HPC solutions and autonomous/intelligent systems, for example, to ensure that these systems do not intentionally perpetuate global inequality. By furthering this dialogue, we can ensure that the HPC community advances its commitment to technology for the benefit of humanity as a whole.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionVisualization and processing of extremely large-scale networks is a challenging task due to unique characteristics such as load imbalance, lack of locality, and access irregularity. Considering the possibilities offered by recent supercomputing power, we revised current algorithms suitable for the visualization of large-scale networks and were able to visualize networks ranging in size from hundreds of thousands to millions of nodes. The experiments were performed on the Karolina supercomputer. We visualized the European Open Web Index produced by the OpenWebSearch.eu project. The complexity of the problem is discussed in the context of the performance and computational power needed to visualize such extreme-scale graphs.
Workshop
Livestreamed
Recorded
TP
W
DescriptionSingle-cell RNA sequencing (scRNA-seq) now profiles millions of cells in a single study, creating major computational demands. GPU-accelerated pipelines, built on frameworks like NVIDIA RAPIDS and CuPy, promise large runtime reductions, but questions remain about reproducibility compared to CPU workflows. We benchmarked matched CPU and GPU pipelines on a 1.3-million-cell dataset and downsampled subsets. GPUs achieved over 10× faster runtimes but at the cost of biological fidelity. Clustering concordance between CPU and GPU was moderate (Adjusted Rand Index ~0.50) across all sample sizes. Importantly, fidelity depended more on platform-specific algorithms and parameter choices than on dataset size. Results also showed that "ground truth" cluster definitions were relative to the platform used. These findings indicate that while GPUs enable scalable, efficient scRNA-seq analysis, researchers must consider the choice of computational platform as a key factor influencing biological interpretation.
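The clustering concordance metric mentioned here, the Adjusted Rand Index, is straightforward to compute once matched cluster labels from both pipelines are available. A minimal sketch with hypothetical label vectors is shown below; it is not the study's pipeline, only the concordance calculation.

```python
from sklearn.metrics import adjusted_rand_score

# Cluster labels produced by the CPU and GPU pipelines for the same cells
# (order must match); these two small label vectors are hypothetical.
labels_cpu = [0, 0, 1, 1, 2, 2, 2, 3]
labels_gpu = [1, 1, 0, 0, 2, 2, 3, 3]

# ARI is invariant to label permutations; 1.0 means identical partitions.
ari = adjusted_rand_score(labels_cpu, labels_gpu)
print(f"CPU/GPU clustering concordance (ARI): {ari:.2f}")
```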
Workshop
Livestreamed
Recorded
TP
W
DescriptionScientific computing centers increasingly face workloads with diverse urgency requirements, driven by applications that demand rapid or even immediate execution. Appropriately configured scheduling policies can significantly improve both user satisfaction and overall cluster utilization. In this work, we present a systematic analysis of scheduler configurations under scenarios where a fraction of jobs have urgent computing needs. We evaluate multiple job scheduling simulators, develop a lightweight job-submission emulation framework, and create tools to analyze and visualize the resulting scheduling data. Our study identifies key trade-offs between responsiveness, fairness, and efficiency, and offers a set of practical scheduling configurations (particularly for Slurm) that can be tailored to HPC environments supporting mixed-urgency workloads.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe escalating complexity of applications and services encourages a shift towards higher-level data processing pipelines that integrate both Cloud-native and HPC steps into the same workflow. Cloud providers and HPC centers typically provide both execution platforms on separate resources. In this paper we explore a more practical design that enables running unmodified Cloud-native workloads directly on the main HPC cluster, avoiding resource partitioning and retaining the HPC center's existing job management and accounting policies.
Best Poster Presentations (Research, ACM SRC Grad/Undergrad)
Research & ACM SRC Posters
TP
DescriptionTransmitting point cloud data is vital for applications like autonomous vehicle navigation, especially for compute-limited vehicles. LiDAR data can easily grow to gigabytes or terabytes uncompressed, making data transmission costly. While recent research has advanced point cloud compression, most work evaluates performance using object detection and on urban datasets like SemanticKITTI [1] or nuScenes [2]. These do not exactly reflect performance on off-road outdoor data, which is typically noisier and less structured. We benchmark three LiDAR compressors, RENO (neural-based) [8], TMC13 (rules-based baseline) [5], and LCP [9] (scientific particle compressor untested in this domain) on the GOOSE dataset [6]. We trained two 3D semantic segmentation models on this decompressed LiDAR data to observe their downstream segmentation performance. Ultimately, we find RENO to outperform TMC13 and LCP, with LCP providing competitive results to RENO and TMC13 in compression quality and speeds.
Workshop
Livestreamed
Recorded
TP
W
DescriptionHigh-Performance Computing job scheduling involves balancing conflicting objectives such as minimizing makespan, reducing wait times, optimizing resource use, and ensuring fairness. Heuristic-based methods (e.g., FJFS and SJF) and intensive optimization techniques often lack adaptability to dynamic workloads and cannot optimize multiple objectives simultaneously in HPC systems. We propose a novel LLM-based scheduler using a ReAct-style framework, enabling iterative, interpretable decision-making. It incorporates a scratchpad memory to track scheduling history and refine decisions via natural language feedback, while a constraint enforcement module ensures feasibility and safety. A sketch of this loop follows the evaluation summary below.
We evaluate our approach using OpenAI's O4-Mini and Anthropic's Claude 3.7 across seven workload scenarios, including heterogeneous mixes and bursty patterns. The comparison reveals that LLM-based scheduling effectively balances multiple objectives while offering transparent reasoning through natural language traces. The method excels in constraint satisfaction and adapts to diverse workloads without domain-specific training. However, a trade-off between reasoning quality and computational overhead challenges real-time deployment.
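A minimal sketch of a ReAct-style scheduling loop with a scratchpad and a constraint check. Everything here is a placeholder: `llm_propose` stands in for a real LLM call, and the action format and feasibility rule are invented for illustration, not taken from the described system.

```python
def llm_propose(prompt: str) -> str:
    """Placeholder for a call to a reasoning LLM (e.g., via an API client)."""
    return "schedule job=2 on nodes=4  # shortest job first to reduce wait time"

def feasible(action: str, free_nodes: int) -> bool:
    """Constraint-enforcement stub: reject actions that oversubscribe the system."""
    requested = int(action.split("nodes=")[1].split()[0])
    return requested <= free_nodes

scratchpad = []                      # natural-language history of past decisions
free_nodes = 8
for step in range(3):                # one ReAct iteration per scheduling decision
    prompt = "\n".join(scratchpad) + f"\nFree nodes: {free_nodes}. Next action?"
    action = llm_propose(prompt)
    if not feasible(action, free_nodes):
        scratchpad.append(f"step {step}: rejected infeasible action '{action}'")
        continue
    scratchpad.append(f"step {step}: took action '{action}'")

print("\n".join(scratchpad))
```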
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionExascale systems like Aurora push performance bounds, but they draw tens of megawatts, making precise, low-overhead power monitoring essential for efficiency and cost control. We present an ongoing evaluation of the two primary power-monitoring interfaces on Aurora, quantifying accuracy and temporal granularity from a single node to the system level. Our contribution is a reproducible methodology, combining HPC benchmarks, mini-apps, and spectral analysis, to determine when each tool is trustworthy and how to configure sampling. Preliminary results characterize sampling limits and overhead trade-offs. Complete results are in progress, and we seek to determine whether our current methods of power monitoring are suitable for exascale systems. In the poster, we will share the evaluation framework, early comparative results, and actionable best practices for exascale power studies.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionAdvances in artificial intelligence (AI) and machine learning (ML) are reshaping scientific computing and influencing programming practices on high performance computing (HPC) systems. We analyze Python library usage on the Polaris supercomputer to understand adoption patterns in modeling, simulation, data analysis, and ML. Using XALT, a runtime monitoring tool, and PySnooper, a lightweight tracer, we correlate library imports with job scheduler data and scientific domains. Results are presented through visualizations and an interactive dashboard, enabling scientists to track usage trends, identify performance impacts from non-optimized environments, and inform improvements to Argonne’s default Python stack. This work provides actionable guidance for software provisioning, user support, and infrastructure planning in the era of AI-driven science.
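The general idea of correlating jobs with the Python libraries they load can be sketched with a small exit-time report of imported top-level packages. This is an illustrative stand-in, not how XALT or PySnooper collect their data; the logging destination and any site-wide hook placement are assumptions.

```python
import atexit
import sys

def report_imported_packages():
    """At interpreter exit, log the top-level packages a job actually imported.

    A site-wide sitecustomize.py could register this and append the record to a
    shared log for later correlation with scheduler data.
    """
    top_level = sorted({name.split(".")[0] for name in sys.modules if not name.startswith("_")})
    print("imported packages:", ", ".join(top_level))

atexit.register(report_imported_packages)

# Example workload: these imports will show up in the exit report.
import json   # noqa: F401
import math   # noqa: F401
```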
Workshop
Livestreamed
Recorded
TP
W
DescriptionHigh-Performance Computing (HPC) systems are used for a variety of applications. An important requirement for some of them is security and protection of data, especially when dealing with highly sensitive data such as the human genome. In order to facilitate the processing of actual patient data on HPC systems, it is imperative to implement robust protective measures. In this paper we analyze the performance of two micro-benchmarks and the BWA-MEM2 algorithm as a genome sequencing workflow. Our evaluation matrix includes an SMP node, a VM with SEV and SME enabled, and a VM with only SME enabled, assessed across varying thread counts and file system configurations. Our analysis showed that memory bandwidth appears to be the limiting factor, as bandwidth can drop to approximately 50%. Overall, we observed that the overhead caused by encryption for the genome alignment workload is acceptable, at 10.4% for SME and just over 20.9% for SEV+SME.
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
DescriptionThis talk will present early performance numbers from using DAOS for a checkpoint and restart mechanism in a classic HPC application: PALM, a large-eddy simulation code written in Fortran. Different methods are supported: Fortran I/O with a file-per-process scheme and two MPI-IO-based methods that use a single shared file, where one method can aggregate I/O in a single process per node. The presentation will reveal early performance numbers for the Fortran I/O and the two MPI-IO variants in PALM using 9,216 MPI processes, with both MPICH's native DAOS support and DAOS containers mounted in the Linux filesystem. The DAOS system used in the study provides approximately 0.5 PB of storage using Optane memory technology distributed across 19 storage nodes and is connected to the HPC system via an Omni-Path interconnect. Finally, the numbers are compared to Lustre and GPFS filesystems in production.
Workshop
Livestreamed
Recorded
TP
W
DescriptionTest-time compute scaling has demonstrated the ability to improve the performance of reasoning language models by generating longer chain-of-thought (CoT) sequences. However, this increase in performance comes with a significant increase in computation cost. In this work, we investigate two compute constraint strategies: (1) reasoning length constraint and (2) model quantization, and study their impact on the safety performance of reasoning models. Specifically, we explore two approaches to apply compute constraints to reasoning models: (1) fine-tuning reasoning models using a length-controlled policy optimization (LCPO) based reinforcement learning method to satisfy a user-defined CoT reasoning length, and (2) applying quantization to maximize the generation of CoT sequences within a user-defined compute constraint. Furthermore, we study the trade-off between the computational efficiency and the safety of the model.
Workshop
Livestreamed
Recorded
TP
W
DescriptionPodman is a modern and flexible container tool, but it lacks several key features needed for high-performance computing (HPC). The Sarus project helps bridge this gap by integrating Podman into a modular, open-source solution that brings mainstream container technology into HPC environments.
This presentation shows how Sarus provides task-specific components to make Podman suitable for operating at scale, including: configuration templates tailored for specific clusters, a SLURM plugin for easy workload manager integration, OCI hooks and CDI specs to plug in compute and network resources, and a utility to support Squashfs-based image stores on parallel filesystems.
Together, these components augment Podman into a cohesive solution optimized for HPC use cases. We'll also share test results from the CSCS Alps infrastructure, showing how Sarus supports efficient and transparent containerized job submissions.
By building on familiar tools like Podman, Sarus offers a capable and HPC-ready container stack for today’s supercomputing needs.
Workshop
Livestreamed
Recorded
TP
W
DescriptionMPI is currently the de facto standard for programming HPC systems and parallel applications. Development of the MPI standard continues in earnest, with version 4.1 released within the past year and features for version 5.0 under active discussion. The aim of this workshop is to bring together researchers and developers to present and discuss innovative algorithms and concepts within the Message Passing programming model, and to create a forum for open discussions on the future of the Message Passing Interface (MPI) in the post-exascale era. Possible workshop topics include, but are not limited to, algorithms for collective operations, MPI optimization for artificial intelligence and machine learning workloads, data-centric models, scheduling, fault tolerance, MPI optimization in heterogeneous systems, interoperability of MPI with other programming models (e.g., PGAS), integration of task-parallel models in MPI, the role of MPI in "smart" networks, and the use of MPI in large-scale simulations.
Workshop
Partially Livestreamed
Partially Recorded
TP
W
DescriptionThe increasing convergence of AI and HPC, combined with the rapid evolution of heterogeneous computing architectures, is transforming modern supercomputing. The emergence of specialized accelerators, including GPUs, TPUs, IPUs, neuromorphic chips, quantum processors, and FPGAs, has introduced new challenges in performance portability, system optimization, and software adaptability. In this exascale and extreme heterogeneity era, effectively exploiting diverse hardware architectures requires AI-driven approaches, novel programming models, and intelligent workload management. This workshop will bring together experts from academia, industry, and national laboratories to explore AI-HPC convergence, heterogeneous system architectures, energy-efficient computing, and AI-assisted performance optimization. By fostering interdisciplinary discussions and collaborations, the workshop aims to advance scalable, efficient, and sustainable computing. We invite contributions on topics including heterogeneous hardware, AI-driven HPC techniques, memory architectures, and programming models, with a focus on shaping the future of AI-driven scientific discovery and high performance computing.
Workshop
Education & Workforce Development
Livestreamed
Recorded
TP
W
DescriptionThis paper describes initial efforts to expand the CyberAmbassadors program (NSF Award #1730137) to include training on mentoring skills for the cyberinfrastructure (CI) workforce. The new curriculum will help CI professionals at all levels develop the self-assessment, planning, and networking skills necessary to build strong mentoring relationships that can help them navigate emerging CI career paths. The mentoring curriculum will build on the communications, teamwork and leadership skills training from the existing CyberAmbassadors program, and will offer specialized practice in key career development activities like offering constructive feedback, fostering a growth mindset, developing a mentoring network, and building transferable skills. The new curriculum will also integrate research about the benefits of culturally-aware mentoring, which seeks to provide broad support for mentees with diverse identities and experiences. Once finalized, the curriculum will be distributed through a national network of volunteer facilitators who provide trainings for their own campuses, companies and communities.
Workshop
Education & Workforce Development
Livestreamed
Recorded
TP
W
DescriptionDespite its growing importance in physical sciences, research computing with cluster resources remains difficult to access and sustain, especially in long-term, multi-institutional projects. Challenges include site-specific workflows, evolving software stacks, and rapid changes in hardware post-Generative AI. The Nab collaboration, conducting a precision test of the Standard Model at Oak Ridge National Laboratory, hosted a hackathon to address these issues. Over four half-days, ~25 participants engaged in training and collaborative problem-solving across four priority areas, supported by mentors and structured sessions. Post-event surveys showed improved computational knowledge and strong interest in recurring events. This paper shares insights from organizing the hackathon and discusses scalable strategies for computational training in experimental research.
Workshop
Livestreamed
Recorded
TP
W
DescriptionGenerative Artificial Intelligence (GenAI) applications are built from specialized components—inference servers, object storage, vector and graph databases, and user interfaces—interconnected via web-based APIs. While these components are often containerized and deployed in cloud environments, such capabilities are still emerging at High-Performance Computing (HPC) centers. In this paper, we share our experience deploying GenAI workloads within an established HPC center, discussing the integration of HPC and cloud computing environments. We describe our converged computing architecture that integrates HPC and Kubernetes platforms running containerized GenAI workloads, helping with reproducibility. A case study illustrates the deployment of the Llama Large Language Model (LLM) using a containerized inference server (vLLM) across both Kubernetes and HPC platforms using multiple container runtimes. Our experience highlights practical considerations and opportunities for the HPC container community, guiding future research and tool development.
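Once a containerized vLLM server is running, clients interact with it through its OpenAI-compatible HTTP API. The sketch below shows such a request; the host, port, and model name are deployment-specific assumptions and not details from the case study.

```python
import requests

# Host, port, and model name are placeholders; vLLM's OpenAI-compatible server
# typically exposes /v1/chat/completions.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello from the cluster."}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```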
Workshop
Livestreamed
Recorded
TP
W
DescriptionIn 2005, the U.S. Department of Energy created the Oak Ridge Leadership Computing Facility (OLCF) to deploy one-of-a-kind leadership computing resources for researchers across academia, industry, and government. These supercomputers are among the largest in the world and often diverge from conventional architectures, demanding flexible and thorough testing to ensure their functionality and performance. To support this testing, OLCF developed the OLCF Test Harness (OTH). The OTH has adapted through more than 20 years of novel architectures. Unique challenges with each system provide opportunities for continuously improving the OTH. The OTH recently released version 3.0, which implements support for logging test data to InfluxDB, and version 3.1, which extends support further. Among three well-known testing frameworks surveyed, only one documents features that could be leveraged for database logging. In this work, we describe the OTH and the database support within, and discuss the challenges, successes, and goals for the OTH.
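Logging test results to InfluxDB generally amounts to writing tagged measurement points per test run. The sketch below uses the influxdb-client Python package with an invented measurement schema; it illustrates the pattern only and is not the schema or client the OLCF Test Harness actually uses.

```python
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Connection details and the measurement/tag/field names below are illustrative.
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

point = (
    Point("harness_test")          # hypothetical measurement name
    .tag("system", "frontier")
    .tag("test", "hello_mpi")
    .field("status", "passed")
    .field("runtime_s", 42.7)
)
write_api.write(bucket="acceptance-tests", record=point)
client.close()
```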
Workshop
Livestreamed
Recorded
TP
W
DescriptionIn this paper we explore a stencil application written in SYCL on both CPU and FPGA architectures.
We prepare two versions of the application, using a structured grid and an unstructured grid, and then optimise these implementations for CPU and FPGA architectures, with a focus on maintaining portability between both.
We benchmark the application on an AMD CPU and an Intel Stratix 10 FPGA, seeking to answer whether we can target FPGAs productively from a single-source code base.
Our findings indicate that for low arithmetic intensity kernels FPGA performance is lacking compared to CPU performance, suggesting that FPGA architectures may be unsuitable for such kernels, or that significant platform-specific optimisations may be required to reduce the performance gap, at the expense of developer productivity.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionUsing low-precision cores to accelerate PDE-based simulations with sparse or small matrices is often challenging due to the frequent data conversion between high- and low-precision variables and because the required precision varies in time and space with the heterogeneity of the target problem. As an example of accelerating such PDE-based simulations, we develop an integer-based variable-precision computing method with low data-conversion costs for low-order explicit finite-element wave propagation simulations. Here, the precision level used for solving the problem is chosen locally to attain the required simulation accuracy and is accelerated using INT8 Tensor Cores. This leads to a 3.3-fold speedup over a baseline FP64 CUDA-core-based implementation with equivalent simulation accuracy, with 87% weak-scaling efficiency up to 256 compute nodes of the GH200-based Miyabi supercomputer. These ideas are expected to be useful for accelerating other PDE-based problems with sparse or small matrices on computer architectures with high-performance, low-precision cores.
Paper
Applications
Performance Measurement, Modeling, & Tools
State of the Practice
Livestreamed
Recorded
TP
DescriptionThe exponential growth of large language model (LLM) training demands in HPC systems has exposed critical reliability challenges, particularly from transient faults. Unlike resilience studies in conventional DNN inference, the massive parameter scale and iterative updates in LLM training trigger more complex failure patterns. To address these challenges, we introduce LLMFI, a new fault injection tool, and reveal six distinct failure behaviors through 300K+ fault injection experiments (exceeding 5K GPU node-hours). Our key insight is that, while most injected faults are eventually masked by the training iteration mechanism, a critical subset leads to catastrophic failures or performance degradation. Further, we propose LLMFT, a novel machine-learning-based fault tolerance framework that implements closed-loop error control via heuristic feature extraction, fault detector, and dual recovery mechanisms. Extensive evaluation demonstrates that LLMFT achieves an average of 97.61% F1-score in fault detection with only 0.01%–0.05% additional GPU memory overhead, effectively mitigating LLM training failures.
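A common transient-fault model in such studies is a single bit flip in a parameter value. The sketch below shows a minimal, generic bit-flip injector for a float32 PyTorch parameter; it is not LLMFI's injection mechanism, and the chosen layer, index, and bit position are arbitrary examples.

```python
import struct
import torch

def flip_bit_(param: torch.Tensor, flat_index: int, bit: int) -> None:
    """Flip one bit of a float32 parameter in place (a simple transient-fault model)."""
    flat = param.data.view(-1)
    bits = struct.unpack("<I", struct.pack("<f", float(flat[flat_index])))[0]
    flat[flat_index] = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))[0]

layer = torch.nn.Linear(8, 8)
before = float(layer.weight.data.view(-1)[3])
flip_bit_(layer.weight, flat_index=3, bit=30)   # flip a high-order exponent bit
print(before, "->", float(layer.weight.data.view(-1)[3]))
```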
Workshop
Livestreamed
Recorded
TP
W
DescriptionManaging operating system deployments across HPC clusters remains challenging. This presentation examines bootc, a technology that packages operating systems as OCI containers to simplify cluster management. We'll explore how bootc enables atomic OS updates, rollbacks, and version control similar to container workflows.
The talk covers bootc's underlying technologies, particularly composefs for efficient storage and deployment. We'll discuss benefits for configuration management and reproducibility in HPC environments, while addressing current limitations specific to high-performance computing workloads. Practical examples will demonstrate bootc deployment in test environments compared to traditional image-based provisioning methods.
This session provides systems professionals with an initial evaluation of whether this emerging technology could address cluster management needs and improve operational workflows while modernizing HPC infrastructure management.
Workshop
Livestreamed
Recorded
TP
W
DescriptionVector databases have rapidly grown in popularity, enabling efficient similarity search over data such as text, images, and video. They now play a central role in modern AI workflows, aiding large language models by grounding model outputs in external literature through retrieval-augmented generation. Despite their importance, little is known about the performance characteristics of vector databases in high-performance computing (HPC) systems that drive large-scale science. This work presents an empirical study of distributed vector database performance on the Polaris supercomputer in the Argonne Leadership Computing Facility. We construct a realistic biological-text workload from BV-BRC and generate embeddings from the peS2o corpus using Qwen3-Embedding-4B. We select Qdrant to evaluate insertion, index construction, and query latency with up to 32 workers. Informed by practical lessons from our experience, this work takes a first step toward characterizing vector database performance on HPC platforms to guide future research and optimization.
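The insertion and query measurements described here follow the usual vector-database client pattern: create a collection, upsert points, then time similarity queries. The sketch below uses the qdrant-client Python package with random vectors as stand-ins for real embeddings; collection names, sizes, and the exact client methods (some are deprecated in newer client versions) are assumptions, not the study's benchmark code.

```python
import time
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

dim, n = 256, 2000                          # embedding size and corpus size are placeholders
client = QdrantClient(host="localhost", port=6333)
client.recreate_collection("abstracts", vectors_config=VectorParams(size=dim, distance=Distance.COSINE))

rng = np.random.default_rng(0)
vectors = rng.standard_normal((n, dim)).astype(np.float32)

t0 = time.perf_counter()
client.upsert("abstracts", points=[
    PointStruct(id=i, vector=vectors[i].tolist(), payload={"doc_id": i}) for i in range(n)
])
print(f"insert: {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
hits = client.search(collection_name="abstracts", query_vector=vectors[0].tolist(), limit=5)
print(f"query: {(time.perf_counter() - t0) * 1000:.1f} ms, top hit id {hits[0].id}")
```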
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
DescriptionThe computational and memory demands of DNN training have grown with the size of AI models in recent years. To address these demands, popular accelerators (i.e., GPUs) must find novel ways to reduce memory utilization since their memory capacity is on the scale of tens of GB. Other companies have unveiled novel AI accelerators, generally with high on-chip memory capacity and varying architectures. For these accelerators, frequent on-chip/off-chip memory transactions can bottleneck performance. Lossy compression is a promising tool to reduce data footprint for efficient DNN training. Our work studies lossy compressors targeting training data and activation data, and how to efficiently run compression and GNN training on novel AI accelerators.
Our contributions are: 1) a novel, portable training data compressor, called DCT+Chop, for emerging AI accelerators; 2) an activation compression framework tailored to the Graphcore Intelligence Processing Unit (IPU); 3) a GPU-based design for a compressor/optimizer-agnostic lossy activation compression framework, called LAT-ACT; and 4) an exploration in training graph neural networks (GNNs) on the Cerebras CS-2. DCT+Chop and IPU activation compression have yielded strong results, where DCT+Chop can compress training data up to 16X with a throughput on the scale of tens of GB/s. IPU activation compression can speedup single IPU training up to 3.5X and multi-IPU training by several orders of magnitude. Preliminary results suggest LAT-ACT yields compression ratios of 4-12X with limited accuracy degradation. GNN training on the CS-2 can be implemented with PyTorch APIs, but further exploration is needed for supporting sparse operators common to GNNs.
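The name DCT+Chop suggests a transform-then-truncate scheme; a generic sketch of that idea on a 2D block is shown below. It illustrates only the general DCT-then-chop principle, with an assumed keep-fraction parameter, and is not the authors' compressor.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_chop(block: np.ndarray, keep_fraction: float = 0.25) -> np.ndarray:
    """Lossy-compress a 2D block by a DCT followed by chopping high-frequency coefficients.

    Returns the reconstruction; a real compressor would store only the kept coefficients.
    """
    coeffs = dctn(block, norm="ortho")
    h = max(1, int(block.shape[0] * keep_fraction))
    w = max(1, int(block.shape[1] * keep_fraction))
    chopped = np.zeros_like(coeffs)
    chopped[:h, :w] = coeffs[:h, :w]          # keep only the low-frequency corner
    return idctn(chopped, norm="ortho")

rng = np.random.default_rng(0)
img = rng.random((32, 32))
rec = dct_chop(img)
print("relative error:", np.linalg.norm(img - rec) / np.linalg.norm(img))
```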
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionHigh synchronization overhead in frameworks like GNU OpenMP impedes fine-grained task parallelism on many-core architectures. We introduce three advances to GNU OpenMP: a lock-less concurrent queue (XQueue), a scalable distributed tree barrier, and two NUMA-aware, lock-less load-balancing strategies.
Evaluated with Barcelona OpenMP Task Suite (BOTS) benchmarks, our XQueue and tree barrier improve performance by up to 1522.8× over the original GNU OpenMP. The load-balancing strategies provide an additional performance improvement of up to 4×.
We further apply these techniques to the TaskFlow runtime, demonstrating performance and scalability gains in selected applications while also analyzing the inherent limitations of the lock-less approach on x86 architectures.
Workshop
Debugging & Correctness Tools
HPC Software & Runtime Systems
Livestreamed
Recorded
TP
W
DescriptionThe increasing scale of deep neural networks has heightened the need to optimize training and inference efficiency. Reduced-precision computation has emerged as a promising approach to improve memory usage, energy efficiency, and computational throughput. While formats such as FP16 and FP8 are increasingly supported by modern hardware for tensor operations, ultra-low-precision formats like FP4 and FP2 remain largely unexplored for non-linear activation functions, which play a critical role in model convergence and stability. In this work, we introduce a PyTorch framework to emulate and train models where activation functions are computed using FP8, FP6, FP4, FP3, and FP2 representations throughout the entire training process. Through comprehensive experiments across multiple models and datasets, we evaluate the feasibility of training neural networks when activation functions operate at low precision and identify those that maintain accuracy despite the noise introduced by quantization.
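Low-precision emulation of an activation typically means computing it at full precision and then snapping the output onto a reduced set of representable values. The sketch below fake-quantizes a GELU output onto a uniform grid; real FP4/FP3/FP2 formats are floating point with non-uniform levels, so this is an illustration of the wrapping pattern, not the paper's emulation framework.

```python
import torch
from torch import nn

class FakeQuantGELU(nn.Module):
    """GELU whose output is snapped to a small uniform grid to emulate very low precision."""

    def __init__(self, bits: int = 4):
        super().__init__()
        self.levels = 2 ** bits - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.nn.functional.gelu(x)
        scale = y.abs().max().clamp(min=1e-8)
        # Quantize to `levels` steps spanning [-scale, scale], then de-quantize.
        return torch.round((y / scale) * (self.levels / 2)) / (self.levels / 2) * scale

model = nn.Sequential(nn.Linear(16, 32), FakeQuantGELU(bits=4), nn.Linear(32, 4))
out = model(torch.randn(8, 16))
print(out.shape)
```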
Workshop
Debugging & Correctness Tools
HPC Software & Runtime Systems
Livestreamed
Recorded
TP
W
DescriptionMPI correctness benchmarks are used to evaluate the implementation quality of MPI correctness tools on a standardized set of tests.
However, existing correctness benchmarks are limited to C, neglecting support for Fortran, the only other language which the MPI standard supports.
Consequently, past evaluations of correctness tools were focused solely on C although some of them support error-checking of Fortran MPI codes.
To alleviate this, we port the test generation logic of the most recently introduced MPI correctness benchmark MPI-BugBench to Fortran.
We explore language-specific porting challenges and perform a comparative accuracy evaluation of the dynamic MPI correctness tool MUST on both C and Fortran.
Our results show that MUST's accuracy is largely consistent across languages, with a notable exception in type checking due to a required software dependency not supporting Fortran.
Additionally, we uncovered bugs in both the Open MPI Fortran bindings and the MPI-BugBench test case generator.
Workshop
Performance Evaluation, Scalability, & Portability
Livestreamed
Recorded
TP
W
DescriptionThe prevalence of heterogeneous computing systems -- comprising both CPUs and GPUs -- has led to the adoption of performance portability programming models, such as RAJA. These models allow developers to write portable code that compiles ahead-of-time (AOT), unmodified for different backends, thus improving productivity and maintainability.
In this work, we explore the integration of just-in-time (JIT) optimization into portable programming models. Our work aims to improve performance with JIT optimization, without sacrificing portability or developer productivity.
We extend Proteus to support indirect kernel launching through RAJA's abstractions. Our evaluation with the RAJAPerf benchmark suite demonstrates promising speedups for both AMD and NVIDIA GPUs, with no slowdowns recorded for either backend. Specifically, we record speedups from 1.2× up to 23× on AMD MI250X and speedups from 1.1× up to 15× on NVIDIA V100, while preserving the performance portability and ease-of-use benefits of RAJA.
Workshop
Livestreamed
Recorded
TP
W
DescriptionHigh performance computing (HPC) applications are sensitive to network variability, yet existing tracing tools lack insight into low level network behavior.
Modern network interface controllers (NICs), such as HPE's Slingshot-11 Cassini, provide detailed hardware counters that can reveal conditions like congestion and retries, but remain underused due to limited integration with tracing frameworks.
We extend the THAPI framework with a sampling plugin for Cassini's CXI interface, periodically collecting NIC counters and integrating them into HPC trace timelines via the iprof tool.
Data is visualized in Perfetto, enabling correlation between network telemetry and application events.
Our approach imposes negligible overhead at typical sampling rates and exposes previously hidden performance factors, such as congestion delays and load imbalances.
Case studies on point-to-point and collective patterns demonstrate new diagnostic capabilities.
Contributions include the plugin's design, integration into a state-of-the-art tracing toolchain, and evaluation highlighting opportunities for improved HPC communication performance.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe release of the C++26 execution control library (senders and receivers) provides a uniform interface for coordinating asynchronous work on heterogeneous backends.
That said, the standardization of senders and receivers is just a starting point toward structured parallelism and composable heterogeneous parallel programming.
To evaluate these capabilities, we present a sender-based interface for the Qthreads user-level threading library.
It shows that the execution control idioms apply even to cases where work is discovered and created dynamically and the degree of concurrency vastly surpasses available hardware resources.
Here we provide an introduction to the C++26 standard execution control library and show how it can be used with dynamic runtime systems.
We demonstrate that the new standard interface is extensible and motivate future work on similar runtime integration efforts.
We show that this interoperability layer incurs low overhead and provides additional optimization opportunities, even when used in a fine-grained parallel setting.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionPython is widely used in scientific computing for prototyping, but its performance and memory overhead limit its suitability for production in high-performance computing (HPC) environments. Pyccel addresses this by translating Python into human-readable Fortran or C, while retaining Python interoperability. Recent developments extend this approach: the pyccel-wrap tool generates Python bindings for existing Fortran/C libraries by mapping functions and classes to Python objects defined in stub files, and pyccel-make enables project-wide builds with CMake or Meson. Together, these features support bidirectional exchange between Python prototypes and low-level implementations. We demonstrate this with PyGyro, a drift-kinetic plasma simulation code. By replacing SciPy’s sparse matrix solver with SeLaLib’s optimized Fortran implementation via pyccel-wrap, we reduce spline solve time by 30% and enable translation of the surrounding loops, removing Python overhead and enabling OpenMP usage. The approach lowers barriers between teams working on Python prototypes and Fortran/C production codes, supporting tighter inter-community collaboration.
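As a rough illustration of the kind of code such a translator consumes, here is a minimal type-annotated Python kernel. The function and its annotation style are our own example, not taken from PyGyro or the Pyccel documentation, and the exact annotation syntax a given translator expects may differ.

```python
import numpy as np

def axpy(alpha: float, x: 'float[:]', y: 'float[:]'):
    """y <- alpha*x + y, written as a plain typed loop that a
    Python-to-Fortran/C translator can turn into compiled code."""
    n = x.shape[0]
    for i in range(n):
        y[i] = alpha * x[i] + y[i]

# The same source still runs as ordinary Python, so the prototype and the
# generated low-level version can be checked against each other.
if __name__ == "__main__":
    x = np.ones(5)
    y = np.arange(5.0)
    axpy(2.0, x, y)
    print(y)   # [2. 3. 4. 5. 6.]
```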
Workshop
Livestreamed
Recorded
TP
W
DescriptionPerformance models allow examining the scaling behavior of an application and identifying performance bottlenecks at an early stage. However, application runtime measurements are often tainted by noise on HPC systems. To tackle this problem, hardware counters can be exploited. Yet, not all counters are equally suitable for performance modeling. Some are noise-sensitive, vary strongly during repeated runs, or are faulty on some systems. Thus, appropriate counters must be identified on the inspected systems, requiring extensive testing and complex setups. This paper presents an automated approach that identifies noise-resilient hardware counters on HPC systems. Our approach automatically builds the setup, runs experiments, analyzes and ranks counters based on noise resilience, and presents results via a graphical or console interface. We demonstrate our approach on two HPC clusters and show how the developed tool, alongside the proposed metrics, enables detecting noise-resilient hardware counters in HPC.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionLarge, sparse linear systems are pervasive in modern science and engineering, and Krylov subspace solvers are an established means of solving them. Yet convergence can be slow for ill-conditioned matrices, so practical deployments usually require preconditioners. Markov chain Monte Carlo (MCMC)-based inversion can generate such preconditioners and accelerate Krylov iterations, but its effectiveness depends on parameters whose optima vary across matrices; manual or grid search is costly. We present an AI-driven framework recommending MCMC parameters for a given linear system. A graph neural surrogate predicts preconditioning speed from A and MCMC parameters. A Bayesian acquisition function then chooses the parameter sets most likely to minimize iterations. On a previously unseen ill-conditioned system, the framework achieves better preconditioning with 50% of the search budget of conventional methods, yielding about a 10% reduction in iterations to convergence. These results suggest a route for incorporating MCMC-based preconditioners into large-scale systems.
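The general surrogate-plus-acquisition loop behind such a framework can be sketched as follows. This toy version uses a kernel-regression surrogate and a lower-confidence-bound acquisition on synthetic data purely to illustrate the pattern; it is not the authors' graph neural surrogate, their acquisition function, or their parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def surrogate_predict(params, X_obs, y_obs, length=0.5):
    """Toy kernel-regression surrogate standing in for the graph neural
    surrogate: predicts Krylov iteration counts for candidate MCMC params."""
    if len(X_obs) == 0:
        return np.zeros(len(params)), np.ones(len(params))
    d = np.linalg.norm(params[:, None, :] - np.asarray(X_obs)[None, :, :], axis=-1)
    w = np.exp(-(d / length) ** 2)
    w_sum = w.sum(axis=1) + 1e-12
    mean = (w @ np.asarray(y_obs)) / w_sum
    var = 1.0 / (1.0 + w_sum)            # crude uncertainty proxy
    return mean, var

def true_iterations(p):
    # Hypothetical "expensive" evaluation: run MCMC preconditioning + Krylov solve.
    return 100 + 50 * np.sum((p - 0.3) ** 2) + rng.normal(0, 1)

X_obs, y_obs = [], []
for step in range(10):
    cand = rng.uniform(0, 1, size=(64, 2))        # candidate MCMC parameter sets
    mean, var = surrogate_predict(cand, X_obs, y_obs)
    acq = mean - 1.0 * np.sqrt(var)               # lower-confidence-bound acquisition
    best = cand[np.argmin(acq)]
    X_obs.append(best)
    y_obs.append(true_iterations(best))

print("best params:", X_obs[int(np.argmin(y_obs))], "iterations:", min(y_obs))
```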
Workshop
Livestreamed
Recorded
TP
W
DescriptionLarge, sparse linear systems are pervasive in modern science and engineering, and Krylov subspace solvers are an established means of solving them. Yet convergence can be slow for ill-conditioned matrices, so practical deployments usually require preconditioners. Markov chain Monte Carlo (MCMC)-based matrix inversion can generate such preconditioners and accelerate Krylov iterations, but its effectiveness depends on parameters whose optima vary across matrices; manual or grid search is costly. We present an AI-driven framework recommending MCMC parameters for a given linear system. A graph neural surrogate predicts preconditioning speed from A and MCMC parameters. A Bayesian acquisition function then chooses the parameter sets most likely to minimise iterations. On a previously unseen ill-conditioned system, the framework achieves better preconditioning with 50% of the search budget of conventional methods, yielding about a 10% reduction in iterations to convergence. These results suggest a route for incorporating MCMC-based preconditioners into large-scale systems.
Workshop
Livestreamed
Recorded
TP
W
DescriptionEfficient synchronization of memory mapping information is increasingly important as systems evolve toward greater resource disaggregation and heterogeneity.
When memory is exported between processes, establishing a shared mapping often requires costly page table walks and updates, particularly in fault-driven models.
To study these costs, we implement an XPMEM-inspired shared-memory driver and evaluate techniques to reduce mapping overhead.
Our approach combines parallel batched on-demand pinning, bypassing unnecessary cache-policy lookups in PFN mapping, and dynamic re-registration to expand registered regions without tearing down existing mappings.
In our evaluation, these optimizations reduce cold-start memory copy by up to 13.22x over XPMEM in multi-process workloads, with particular benefits for collective communication patterns and rapidly resizing buffers.
While developed in a shared-memory context, the results highlight general strategies—avoiding redundant translation work, enabling parallel mapping operations, and preserving mapping state—that can inform the design of memory management in disaggregated systems, including GPU disaggregation and heterogeneous memory environments.
Paper
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionSparse tensor contractions are a core computational primitive in scientific computing and machine learning. Effective optimization of such contractions through loop permutation/tiling remains an open challenge. Our work performs the first comprehensive comparative analysis of data access costs and memory requirements for loop permutations for sparse tensor contractions. Based on these insights, we develop FaSTCC, a novel hashing-based parallel implementation of sparse tensor contractions. FaSTCC introduces a new 2D tiled contraction-index-outer scheme and a corresponding tile-aware design. Using probabilistic modeling, our approach automatically chooses between dense and sparse output tile accumulators and selects a suitable tile size. We evaluate FaSTCC across two CPU platforms and a range of real-world workloads, demonstrating significant speedups on benchmarks from FROSTT and from quantum chemistry.
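As background for the hashing-based accumulation idea, a minimal (non-tiled, sequential) sparse contraction over a shared index can be written with dictionary accumulators. This is only the textbook pattern, not FaSTCC's 2D tiled scheme or its dense/sparse accumulator selection.

```python
from collections import defaultdict

def sparse_contract(A, B):
    """Contract sparse tensors A[i, k] and B[k, j] over the shared index k.

    A and B are COO-style dicts mapping index tuples to values; the output
    accumulator is a hash map, the simplest analogue of hashing-based
    accumulation in a sparse contraction.
    """
    # Bucket B's nonzeros by the contraction index k.
    by_k = defaultdict(list)
    for (k, j), v in B.items():
        by_k[k].append((j, v))

    C = defaultdict(float)
    for (i, k), a in A.items():
        for j, b in by_k.get(k, ()):
            C[(i, j)] += a * b
    return dict(C)

A = {(0, 0): 1.0, (0, 2): 2.0, (1, 1): 3.0}
B = {(0, 0): 4.0, (2, 1): 5.0, (1, 1): 6.0}
print(sparse_contract(A, B))   # {(0, 0): 4.0, (0, 1): 10.0, (1, 1): 18.0}
```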
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis talk will focus on the Python Community.
Workshop
Livestreamed
Recorded
TP
W
DescriptionTrillion-parameter, science-tuned foundation models can speed discovery, but only inside an AI-native Scientific Discovery Platform (SDP) that connects models to tools, data, HPC, and robotics. I argue for community co-development of the SDP, via open interfaces, shared schedulers, knowledge substrates, provenance, and evaluation, alongside shared models. Early results suggest that such a co-designed stack can boost throughput and reliability in materials and bio workflows, enabling human–AI teams to turn knowledge into experiments and experiments into insight.
Workshop
Livestreamed
Recorded
TP
W
DescriptionHeterogeneous supercomputers have become the standard in HPC. GPUs in particular have dominated the accelerator landscape, offering unprecedented performance in parallel workloads and unlocking new possibilities in fields like AI and climate modeling. With many workloads becoming memory-bound, improving the communication latency and bandwidth within the system has become a main driver in the development of new architectures. The Grace Hopper Superchip (GH200) is a significant step in the direction of tightly coupled heterogeneous systems, in which all CPUs and GPUs share a unified address space and support transparent fine-grained access to all main memory on the system. We characterize both intra- and inter-node memory operations on the Quad GH200 nodes of the new Swiss National Supercomputing Centre Alps supercomputer, and show the importance of careful memory placement on example workloads, highlighting tradeoffs and opportunities.
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
DescriptionSuccessful science relies on data in many forms. Data management in computational science historically revolved around the inputs and outputs of simulations and middleware libraries executed on HPC platforms. More recently, increasingly complex workflows, coupled with a greater variety of computational tasks, have prompted the development of new data management software to meet new requirements, including streaming services, specialized model repositories, and vector databases. Rapid change driven by AI and the incorporation of geographically distributed resources is altering the landscape yet again. This talk will first discuss recent progress, successes, and lessons learned through our efforts to better understand and accelerate the development of data management services for computational science. It then pivots to consider trends in how we pursue science and their implications for data management software going forward.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe Fifth International Symposium on Quantitative Co-Design of Supercomputers considers combining two methodologies—collaborative co-design and data-driven analysis—to realize the potential of supercomputing more fully. The rapidly evolving nature of HPC and its importance to scientific discovery make it appropriate for both co-design processes and data-driven approaches. By focusing on these two proven methodologies with the broad-community attention of an SC25 audience, we will address many identified challenges impacting HPC. Our scope includes applications, system software, workflows, and hardware health. Without a clear and standard set of best practices in place, we have many missed opportunities. This symposium will bring together leaders in the field to review current efforts across centers and discuss areas that show potential. This year, we will focus on opportunities and challenges in holistic performance engineering. We consider the question: How can quantitative co-design be applied to address integrated and comprehensive concerns surrounding supercomputer performance engineering?
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Paper
Algorithms
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
State of the Practice
Livestreamed
Recorded
TP
DescriptionAs high performance computing (HPC) systems scale in size, system-wide hardware failure rates increase. Historical data from previous large-scale HPC installations illustrate this trend, with the mean time between failures (MTBF) decreasing steadily over the past decade. Recent studies of artificial intelligence and machine learning (AI/ML) training extrapolate that MTBF will decline even further for future GPU-accelerated systems. As MTBF decreases, the impact of mean time to repair (MTTR) becomes more pronounced, highlighting the need for efficient recovery strategies.
This paper presents an automated failure management system that addresses this issue by minimizing MTTR through real-time decision-making based on failure statistics. Our key contributions include a centralized meta-database for event history analysis including correlated events, fine-grained multi-strike repair policies, and an automated recovery framework. Deployed on the Aurora supercomputer, the proposed system has reduced MTTR by up to 84X compared to manual servicing, leading to significant cost savings and decreased system downtime.
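The interplay of MTBF and MTTR is easy to see from the standard steady-state availability formula; the numbers below are illustrative only and are not measurements from Aurora or any other system.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative numbers only (not figures from any specific machine).
mtbf = 4.0                           # hours between system-wide failures
for mttr in (2.0, 0.5, 2.0 / 84):    # manual repair, faster repair, 84x faster repair
    a = availability(mtbf, mttr)
    print(f"MTTR={mttr:6.3f} h  ->  availability={a:.3%}")
```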
Workshop
Livestreamed
Recorded
TP
W
DescriptionExtreme-scale workflows play a crucial role in increasing scientific productivity by helping scientists orchestrate today’s scientific campaigns. With the recent developments in artificial intelligence (AI) and its growing application in scientific campaigns, we have started to witness the integration of AI tasks into scientific workflows (workflows for AI). There can also be significant benefits for existing scientific workflows when augmenting them with AI (AI for workflows). Given the early stage of these emerging topics, more effort is needed to outline the role of AI in scientific workflows. The First International Symposium on Artificial Intelligence and Extreme-Scale Workflows will provide the scientific community with a dedicated platform for discussing current efforts, opportunities, and open challenges in AI and scientific workflows. This symposium will feature invited talks given by the leaders in the field and aims to further advance AI workflows by fostering new connections and ideas among the workshop participants.
Workshop
Livestreamed
Recorded
TP
W
DescriptionIncorporating Quantum Computing (QC) into High Performance Computing (HPC) environments (commonly referred to as HPC+QC integration) marks a pivotal step in advancing computational capabilities for scientific research. This paper provides a firsthand account of integrating a superconducting 20-qubit quantum computer into the HPC infrastructure at [InstitutionAnonymizedForReview], one of the first practical implementations of its kind. This yielded four key lessons: (1) quantum computers have stricter facility requirements than classical systems, yet their deployment in HPC environments is feasible when preceded by a rigorous site survey to ensure compliance; (2) quantum computers are inherently dynamic systems that require regular recalibration that is automatic and controllable by the HPC scheduler; (3) redundant power and cooling infrastructure is essential; and (4) effective hands-on onboarding should be provided for both quantum experts and new users. By sharing these experiences, we aim to provide a roadmap for other HPC centers considering similar integrations.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe present the Federated Inference Resource Scheduling Toolkit (FIRST), a framework enabling Inference-as-a-Service across distributed High-Performance Computing (HPC) clusters. FIRST provides cloud-like access to diverse AI models, like Large Language Models (LLMs), on existing HPC infrastructure. Leveraging Globus Auth and Globus Compute, the system allows researchers to run parallel inference workloads via an OpenAI-compliant API on private, secure environments. This cluster-agnostic API allows requests to be distributed across federated clusters, targeting numerous hosted models. FIRST supports multiple inference backends (e.g., vLLM), auto-scales resources, maintains "hot" nodes for low-latency execution, and offers both high-throughput batch and interactive modes. The framework addresses the growing demand for private, secure, and scalable AI inference in scientific workflows, allowing researchers to generate billions of tokens daily on-premises without relying on commercial cloud infrastructure.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe present an assignment successfully implemented in a third-year Parallel Computing course of a Computer Engineering degree program. Since 2017/2018, we have proposed a different problem each academic year to illustrate the conceptual and technical differences of using different parallel programming models. The problem chosen for this year implements a flood simulation. Cloud fronts move across a scenario, dropping water. The simulation computes the flow of water from the highest ground to the lowest ground, leaking out at the scenario boundaries or accumulating in sinks and dams to form pools and lakes. The assignment addresses foundational concepts, such as race conditions, reductions, collective operations, and point-to-point communications. It also offers critical choices related to cache-aware programming or using atomic operations vs. more memory accesses with ancillary structures. The supporting materials for previous assignments in this series are available at https://gamuva.infor.uva.es/peachy-assignments/
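For readers unfamiliar with the assignment style, a simplified sequential step of such a flood simulation might look like the sketch below; the actual assignment, its data layout, and its boundary handling are more elaborate, and the point of the exercise is to parallelize updates like these safely.

```python
import numpy as np

def flood_step(ground, water, rate=0.25):
    """Move a fraction of each cell's water toward its lowest 4-neighbor.

    Writing into a fresh array avoids the race condition a naive in-place
    parallel version would have, and the final total below is the kind of
    reduction students are asked to parallelize.
    """
    h = ground + water                       # free surface height
    new_water = water.copy()
    rows, cols = ground.shape
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            nbrs = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            ni, nj = min(nbrs, key=lambda p: h[p])   # lowest neighboring surface
            if h[ni, nj] < h[i, j]:
                flow = min(water[i, j], rate * (h[i, j] - h[ni, nj]))
                new_water[i, j] -= flow
                new_water[ni, nj] += flow
    return new_water

rng = np.random.default_rng(0)
ground = rng.random((16, 16))
water = np.zeros_like(ground)
water[8, 8] = 5.0                            # a cloud drops water in one cell
for _ in range(50):
    water = flood_step(ground, water)
print("total water (reduction):", water.sum())   # conserved in this simplified sketch
```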
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionThis video showcases in situ computational steering of a 2D lattice Boltzmann method (LBM) computational fluid dynamics (CFD) simulation. In the shown application, users can dynamically modify the barriers in the fluid's path by selecting a file that describes barrier locations. For this artistic simulation, barrier locations were generated by using an edge detection algorithm on Van Gogh's "Starry Night."
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionWe showcase two distinct applications of computational fluid dynamics for patient health monitoring driven by the HARVEY fluid dynamics solver. The first portion of the animation depicts blood flow coursing through a representative human aorta. The inlet of the flow begins at the ascending aorta, passing through the aortic arch until it reaches the descending aorta at the bottom of the geometry. The fluid flow diverges upon entering the aortic arch, dividing into right and left subclavian and common arteries. Flow is constantly pulsing throughout the lifetime of the animation to mimic circulation in vivo. The second portion of the animation portrays the movement of a circulating tumor cell (CTC) as it progresses through a geometry. Fluid flow streamlines guide the path of the CTC as it traverses through the grid-like structure. Visualization of CTC deformation, primarily during interactions with geometry and blood flow, provides critical insights for long-term monitoring of patient health.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionMany tensor processing algorithms require computing a tensor times matrix chain (TTMc) operation, and this operation is frequently the bottleneck in such algorithms. This work develops strategies for accelerating a TTMc using low-precision hardware.
We present a novel scheme for scaling the TTMc operands to prevent overflow. Our scheme exploits the Kronecker Product structure of a TTMc to allow for efficient application. Additionally, we present the first forward error bound for TTMc, and we develop a heuristic for ordering the individual TTM operations within a TTMc to reduce the forward error.
Our scaling scheme allows for a TTMc on the Miranda Tensor to be computed without overflow on an NVIDIA A100 GPU using FP16 arithmetic, exhibiting a speedup of up to 2× over FP64 arithmetic, even when accounting for the overhead of applying scaling. We show that our TTM ordering heuristic is effective for some tensors in certain cases.
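To illustrate only the basic motivation for scaling, the sketch below rescales both operands of a single mode-0 tensor-times-matrix product so that the arithmetic fits in FP16 and then undoes the scaling afterwards. The poster's Kronecker-structured scaling scheme, its error bound, and its TTM ordering heuristic are not reproduced here.

```python
import numpy as np

def scaled_ttm(T, M):
    """Mode-0 tensor-times-matrix in float16 with simple max-abs scaling of
    both operands to avoid overflow, rescaled back afterwards. Only the
    generic idea; not the poster's Kronecker-structured scheme."""
    sT = np.max(np.abs(T)) or 1.0
    sM = np.max(np.abs(M)) or 1.0
    T16 = (T / sT).astype(np.float16)
    M16 = (M / sM).astype(np.float16)
    unfolded = T16.reshape(T.shape[0], -1)        # mode-0 unfolding
    out16 = M16 @ unfolded
    out = out16.astype(np.float64) * (sT * sM)    # undo the scaling
    return out.reshape((M.shape[0],) + T.shape[1:])

rng = np.random.default_rng(1)
T = rng.normal(scale=1e4, size=(8, 6, 5))   # entries this large would overflow raw FP16 products
M = rng.normal(scale=1e3, size=(4, 8))
ref = np.einsum('ij,jkl->ikl', M, T)        # float64 reference
approx = scaled_ttm(T, M)
print("relative error:", np.linalg.norm(approx - ref) / np.linalg.norm(ref))
```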
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe introduce open-source frameworks for deploying and running large language models (LLMs) within high-performance computing (HPC) environments. One such framework targets high-throughput batch inference, enabling users to submit LLM requests in an OpenAI-compatible format as traditional HPC jobs. Another framework is based on Ray Serve and provides dynamic, on-demand allocation of HPC resources for interactive LLM serving via APIs, supporting applications such as chatbots and AI agents. The third framework is a production-grade, always-on platform for real-time interaction that relies on a dedicated GPU server for model inference. These frameworks are designed to abstract away underlying computer system complexities, allowing researchers to request and utilize GPU resources for model inference without manual environment setup. We describe these systems and report LLM-specific performance metrics. Results demonstrate that the proposed frameworks enable scalable and resource-efficient LLM serving across both batch and interactive workloads in support of diverse user needs.
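For readers unfamiliar with the request shape such services accept, here is a minimal OpenAI-compatible chat-completion request using only the standard library. The endpoint URL, model name, and token are placeholders; the real deployments described above handle authentication, scheduling, and batching themselves.

```python
import json
import urllib.request

# Hypothetical endpoint, model, and token; the actual service URL, model list,
# and authentication mechanism depend on the deployment.
URL = "http://localhost:8000/v1/chat/completions"
TOKEN = "example-token"

payload = {
    "model": "example-model",
    "messages": [
        {"role": "user", "content": "Summarize the drift-kinetic equation in one sentence."}
    ],
    "max_tokens": 128,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json",
             "Authorization": f"Bearer {TOKEN}"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)
print(reply["choices"][0]["message"]["content"])
```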
Paper
Algorithms
Applications
Data Analytics
Livestreamed
Recorded
TP
DescriptionSubgraph counting (SGC) is a fundamental component of many important applications, including cybersecurity, drug discovery, social network analysis, and natural language processing. However, current SGC approaches can only handle very small patterns (aka subgraphs) because the computational load increases exponentially with the size of the pattern. To overcome this limitation for certain patterns, we introduce a new technique and algorithm called Fringe-SGC for counting the exact number of times a subgraph occurs in a larger graph. Our approach conventionally searches only for the “core” of the subgraph and then uses set-based methods to compute the number of occurrences that the “fringes” add. Our evaluation shows that Fringe-SGC is able to count the instances of many subgraphs that are too large for state-of-the-art SGC frameworks. Furthermore, Fringe-SGC running on a GPU outperforms the state-of-the-art GPU-based SGC frameworks by up to 20× on average, especially on patterns with many fringes.
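A toy instance of the core-plus-fringe idea is counting k-star patterns, where the core is a single vertex and the k leaves are fringe nodes, so the count follows from simple combinatorics over vertex degrees. This illustrates the set-based flavor of the approach but is not the Fringe-SGC algorithm itself.

```python
from math import comb
from collections import defaultdict

def count_k_stars(edges, k):
    """Count k-star subgraphs (one core vertex with k fringe neighbors).

    The core is a single vertex and the k leaves are "fringe" nodes, so the
    count reduces to choosing k neighbors of each vertex: sum_v C(deg(v), k).
    """
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sum(comb(d, k) for d in deg.values())

# A small graph: a path 0-1-2-3 plus an edge 1-4.
edges = [(0, 1), (1, 2), (2, 3), (1, 4)]
print(count_k_stars(edges, 2))   # vertex 1: C(3,2)=3; vertex 2: C(2,2)=1; total 4
```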
Workshop
Livestreamed
Recorded
TP
W
DescriptionGraphics Processing Units (GPUs) have become essential in accelerating artificial intelligence workloads. We developed and implemented a hands-on, lab-intensive Special Topics course in GPU programming for undergraduate and graduate STEM students. This paper describes the course design, pedagogy, lessons learned, student feedback, and recommendations for integrating GPU programming into STEM curricula.
Workshop
Livestreamed
Recorded
TP
W
DescriptionIn this paper, we investigate three cross-facility data streaming architectures, Direct Streaming (DTS), Proxied Streaming (PRS), and Managed Service Streaming (MSS). We examine their architectural variations in dataflow paths and deployment feasibility, and detail their implementation using the DS2HPC architectural framework and the SciStream memory-to-memory streaming toolkit on the production-grade ACE infrastructure at OLCF. We present a workflow-specific evaluation of these architectures using three synthetic workloads derived from the streaming characteristics of scientific workflows. Through simulated experiments, we measure streaming throughput, round-trip time, and overhead under work sharing, work sharing with feedback, and broadcast and gather messaging patterns commonly found in AI-HPC communication motifs. Our study shows that DTS offers a minimal-hop path, resulting in higher throughput and lower latency, whereas MSS provides greater deployment feasibility and scalability across multiple users but incurs significant overhead. PRS lies in between, offering a scalable architecture whose performance matches DTS in most cases.
Workshop
AI, Machine Learning, & Deep Learning
Clouds & Distributed Computing
Performance Evaluation, Scalability, & Portability
Scientific & Information Visualization
Partially Livestreamed
Partially Recorded
TP
W
DescriptionModern datacenters operate at unprecedented scale, supporting HPC and AI workloads while consuming hundreds of megawatts of power. Their reliability is challenged by complex interdependencies across cooling, power, and network subsystems, where failures can cascade into downtime and degraded performance. Existing monitoring approaches, largely threshold- or correlation-based, struggle to isolate root causes within high-dimensional, evolving telemetry. We present PACE (Pattern and Causal Exploration), an ML-based framework that combines unsupervised correlation clustering with supervised, lag-aware Granger causality to uncover subsystem structure and directed causal pathways from multivariate telemetry. PACE yields interpretable causal graphs and subsystem heatmaps that align with physical processes and control logic, providing actionable insights for operations. Finally, we discuss how embedding PACE into digital twin architectures enables causal-informed what-if reasoning, advancing reliability and efficiency in datacenters.
Invited Talk
AI, Machine Learning, & Deep Learning
Big Data
Weather Prediction
Livestreamed
Recorded
TP
DescriptionBuilding directly upon our work that was a finalist for the Gordon Bell Prize for Climate Modelling at SC23, this talk presents the next stage of our pioneering research in real-time weather prediction with unprecedented precision on the supercomputer Fugaku. Our new experiment for the Expo 2025 Osaka Kansai marks a world's first: the simultaneous use of two Multi-Parameter Phased Array Weather Radars for data assimilation (DA). This novel configuration provided an unprecedented data stream assimilated in real time by our Big Data Assimilation system on the supercomputer Fugaku, enabling 30-second-refresh, 30-minute-lead precipitation forecasts at a 500-meter resolution.
Building on the capabilities of this system, the talk then shifts focus to the future of prediction science, exploring our multifaceted research at RIKEN to fuse DA with AI/ML. Motivated by the need to combine the strengths of physics-based models with data-driven techniques and adapt to modern GPU-centric architectures, we will present five examples of our hybrid methodologies. These include integrating convolutional LSTMs with numerical weather prediction, developing deep neural network observation operators for satellite data, and using DA to iteratively refine AI surrogate models. The talk will conclude with a forward-looking perspective on fully Bayesian estimation, discussing how emerging techniques like conditional diffusion models could achieve the ultimate goal of DA: directly sampling atmospheric states from observations.
Awards and Award Talks
Livestreamed
Recorded
TP
DescriptionThe SC25 Test of Time Award recognizes the lasting impact of The Globus Striped GridFTP Framework and Server, presented at SC05. This talk will trace the journey from GridFTP to the Globus Transfer service, highlighting the architectural innovations that enabled secure, high-performance, and scalable data movement for science. The SC05 paper introduced design principles of modularity, extensibility, and robustness that have stood the test of time. These principles supported deployment in research networks worldwide, adoption across major scientific collaborations in physics, climate science, and astronomy, and high-performance demonstrations at successive SC conferences. Building on this foundation, GridFTP evolved into today’s Globus service, which now supports hundreds of thousands of researchers, tens of thousands of endpoints, and billions of transfers annually. We will reflect on key lessons learned in building infrastructure that not only met immediate needs but also continues to adapt over decades of scientific and technological change.
Exhibitor Forum
Data Analytics
Livestreamed
Recorded
TP
XO/EX
DescriptionAs AI workloads grow, traditional air-cooling methods in data centers are proving inefficient and environmentally costly due to high water and energy use, lower packing densities, and larger footprints. This talk explores how the high performance computing (HPC) community’s experience—especially in liquid cooling—can guide more sustainable AI infrastructure. Case studies from Bavaria’s LRZ and NHR@FAU show how hot-water cooling improves power efficiency, reduces resource use, and enables waste heat reuse. Additionally, GPU power capping can enhance energy savings with minimal performance loss while improving system stability. The AI field is urged to adopt proven HPC thermal management strategies for sustainable scaling.
Exhibitor Forum
Networking
Livestreamed
Recorded
TP
XO/EX
DescriptionAs AI workloads continue to demand the latest and most powerful GPUs, the TDP (thermal design power) is pushing data center thermal management toward the limits of conventional cooling technologies. Advanced thermal management for AI clusters, such as liquid cooling/direct-to-chip with cold plates and next-gen thermal management technology, such as immersion liquid cooling, may require new types of optical connectivity solutions for the next wave of infrastructure.
This session will examine why the shift to next-generation optics is no longer optional but essential. We will highlight how today’s data center thermal management solutions are increasingly hybrid, utilizing various cooling solutions that can operate seamlessly in air, liquid, and immersion systems. The discussion will demonstrate how sealed optical cables meet the infrastructure of the future.
Attendees will gain insights into how sealed optical cables are evaluated for reliability and performance, such as material compatibility, mean time between failures (MTBF), and signal integrity. These factors directly influence reliability, scalability, and cost efficiency in high-performance environments.
Next-generation optical connectivity solutions should be viewed as strategic enablers that future-proof an infrastructure from becoming the weakest link, providing a foundation for growth, adaptability, and resilience in the era of AI-driven data centers.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionLegacy Fortran codes remain central to many scientific applications but are poorly suited to today’s GPU-accelerated heterogeneous architectures. Manual porting to performance-portable frameworks like Kokkos is time-consuming and requires deep domain expertise, creating a major barrier to modernization. We present a novel autonomous agentic AI workflow that leverages large language models (LLMs) to translate and optimize Fortran kernels into portable Kokkos C++ implementations. Our framework employs specialized agents for translation, compilation, execution, error handling, testing, and optimization, orchestrated with SLURM and Spack on diverse GPU platforms. Using OpenAI’s proprietary models, we achieved fully autonomous kernel translation at a cost of under $3.50 per kernel, while iterative optimization consistently improved GFLOPS performance. In contrast, open-source models like Llama 4 Maverick performed poorly. In the poster session, we will present the workflow design, benchmark results across architectures, token cost analysis, and optimization gains, highlighting opportunities for scalable, fully autonomous modernization of scientific codebases.
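The overall agent loop can be pictured roughly as below: translate, build, test, and feed diagnostics back until the kernel passes. The model call, build commands, and test harness here are placeholders, not the poster's actual SLURM/Spack orchestration.

```python
import subprocess

def llm_translate(fortran_src, feedback=""):
    # Placeholder for the translation agent; in the poster's workflow this
    # would be an API call to a proprietary model, not shown here.
    raise NotImplementedError("plug in your model call here")

def try_build_and_test(cpp_src):
    """Compile and run the candidate Kokkos kernel, returning (ok, log)."""
    with open("kernel.cpp", "w") as f:
        f.write(cpp_src)
    # Hypothetical build/test commands; a real pipeline would go through
    # CMake/Spack and a batch scheduler instead.
    for cmd in (["make", "kernel"], ["./kernel", "--self-test"]):
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True)
        except FileNotFoundError as exc:
            return False, str(exc)
        if proc.returncode != 0:
            return False, proc.stdout + proc.stderr
    return True, "ok"

def translate_kernel(fortran_src, max_attempts=5):
    feedback = ""
    for _ in range(max_attempts):
        cpp_src = llm_translate(fortran_src, feedback)
        ok, log = try_build_and_test(cpp_src)
        if ok:
            return cpp_src
        feedback = log          # hand compiler/test errors back to the agent
    raise RuntimeError("translation did not converge")
```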
Workshop
short paper
Livestreamed
Recorded
TP
W
DescriptionSince the advent of software-defined networking (SDN), its architecture has allowed better network flexibility, capacity planning, and improved performance, especially for traffic engineering. Additionally, network operators are using in-band network telemetry (INT) to build efficient programmable networks by controlling various network flow patterns. The capability of the programmable data plane, intertwined with Artificial Intelligence (AI), has enabled self-driven network services, such as the Hecate tool, for seamless and scalable network management and control. In this paper, we explore different in-production traffic patterns to perform AI-driven traffic control and engineering. We then develop a novel queuing algorithm based on the observed traffic patterns to enhance the traffic engineering of Hecate with source routing at the edge. This work feeds into P4 programmability to show how source routing can use machine learning to deploy self-engineering networks.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionMedication non-adherence is a major public health issue, especially within the behavioral health domain, with traditional measurement methods often being unreliable. This study uses a machine learning approach to predict medication adherence in a large cohort of over 446,000 patients with major depressive disorder, filtered out of a very large-scale dataset containing over 36 million patient records, leveraging de-identified electronic health record data. Our XGBoost model achieved 88% accuracy and an ROC-AUC of 0.94, demonstrating strong predictive performance. Crucially, the use of SHAP provided clinical interpretability, identifying key drivers of adherence, primarily from prescription data. This research highlights the potential of large-scale data and machine learning to enable targeted interventions, improving patient care and reducing healthcare costs.
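The train-then-explain pattern the study relies on looks roughly like the following sketch on synthetic data. The real work uses engineered EHR features, a far larger cohort, and careful validation, none of which is reproduced here; xgboost, shap, and scikit-learn must be installed.

```python
import numpy as np
import shap
import xgboost
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the study uses de-identified EHR features
# (prescription history, demographics, etc.), which are not reproduced here.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=5000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgboost.XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X_tr, y_tr)
print("ROC-AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

# SHAP values attribute each prediction to input features, which is what
# gives the model its clinical interpretability in the study.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_te)
print("mean |SHAP| per feature:", np.abs(shap_values).mean(axis=0).round(3))
```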
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionFrom rural South Africa to continent-wide deployments, I share my journey making HPC accessible and inclusive. Learn strategies for building clusters, fostering community, and empowering new HPC users across Africa.
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionDiscover how Dell Pro Max with GB10 empowers AI developers to build, fine-tune, and deploy large-scale models—right from their desk. This session introduces a purpose-built AI device designed for prototyping, local inference, and edge deployment, all within a secure, high-performance environment. Learn how to start in a local sandbox and seamlessly scale into enterprise-grade infrastructure using the Dell AI Factory with NVIDIA. Whether you're an individual developer or part of a research team, this presentation will show how GB10 transforms the AI development lifecycle with unmatched speed, flexibility, and data control.
Art of HPC
Panel
Art of HPC
Creativity
Not Livestreamed
Not Recorded
TP
DescriptionHigh performance computing and AI are sparking a thrilling new era in creativity and art. This dynamic panel brings together visionary artists and cutting-edge technologists to reveal how they harness the power of large-scale computation to craft works that dazzle the senses, resonate culturally, and often surprise us in delightful ways. From mesmerizing generative imagery to immersive, interactive experiences, the panelists will showcase how HPC opens up artistic possibilities at scales and resolutions once unimaginable.
But the excitement doesn’t stop there—this is a two-way street. Artistic practices aren’t just benefiting from big tech; they’re reshaping how we think about computing itself, transforming its purpose from mere efficiency and scientific rigor to a wellspring of inspiration and expressive power. Join us for an accessible and fascinating peek into a future where silicon and creativity join forces, co-designing bold new forms of culture and unlocking realms of imagination we’ve yet to explore.
Workshop
Livestreamed
Recorded
TP
W
DescriptionIn this paper, we present our work on developing a sensor-based device capable of large-scale environmental data collection. We also outline how we integrated this technology into a STEM education workshop for high school students. Conducted over three consecutive days as part of the Toledo EXCEL program, the workshop aimed to introduce students to foundational computing concepts, including data acquisition, machine learning, and artificial intelligence, through accessible, hands-on activities. Participants used a custom-built IoT sensor system in conjunction with visual programming tools like MakeCode to create a smart plant care assistant. They also explored basic machine learning by training classifiers using Teachable Machine. We describe both the technical development of the sensor device and its role in engaging students with real-world computing applications. Finally, we outline our plans to enhance future workshops with advanced topics such as parallel computing and real-time data visualization dashboards.
Exhibitor Forum
Data Analytics
Livestreamed
Recorded
TP
XO/EX
DescriptionOver a million people die in traffic collisions each year—making autonomous driving not just a technical challenge, but a humanitarian imperative. Advanced Driver Assistance Systems (ADAS) and self-driving systems powered by AI are advancing rapidly, but traditional architectures create massive inefficiencies. Each stage of the AI lifecycle—from data ingestion to simulation and training—often replicates 100+ PB of data multiple times, leading to petabyte-scale duplication, delays, and millions in wasted infrastructure. By unifying the entire physical AI pipeline—including HiL, SiL, LiDAR, telemetry, FMV, CAN-Bus, and neural network training—into a single data platform, innovators can eliminate redundancy and accelerate delivery. This approach supports millions of GPU cores running real-world simulations across thousands of virtual vehicle models, with seamless access to data across global sites. It reduces provisioning times from weeks to minutes and saves $50–$100M per program—enabling faster validation and dramatically accelerating the path to safer roads through scalable, real-time AI.
Workshop
Livestreamed
Recorded
TP
W
DescriptionRealizing the promise of large-scale foundation models for scientific discovery—enabling self-driving laboratories, hypothesis generation, and more—requires unprecedented computational scale and multidisciplinary efforts to prepare diverse scientific data. While only a few organizations can train state-of-the-art models from scratch (e.g., trillions of parameters, tens of trillions of tokens), advances in training strategies and fine-tuning have expanded accessibility. Simultaneously, breakthroughs in training methodologies and data quality are dramatically reducing training costs and improving the performance of even smaller AI models. As AI models advance in general-purpose tasks, the scientific community is refining methods to evaluate and enhance their scientific reasoning capabilities, a critical challenge for trustworthy AI in science. This workshop, catalyzed by the Trillion Parameter Consortium (TPC), will highlight collaborations in scientific skills evaluation, performance optimization, federated learning, responsible AI, and other topics. SC24 drew 33 submissions, with 13 presented to nearly 200 attendees, underscoring the rapid evolution of this field.
Paper
Algorithms
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
State of the Practice
Livestreamed
Recorded
TP
DescriptionTransformer models rely on high performance computing (HPC) resources for inference, where soft errors are inevitable in large-scale systems, making the reliability of the model particularly critical. Existing fault tolerance frameworks for transformers are designed at the operation level without architectural optimization, leading to significant computational and memory overhead, which in turn reduces protection efficiency and limits scalability to larger models. In this paper, we implement module-level protection for transformers by treating the operations within the attention module as a single kernel and applying end-to-end fault tolerance. This method provides unified protection across multi-step computations, while achieving comprehensive coverage of potential errors in the nonlinear computations. For linear modules, we design a strided algorithm-based fault tolerance (ABFT) that avoids inter-thread communication. Experimental results show that our end-to-end fault tolerance achieves up to 7.56x speedup over traditional methods, with an average fault tolerance overhead of 13.9%.
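For context, the classic checksum-based ABFT construction for a plain matrix product is sketched below. The paper's contribution goes further (strided checksums without inter-thread communication and end-to-end protection across the fused attention module), so this is only the baseline idea.

```python
import numpy as np

def abft_matmul(A, B):
    """Multiply with ABFT checksums: append a checksum row to A and a
    checksum column to B; the last row/column of the product then equal the
    column/row sums of C, so a corrupted entry breaks the equality."""
    Ac = np.vstack([A, A.sum(axis=0)])                   # checksum row
    Bc = np.hstack([B, B.sum(axis=1, keepdims=True)])    # checksum column
    return Ac @ Bc

def verify(Cc, rtol=1e-10):
    """Check the computed checksums against freshly recomputed row/column sums."""
    C = Cc[:-1, :-1]
    ok_cols = np.allclose(Cc[-1, :-1], C.sum(axis=0), rtol=rtol, atol=1e-8)
    ok_rows = np.allclose(Cc[:-1, -1], C.sum(axis=1), rtol=rtol, atol=1e-8)
    return ok_cols and ok_rows

rng = np.random.default_rng(0)
A, B = rng.normal(size=(64, 32)), rng.normal(size=(32, 48))
Cc = abft_matmul(A, B)
print("clean result passes:", verify(Cc))       # True

Cc[3, 7] += 1.0                                  # simulate a soft error
print("corrupted result passes:", verify(Cc))   # False
```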
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionAs unprecedented AI investment lands into cloud infrastructure, HPC is no longer just simulation and number-crunching, it’s becoming a strategic capability. This session will show how today’s cloud investments for AI create an on-ramp for HPC growth, how HPC customers can leverage that tide, and what cloud providers need to deliver to meet this moment.
Workshop
Livestreamed
Recorded
TP
W
DescriptionModern scientific simulations and instruments generate data volumes that overwhelm memory and storage, throttling scalability. Lossy compression mitigates this by trading controlled error for reduced footprint and throughput gains, yet optimal pipelines are highly data- and objective-specific, demanding compression expertise. GPU compressors supply raw throughput but often hard-code fused kernels that hinder rapid experimentation and underperform in rate–distortion. We present FZModules, a heterogeneous framework for assembling error-bounded custom compression pipelines from high-performance modules through a concise, extensible interface. We further utilize an asynchronous, task-backed execution library that infers data dependencies, manages memory movement, and exposes branch- and stage-level concurrency for powerful asynchronous compression pipelines. Evaluating three pipelines built with FZModules on four representative scientific datasets, we show that they can match the end-to-end speedup of fused-kernel GPU compressors while achieving rate–distortion similar to higher-fidelity CPU or hybrid compressors, enabling rapid, domain-tailored design.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionHigh performance computing (HPC) systems face an urgent sustainability crisis, with leading facilities consuming 10–60 MW and incurring multimillion-dollar annual energy costs. Traditional schedulers like SLURM and PBS treat energy as secondary, leading to 30%–50% energy waste above theoretical optimal levels. We present GATSched, a multi-objective graph attention network scheduler that models HPC workloads as dynamic graphs with specialized attention heads. Our approach jointly optimizes energy efficiency, performance, and resource utilization using four attention mechanisms: energy, performance, balance, and temporal. Through trace-driven simulation validation on 389,604 production jobs across three HPC architectures, GATSched achieves 27%–35% energy reduction while maintaining substantial resource utilization. In the poster session, we will demonstrate the GAT architecture and benchmark comparisons through interactive visualizations.
Workshop
Livestreamed
Recorded
TP
W
DescriptionHigh-performance computing drives scientific discovery, but increasing system complexity and user demands generate a growing volume of diverse technical support issues. This trend underscores the need for automated tools that can extract clear, accurate, and relevant frequently asked questions from support tickets. We addressed this need by developing a novel pipeline for autonomous technical support that began by filtering tickets by anomaly frequency and recency. An instruction-tuned large language model then cleaned and summarized the tickets. Next, unsupervised semantic clustering identified subclusters of similar tickets within broader topics, which were globally ranked by size, cohesion, and separation. A generation module powered by a large language model produced structured lists of frequently asked questions from the top-ranked subclusters. Evaluation by subject matter experts confirmed that our method produced understandable, accurate, and pertinent content. The extraction of detailed insights from ticket data enhances the efficiency of support workflows and facilitates scientific research.
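A stripped-down version of the cluster-and-rank step might look like the sketch below, using TF-IDF embeddings, k-means, and a size-times-cohesion score on toy ticket summaries. The actual pipeline's LLM-based cleaning, summarization, and FAQ-generation stages are not shown.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy ticket summaries standing in for LLM-cleaned tickets.
tickets = [
    "job stuck in queue after maintenance", "queued job never starts",
    "module load fails for new compiler", "compiler module not found",
    "quota exceeded on scratch filesystem", "scratch purge removed my files",
]

X = TfidfVectorizer().fit_transform(tickets)
k = 3
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

def score(cluster_id):
    """Rank clusters by size times cohesion (mean pairwise cosine similarity)."""
    members = X[labels == cluster_id].toarray()
    sims = members @ members.T          # TF-IDF rows are L2-normalized
    n = len(members)
    cohesion = (sims.sum() - n) / max(n * (n - 1), 1)
    return n * cohesion

ranked = sorted(range(k), key=score, reverse=True)
for c in ranked:
    print(f"cluster {c}: {[t for t, l in zip(tickets, labels) if l == c]}")
```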
Workshop
Livestreamed
Recorded
TP
W
DescriptionIn this paper, we introduce a new algorithm for generating large-scale permutations on distributed systems. Permutations are used in many applications, including statistical analysis, machine learning, sampling, graph neural networks, matching, crypto-analysis, and bootstrapping. In data science, the permutation is also commonly referred to as a shuffle operation, since applying it reorganizes elements in an entirely random manner.
Our algorithm is computationally efficient, easy to understand, and scales to large systems. We measure the performance of our new permutation generation scheme on a cluster of NVIDIA DGX-A100s, using up to 256 NVIDIA A100 GPUs. We show that we can generate a permutation of 137 billion values in approximately 1.1 seconds, with a throughput of 124 billion elements per second.
Panel
AI, Machine Learning, & Deep Learning
Architectures
SC Community Hot Topics
Not Livestreamed
Not Recorded
TP
DescriptionGenerative AI (GenAI) is rapidly emerging as a transformative force in chip design, offering new opportunities to automate and optimize stages across the electronic design automation (EDA) workflow. This panel will explore the current capabilities of AI-driven tools, assess key challenges such as benchmarking, data scarcity, regulatory compliance, and semantic understanding, and examine how GenAI must evolve to meet the specialized demands of HPC and exascale architectures. Experts from industry, academia, and national laboratories will share diverse perspectives on how AI is reshaping chip development and discuss whether we are ready to trust AI in critical hardware design tasks. Attendees will gain insights into practical applications, future research directions, and the broader impact of GenAI on the EDA business and HPC ecosystem.
Paper
Algorithms
Applications
State of the Practice
Livestreamed
Recorded
TP
DescriptionGenerative models have demonstrated strong performance in conditional settings and can be viewed as a form of data compression, where the condition serves as a compact representation. However, their limited controllability and reconstruction accuracy restrict their practical application to data compression. In this work, we propose an efficient latent diffusion framework that bridges this gap by combining a variational autoencoder with a conditional diffusion model. Our method compresses a small number of keyframes into latent space and uses them as conditioning inputs to reconstruct the remaining frames via generative interpolation, eliminating the need to store latent representations for every frame. This approach enables accurate spatiotemporal reconstruction while significantly reducing storage costs. Experimental results across multiple datasets show that our method achieves up to 10× higher compression ratios than rule-based state-of-the-art compressors such as SZ3, and up to 63% better performance than leading learning-based methods under the same reconstruction error.
Keynote
Community Meetings
Future Trends
Keynote
TP
W
TUT
XO/EX
DescriptionOur SC25 keynote speaker will participate in a book club-style chat about his latest publication, "Gigatrends." Bring your curiosity and join the discussion!
Keynote
Keynote
Livestreamed
Recorded
TP
W
TUT
XO/EX
DescriptionIn his talk, Thomas Koulopoulos will explore the forces driving change in the 21st century and offer a roadmap to help us navigate the disruption, and the opportunities, that come with it. From the future of healthcare and work, to the rise of our own digital selves in a new era of trust, his keynote will spark both curiosity and optimism about what lies ahead.
Invited Talk
AI, Machine Learning, & Deep Learning
Power Use Monitoring & Optimization
Livestreamed
Recorded
TP
DescriptionThere are a number of 100 megawatt-scale AI facilities currently in operation around the world. The next generation of AI facilities arriving in 2026-2027 are 1-2 gigawatt campuses, and with their order-of-magnitude increase in scale come both challenges and opportunities. This talk explores three notable challenges, and some solutions and best practices:
1. Phased delivery: A 100MW facility can be built as a single building and fully delivered weeks after the first server goes live. A 2GW campus is delivered in phases, typically 150-400MW per quarter, resulting in a one- to two-year period where the campus capacity is partially available. This timeline collides unhelpfully with the AI hardware lifecycle, in which new, superior products are being introduced every 9-15 months.
2. Concentration and lifecycle management: A 2GW campus allows high-capacity network interconnect perfect for intensive AI training. However, within two years the AI hardware deployed in the campus will be surpassed by later generations of hardware. At this point the AI hardware is typically used for inference serving, and for that purpose a concentrated deployment is pessimal for both latency and redundancy. Managing the lifecycle of a 2GW campus across multiple generations of cutting-edge AI servers requires intentional, planned "crop rotation."
3. More power, more problems: The goodput challenge. Hardware failure rates increase linearly with scale, thus a 2GW campus will see 20X the number of server and network failures experienced by a 100MW facility. The goal remains to maximize goodput for the large jobs running in the facility, and both local redundancy and fast recovery become critical as scale increases.
Paper
HPC for Machine Learning
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionPipeline parallelism has emerged as a predominant approach for deploying LLMs across distributed nodes. However, it often suffers from performance limitations caused by pipeline bubbles, which primarily result from imbalanced computation delays across batches. Existing methods attempt to address this through hybrid scheduling of chunked prefill and decode tokens. However, such methods may experience significant fluctuations due to either insufficient prefill tokens or uneven distribution of decode tokens, ultimately leading to computational imbalance. To overcome these inefficiencies, we present gLLM, a globally balanced system incorporating token throttling. Our token throttling mechanism is a fine-grained scheduling policy that independently regulates the quantities of prefill and decode tokens by leveraging global information from the inference system. Furthermore, gLLM runtime adopts an asynchronous execution and message passing architecture. Evaluations show that gLLM delivers 11% to 398% higher maximum throughput compared to state-of-the-art pipeline or tensor parallelism systems, while simultaneously maintaining lower latency.
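A rough sketch of what a fine-grained token throttling policy can look like: each micro-batch is assembled under separate prefill and decode token budgets so that per-stage compute stays roughly balanced. The queue layout, budgets, and chunking rule are assumptions for illustration, not gLLM's actual scheduler:

```python
from collections import deque

def throttle_batch(prefill_queue, decode_queue, prefill_budget=2048, decode_budget=256):
    """Assemble one micro-batch, capping prefill and decode tokens separately
    so per-batch compute stays roughly constant across pipeline stages."""
    batch, p_used, d_used = [], 0, 0
    while decode_queue and d_used < decode_budget:       # each decode request adds one token
        batch.append(("decode", decode_queue.popleft())); d_used += 1
    while prefill_queue and p_used < prefill_budget:
        req = prefill_queue[0]                           # req = (request_id, remaining_prompt_len)
        take = min(req[1], prefill_budget - p_used)      # chunk long prompts across batches
        batch.append(("prefill", req[0], take)); p_used += take
        if take == req[1]:
            prefill_queue.popleft()
        else:
            prefill_queue[0] = (req[0], req[1] - take)
    return batch, p_used, d_used

prefills = deque([(0, 3000), (1, 500)])
decodes = deque(range(100, 400))
b, p, d = throttle_batch(prefills, decodes)
print("micro-batch entries:", len(b), "| prefill tokens:", p, "| decode tokens:", d)
```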
SCinet Network Research Exhibition
Not Livestreamed
Not Recorded
DescriptionNRI103s1
Exhibitor Forum
Hardware and Architecture
Livestreamed
Recorded
TP
XO/EX
DescriptionDespite the extraordinary heat dissipation potential that liquid cooling provides to microprocessors, the global demand for compute performance has already surpassed the capabilities of entry-level liquid cooling technologies due to the compounding challenge of higher heat flux coinciding with lower allowable processor temperatures. The response has been a growing effort to decrease facility water temperatures using historically inefficient methods such as mechanical chillers. As processor thermal resistance targets plummet, the coolant distribution unit (CDU) is quickly becoming a significant contributor to the temperature drop between the chip and facility water. Strategic Thermal Labs teamed up with a North American hyperscaler to perform a detailed datacenter cooling analysis to provide insight on the capital and operational expense reduction that is possible through elimination of CDUs in various global climates. Furthermore, discussion is provided on the viability of merging the FWS and TCS water loops at scale.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionThe field of graph machine learning has seen significant growth with the success of graph convolutional networks (GCNs). However, most traditional GCNs are designed for static graphs. In the real world, graphs are constantly evolving—new users join social networks, molecules change shape, and data streams into a network. Re-computing a GCN's embeddings for the entire graph every time a small change occurs is computationally expensive and inefficient. This research explores two more efficient approaches: a standard incremental update method and a novel meta-learning approach, which are then benchmarked to compare their performance.
SCinet Network Research Exhibition
Not Livestreamed
Not Recorded
DescriptionNo NRI.
Paper
Data Analytics, Visualization & Storage
Livestreamed
Recorded
TP
DescriptionLSM tree-based key-value stores are widely deployed in modern cloud storage systems thanks to high data storage efficiency and retrieval capabilities. The compaction in the LSM tree, however, results in severe performance bottlenecks, especially in large-sized value cases. While key-value separation methods mitigate the performance bottlenecks caused by compaction, the existing methods do not fully address merge-sorting during compaction and expensive garbage collection (GC). We propose gParaKV, a GPGPU-empowered KV store with a KV separation mechanism, leveraging the GPGPU parallel technology to accelerate merge-sorting in compaction and GC. gParaKV embraces a GPGPU bitmap structure, parallel data marking, and a parallel GC mechanism. These critical components curtail the overhead of merge-sorting and GC by virtue of parallel computing. We compare it with state-of-the-art KV stores under various workloads. The experimental results show that gParaKV can improve the write performance and GC efficiency compared to existing key-value separation-based KV stores.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionThe Sparsely-Gated Mixture of Experts (MoE) has seen a surge in use over the last year. This is primarily motivated by a desire to increase the size of language models without a proportional increase in the total number of FLOPs. Due to their popularity, there is a large volume of work studying distributed MoEs for large-scale training. In this work, we find that there is room to improve the performance of MoEs when run on a single GPU—an important case for inference and fine-tuning. In our efforts to improve single-GPU performance, we implement Triton kernels for grouped matrix multiplications and gated linear units. These kernels support fusing operations for token routing in order to reduce the number of accesses to slow off-chip memory.
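A small NumPy sketch of the grouped-GEMM pattern that such kernels accelerate: tokens are routed to experts and each expert's group is multiplied by its own weight matrix. Shapes and the toy top-1 router are assumptions; the poster's Triton kernels fuse the gather, GEMMs, and gating rather than looping over experts on the host as done here:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff, n_experts = 512, 64, 128, 4
x = rng.standard_normal((n_tokens, d_model))
w = rng.standard_normal((n_experts, d_model, d_ff))          # one weight matrix per expert
expert_id = rng.integers(0, n_experts, size=n_tokens)        # toy top-1 routing decision

def grouped_matmul(x, w, expert_id):
    """Gather each expert's tokens and run one dense GEMM per group; a fused
    kernel would perform the gather, GEMMs, and scatter without HBM round trips."""
    y = np.empty((x.shape[0], w.shape[2]))
    for e in range(w.shape[0]):
        idx = np.nonzero(expert_id == e)[0]
        y[idx] = x[idx] @ w[e]                               # per-group GEMM
    return y

y = grouped_matmul(x, w, expert_id)
print(y.shape)                                               # (512, 128)
```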
Paper
Data Analytics, Visualization & Storage
Livestreamed
Recorded
TP
DescriptionThis work proposes VGC, a versatile and ultra-fast GPU lossy compression framework designed to address the growing data challenges in high performance computing (HPC). VGC captures dimension information in scientific data and supports three compression algorithms, achieving high compression ratios across diverse HPC domains. Built with a highly optimized GPU kernel, VGC delivers state-of-the-art throughput with error control. In addition to compression ratio and speed, VGC supports two distinctive modes that enhance its versatility. Memory-efficient compression uses a kernel fission design to compute compressed size, allocate only the required GPU memory, and compress data without waste, effectively reducing memory footprint. Selective decompression introduces an early stopping mechanism that enables direct access to regions of interest without decompressing the entire dataset.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe present the design, implementation, and evaluation of an elective course on GPU architecture and programming, offered to undergraduate and graduate students during Fall 2024 and Spring 2025. Aimed at equipping students with skills to build AI agents and workflows using AWS GPUs and SageMaker, the course began with foundational GPU architecture and parallel computing and progressed to hands-on development using Python. Students gained experience configuring cloud-based GPU instances, implementing parallel algorithms, and deploying scalable AI solutions. Learning outcomes were evaluated via assessments, course evaluations, and anonymous surveys. The results reveal that (1) AWS is an effective and economical platform for practical GPU programming, (2) experiential learning significantly enhanced technical proficiency, and (3) the course strengthened students’ problem-solving and critical thinking skills through tools such as TensorBoard and HPC profilers, which exposed performance bottlenecks and scaling issues. Our findings underscore the pedagogical value of integrating parallel computing into STEM education.
Paper
Algorithms
Applications
Data Analytics
Data Analytics, Visualization & Storage
Livestreamed
Recorded
TP
DescriptionSSD-based graph processing systems have emerged as a cost-effective solution for handling large-scale graphs. However, the large access granularity (e.g., 4KB) of an SSD often leads to low I/O efficiency. In this paper, we propose Graphago, an activity-aware graph preprocessing technique for SSD-based graph processing systems. The main idea of Graphago is the combined use of three key designs that synergistically optimize the graph storage and organization based on the active extent of graph data, thereby achieving both high I/O efficiency and satisfactory processing performance: 1) a dual-centrality activity prediction model to efficiently predict the active extent of each vertex, 2) an activity-neighborhood graph ordering technique to minimize read amplification without sacrificing graph traversal efficiency, and 3) an active-data-balanced graph partitioning scheme to address the I/O imbalance problem. Our evaluation results show that Graphago outperforms state-of-the-art SSD-based graph processing systems by up to 4.8×.
Paper
Architectures & Networks
BP
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionGreenMix is motivated by the renewed interest in asymmetric multi-core processors and the emergence of the serverless computing model. Asymmetric multi-cores offer better energy and performance trade-offs by placing different core types on the same die. However, existing serverless scheduling techniques do not leverage these benefits. GreenMix is the first serverless work to reduce energy and serverless keep-alive costs while meeting QoS targets by leveraging asymmetric multi-cores. GreenMix employs randomized sketching, tailored for serverless function execution and keep-alive, and leverages processor asymmetry, to perform within 10% of the optimal solution in terms of energy efficiency and keep-alive cost reduction. GreenMix’s effectiveness is demonstrated through evaluations with production-grade serverless function invocation traces on different clusters made up of ARM big.LITTLE and Intel Alder Lake asymmetric multi-core processors. GreenMix outperforms competing serverless frameworks and asymmetric core-aware schedulers, offering a novel approach for energy-efficient serverless computing.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe complexity of traditional power system analysis workflows presents significant barriers to efficient decision-making in modern electric grids. This paper presents GridMind, a multi-agent AI system that integrates Large Language Models (LLMs) with deterministic engineering solvers to enable conversational scientific computing for power system analysis. The system employs specialized agents coordinating AC Optimal Power Flow and N-1 contingency analysis through natural language interfaces while maintaining numerical precision via function calls. GridMind addresses workflow integration, knowledge accessibility, context preservation, and expert decision-support augmentation. Experimental evaluation on IEEE test cases demonstrates that the proposed agentic framework consistently delivers correct solutions across all tested language models, with smaller LLMs achieving comparable analytical accuracy with reduced computational latency. This work establishes agentic AI as a viable paradigm for scientific computing, demonstrating how conversational interfaces can enhance accessibility while preserving numerical rigor essential for critical engineering applications.
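A hedged sketch of the function-calling pattern the abstract describes: the language model chooses a tool and its arguments, while all numerics stay inside deterministic solvers. The tool names, argument schema, and solver stubs below are invented placeholders, not GridMind's API:

```python
# Minimal sketch of routing an LLM "tool call" to a deterministic solver.
# Tool names, arguments, and solver stubs are illustrative assumptions;
# GridMind's agents wrap production AC-OPF and contingency-analysis codes.

def run_ac_opf(case: str) -> dict:
    # placeholder for a real AC optimal power flow solver call
    return {"case": case, "objective_usd_per_hr": 12345.6, "converged": True}

def run_n_minus_1(case: str, branch: int) -> dict:
    # placeholder for a real N-1 contingency analysis
    return {"case": case, "outaged_branch": branch, "violations": 0}

TOOLS = {"run_ac_opf": run_ac_opf, "run_n_minus_1": run_n_minus_1}

def dispatch(tool_call: dict) -> dict:
    """Execute the function an LLM selected, keeping all numerics in the solver."""
    return TOOLS[tool_call["name"]](**tool_call["arguments"])

# a tool call as an LLM with function calling might emit it
print(dispatch({"name": "run_ac_opf", "arguments": {"case": "ieee118"}}))
```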
Early Career
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionIn this session, three professionals will share inspiring stories from their mentoring journeys—how learning from mentors, and later becoming mentors themselves, has shaped their careers, perspectives, and purpose. Whether you’re just starting out or well into your professional journey, this conversation will offer insights and inspiration on the lasting impact of mentorship.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe develop several machine learning (ML)-based methods to estimate the resources required for massively parallel chemistry computations, e.g., coupled-cluster methods, to guide application users before they run expensive simulations on supercomputers. By estimating computational resources, our ML-based methods predict optimal runtime parameters (number of nodes, tile sizes, etc.). With these predictions, we answer users' questions such as i) what is the minimum execution time for a given problem size?, ii) what number of nodes and tile sizes achieve this minimum execution time?, and iii) what about a supercomputer for which only a limited number of past application runs are available to train an ML model? Our work offers several ML models trained on simulations of a coupled-cluster method run on the Frontier, Aurora, and Perlmutter supercomputers. We devise two strategies based on active and generative learning. By inquiring about costs beforehand, users can save a significant amount of expense.
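A minimal sketch, assuming synthetic training data and a generic scikit-learn regressor, of how a model trained on past runs can be swept over candidate configurations to answer the "minimum execution time" question; the feature set and cost model are illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
# synthetic history of past runs: (problem_size, nodes, tile_size) -> runtime
X = rng.uniform([100, 4, 16], [500, 128, 256], size=(400, 3))
runtime = X[:, 0] ** 2 / (X[:, 1] * 50) + np.abs(X[:, 2] - 96) * 0.2 + rng.normal(0, 1, 400)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, runtime)

# sweep candidate configurations for a fixed problem size and pick the cheapest
problem_size = 300
candidates = np.array([[problem_size, n, t] for n in (16, 32, 64, 128)
                                            for t in (32, 64, 96, 128, 192)])
pred = model.predict(candidates)
best = candidates[np.argmin(pred)]
print("predicted best (nodes, tile):", best[1:], "| predicted runtime:", pred.min())
```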
Panel
Architectures
SC Community Hot Topics
Livestreamed
Recorded
TP
DescriptionHardware specialization can provide large performance boosts, but special-purpose systems require significant investments. Therefore, scientific communities that do not have enough resources risk falling behind because they can no longer piggyback on the advancement of general-purpose hardware. Scientific workloads include a wide variety of important applications, such as climate modeling and fluid simulation. This realization motivates modular HPC systems where specialized silicon and hardware can be easily generated and integrated into future systems. Realizing the goal of a modular HPC system requires pathfinding, multi-disciplinary research, and community engagement. In this panel, we will debate diverse strategies from different communities around the globe; how silicon, hardware, and software should evolve to support modularity; the need for standardization and how to realize it; as well as what figures of merit we should strive for.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionEffectively leveraging quantum computing requires generating and manipulating a desired quantum state using a quantum circuit. Quantum circuit synthesis (QCS) is bottlenecked by the exponential complexity of circuit verification via quantum simulation. Diffusion models are promising QCS candidates, because they circumvent quantum simulation during training. Existing diffusion-based QCS models demonstrate success for unconstrained circuits, but prove insufficient for producing hardware topology-constrained circuits—a common restriction for modern quantum machines. This work introduces a novel hardware-aware conditioning framework that enables topology-constrained QCS. Our approach delivers up to 8x higher success rate compared to the baseline for a state-of-the-art hardware-agnostic QCS model, proving the necessity for hardware-aware QCS.
Workshop
Livestreamed
Recorded
TP
W
DescriptionData movement is a key bottleneck in applications such as machine learning and scientific computing. Some software techniques address this by computing on subsets of data, but this still requires reading the entire dataset to determine the subset. We propose a hardware-software co-design approach for iterative methods centered around two operations: filtering and updating. We introduce a domain-specific language that supports these computational patterns to enable PIM programming. Since filter and update are simple pointwise operations, PIM hardware requires only limited compute capability.
In this work, we investigate gradient descent for an ill-conditioned convex optimization function using this approach and map it to a PIM architecture using the PIMEval architectural simulator. Filter load and update store operations sparsify the data set by 83% while requiring as few as 1.5x more iterations to converge compared to traditional gradient descent approaches, with a net reduction in data movement of 3.9x.
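A NumPy sketch of the filter-and-update pattern applied to gradient descent on a toy ill-conditioned quadratic: only components whose gradient magnitude passes a threshold are loaded and stored back. The objective, threshold, and step size are assumptions; the actual work maps these operations onto PIM hardware via the PIMEval simulator:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
d = rng.uniform(1.0, 100.0, n)          # toy ill-conditioned quadratic: f(x) = 0.5 * x^T diag(d) x
x = rng.standard_normal(n)
lr, tau = 1e-2, 1e-3                     # step size and filter threshold

touched = 0
for it in range(2000):
    grad = d * x                         # gradient of the toy objective
    mask = np.abs(grad) > tau            # FILTER: load only "active" components
    x[mask] -= lr * grad[mask]           # UPDATE: store back only the filtered subset
    touched += mask.sum()
    if np.linalg.norm(grad) < 1e-6:
        break

print("iterations:", it + 1, "| fraction of elements touched:", touched / ((it + 1) * n))
```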
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionIn high-performance computing, scratch storage holds intermediate data while archival storage holds long-term data. These distinct objectives lead to separate implementations, implying additional costs and requiring explicit data transfers. This separation is inefficient, reduces reliability, and increases complexity. We propose a policy-driven, unified solution that reconfigures existing technologies for seamless adaptation between modes. Our synthetic and real-world benchmarks demonstrate that naively combining scratch and archive degrades performance by 35%. However, our policy enhancements eliminate this performance difference while maintaining system stability.
Birds of a Feather
Storage
Livestreamed
Recorded
TP
XO/EX
DescriptionHDF5 has been a vital HPC I/O library for over 25 years, continually evolving to integrate modern technologies and architectures. This session brings together HDF5 developers and community members to discuss best practices and showcase exciting new features for utilizing HDF5 on today's HPC systems. We will begin with a panel of community experts who will focus on forthcoming, new, and established HDF5 features that represent best practices. Following this, the audience will have the opportunity to share their experiences, insights, and questions, making them an integral part of this collaborative journey to advance HDF5.
Paper
Architectures & Networks
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionUnified memory (UM) technologies simplify memory management across CPU and GPU domains in GPU-accelerated heterogeneous architectures through transparent data migration. However, the default migration mechanism can severely degrade performance when applications oversubscribe GPU memory. Existing approaches to mitigating this performance degradation often fail to generalize, as they target specific application types, require specialized hardware, or integrate opaque classification methods.
We introduce HEterogeneous Locality Metrics (HELM), a novel set of semantically meaningful metrics designed to characterize UM access patterns across diverse applications. These metrics are quantified using readily accessible UM driver telemetry data, providing users with tractable and interpretable UM memory characterizations. Such insight is critical for selecting optimal UM migration and placement policies under oversubscription. We demonstrate HELM’s accuracy and interpretability through access pattern analysis across various UM workloads. Experimental results on real systems show that HELM effectively guides policy selection, which outperforms default UM behavior by 3.5X on average.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionModern supercomputing systems exhibit heterogeneous node configurations, where seemingly identical hardware exhibits significant performance variations due to memory capacity differences, manufacturing tolerances, and deployment conditions. This heterogeneity impacts the efficiency of scientific applications built on frameworks like AMReX, leading to substantial computational waste on leadership-class systems. We present performance-aware and relation-aware load balancing algorithms specifically designed for scientific applications, like AMReX on heterogeneous HPC clusters. Our approach uses empirically measured node performance characteristics and a relative performance matrix to optimize task distribution across diverse computational resources.
Evaluation on NERSC Perlmutter with 14 representative AMReX computational kernels demonstrates 99.9% scheduling efficiency, achieving performance improvements of 4.4%-11.5% over traditional methods in moderate heterogeneity scenarios (A100 40GB vs. 80GB) and up to 300x improvements in extreme CPU-GPU mixed configurations where homogeneous methods fail to utilize CPU resources effectively. The algorithms handle million-task workloads with O(n log n + nm) complexity while maintaining practical deployment feasibility.
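A hedged sketch of performance-aware placement: a greedy longest-processing-time heuristic that weights each task's cost by the destination node's empirically measured speed. Task costs and node speeds are synthetic assumptions, not the poster's measured relative performance matrix:

```python
import heapq

def balance(task_costs, node_speeds):
    """Greedily place the largest tasks on the node that will finish them
    earliest, given per-node relative performance measured empirically."""
    heap = [(0.0, i) for i in range(len(node_speeds))]   # (projected finish time, node)
    heapq.heapify(heap)
    assignment = {}
    for t, cost in sorted(enumerate(task_costs), key=lambda kv: -kv[1]):
        finish, node = heapq.heappop(heap)
        finish += cost / node_speeds[node]                # faster nodes absorb more work
        assignment[t] = node
        heapq.heappush(heap, (finish, node))
    makespan = max(f for f, _ in heap)
    return assignment, makespan

tasks = [8, 5, 5, 4, 3, 2, 2, 1]                          # relative task costs (assumed)
speeds = [1.0, 1.0, 0.5]                                  # e.g., two fast nodes and one slower node
assign, makespan = balance(tasks, speeds)
print(assign, "| makespan:", makespan)
```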
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
DescriptionEfficient workload mapping and scheduling in heterogeneous HPC environments spanning IoT and edge devices to the cloud is essential for optimizing resource use, reducing makespan, and ensuring adaptability. This research explores advanced solutions for mapping and scheduling by surveying the available tools and techniques, including classical optimization methods, emerging AI-driven models, and hybrid quantum-inspired approaches, and by investigating the gaps among them.
For workflow-based workload mapping and scheduling, the study employs appropriate system and workload modeling and evaluates mixed-integer linear programming (MILP) for optimal assignment in smaller scenarios. In larger environments, a graph neural network and reinforcement learning (GNN-RL) framework scales efficiently by learning adaptive policies reflecting task dependencies and system characteristics.
For task-based workload mapping and scheduling, the proposed integrated AI scheduler (IAIS) framework dynamically manages resources in distributed, cloud, and HPC environments. IAIS combines recurrent neural networks (RNNs) and temporal convolutional networks (TCNs) to predict optimal task allocation. Enhanced with proximal policy optimization (PPO)-based reinforcement learning, IAIS effectively predicts throughput, minimizes latency, and maximizes resource utilization. Complementary machine-learning models (e.g., simpler RNNs) further expedite allocation of independent tasks, notably in cloud contexts.
Comparative evaluations of IAIS, MILP, and GNN-RL highlight their relative strengths in optimization performance, scalability, and resource efficiency. Specifically, IAIS and GNN-RL demonstrate strong adaptability and scalability within heterogeneous compute continuum environments, laying the groundwork for future cognitive scheduling assistants capable of real-time autonomous optimization.
Paper
HPC for Machine Learning
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionThe significant resource demands of LLM serving call for full utilization of heterogeneous GPUs. However, existing works often struggle to scale efficiently in heterogeneous environments due to their coarse-grained and static parallelization strategies.
In this paper, we introduce Hetis, a system optimized for heterogeneous GPU clusters. Hetis addresses two critical challenges: memory inefficiency caused by the mismatch between memory capacity and computational power, and computational inefficiency arising from performance gaps across different LLM modules. To tackle these issues, Hetis employs a fine-grained and dynamic parallelism design. Specifically, it selectively parallelizes compute-intensive operations to reduce latency and dynamically distributes attention computations to low-end GPUs at a head granularity, leveraging the distinct characteristics of each module. Additionally, Hetis features an online load dispatching policy, continuously optimizing performance by balancing network latency, computational load, and memory intensity. Evaluation results demonstrate Hetis can improve serving throughput by up to 2.25x and reduce latency by 1.49x.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionWe consider the problem of computing the singular value decomposition (SVD) of many relatively small matrices using GPUs. This is an essential component in various scientific applications, including computational chemistry, low-rank approximations, and others. Our approach is based on the parallel one-sided Jacobi algorithm, which has a large degree of parallelism, and also heavily relies on compute-bound level-3 BLAS operations, such as matrix multiply. Our approach uses two design strategies. The first one targets very small matrices using a single GPU kernel for the entire SVD operation. The second design strategy uses a blocked version of the parallel Jacobi algorithm, which supports matrices of arbitrary dimensions. The proposed solution supports any matrix shape (square, tall-skinny, or short-wide), requires no limitations on the matrix dimensions, and delivers superior performance against state-of-the-art solutions. This work is set to be released in the MAGMA library.
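For reference, a compact NumPy version of the one-sided Jacobi iteration underlying this approach: column pairs are rotated until mutually orthogonal, singular values emerge as the final column norms, and the accumulated rotations form V. The batched, blocked GPU implementation applies the same rotations to many small matrices concurrently; dimensions and tolerances here are illustrative:

```python
import numpy as np

def jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """One-sided Jacobi SVD: rotate column pairs of A until all pairs are
    orthogonal; singular values are the final column norms."""
    U = A.astype(float).copy()
    n = U.shape[1]
    V = np.eye(n)
    for _ in range(max_sweeps):
        converged = True
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = U[:, p] @ U[:, p]
                beta = U[:, q] @ U[:, q]
                gamma = U[:, p] @ U[:, q]
                if abs(gamma) > tol * np.sqrt(alpha * beta):
                    converged = False
                    zeta = (beta - alpha) / (2.0 * gamma)
                    sign = 1.0 if zeta >= 0 else -1.0
                    t = sign / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                    c = 1.0 / np.sqrt(1.0 + t * t)
                    s = c * t
                    rot = np.array([[c, s], [-s, c]])
                    U[:, [p, q]] = U[:, [p, q]] @ rot    # orthogonalize the column pair
                    V[:, [p, q]] = V[:, [p, q]] @ rot    # accumulate the right singular vectors
        if converged:
            break
    sigma = np.linalg.norm(U, axis=0)
    return U / sigma, sigma, V.T

A = np.random.default_rng(0).random((8, 5))
U, s, Vt = jacobi_svd(A)
print(np.allclose((U * s) @ Vt, A))   # reconstructs A from its SVD
```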
SCinet Network Research Exhibition
High Performance Networking with the São Paulo Backbone SP Linking 8 Universities and the Bella Link
1:00pm - 1:20pm CST Wednesday, 19 November 2025 Booth 3537 - SCinet Theater
Not Livestreamed
Not Recorded
DescriptionNRI104
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis symposium-style workshop aims to connect researchers, developers, and Python practitioners to share their experiences scaling Python-based applications and workflows on supercomputers. The goal is to provide a platform for topical discussion of best practices, hands-on demonstrations, and community engagement via open-source contributions to new libraries, runtimes, and frameworks. Based on talks and demos that survey and summarize best practices and recent success stories and developments, the workshop provides attendees a forum for expanding their knowledge of tools and techniques as well as opportunities to provide feedback to tool developers.
Birds of a Feather
Community Meetings
Livestreamed
Recorded
TP
XO/EX
DescriptionThe High Performance Software Foundation (HPSF) is a hub for open-source, high performance software with a growing set of member organizations and projects. It aims to advance portable software for diverse hardware by increasing adoption, aiding community growth, and enabling development efforts. It also fosters collaboration through working groups such as Continuous Integration and Benchmarking.
This BoF will feature a panel of HPSF community leaders who will give an overview of developments in HPSF over the past year. This will include a status update on CI/CD, Benchmarking, and Binary Packaging working group activities, news about HPSFCon 2026, news from new members and projects, and new outreach activities that HPSF is undertaking.
Join the High Performance Software Foundation BoF to connect with foundation members, learn directly from leadership about its impactful activities, and explore how you can contribute to and benefit from leading HPC open-source initiatives.
Workshop
Livestreamed
Recorded
TP
W
DescriptionRecent architectures integrate high-performance and power-efficient matrix engines.
These engines demonstrate remarkable performance in low-precision matrix multiplication, which is crucial in deep learning.
Several techniques have been proposed to emulate single- and double-precision general matrix-matrix multiplication (SGEMM and DGEMM, respectively) by leveraging such low-precision matrix engines.
In this study, we present emulation methods that significantly outperform conventional approaches.
On a GH200 Grace Hopper Superchip, the proposed DGEMM emulation achieves a 1.4x speedup and a 43% improvement in power efficiency compared to native DGEMM for sufficiently large problems.
The proposed SGEMM emulation achieves a 3.0x speedup and a 154% improvement in power efficiency compared to native SGEMM for sufficiently large problems.
Furthermore, compared to conventional emulation methods, the proposed emulation achieves more than 2x higher performance and superior power efficiency.
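A hedged sketch of the splitting principle behind such emulation: each float64 operand is split into float32 slices and the significant partial products are accumulated at higher precision. Production schemes (for example, Ozaki-style splitting onto tensor cores) use more slices and careful scaling; here the slice products are simply carried out in float64 to stand in for a matrix engine with wide accumulators:

```python
import numpy as np

def emulated_gemm(A, B):
    """Split float64 operands into float32 hi/lo slices; compute the three
    significant slice products and accumulate them at higher precision."""
    A_hi = A.astype(np.float32); A_lo = (A - A_hi).astype(np.float32)
    B_hi = B.astype(np.float32); B_lo = (B - B_hi).astype(np.float32)
    # On a matrix engine each slice product would use low-precision inputs
    # with wide accumulators; float64 matmuls stand in for that here.
    return (A_hi.astype(np.float64) @ B_hi.astype(np.float64)
            + A_hi.astype(np.float64) @ B_lo.astype(np.float64)
            + A_lo.astype(np.float64) @ B_hi.astype(np.float64))

rng = np.random.default_rng(3)
A, B = rng.standard_normal((256, 256)), rng.standard_normal((256, 256))
ref = A @ B
err_fp32 = np.abs(A.astype(np.float32) @ B.astype(np.float32) - ref).max()
err_emul = np.abs(emulated_gemm(A, B) - ref).max()
print(f"plain float32 GEMM error: {err_fp32:.1e}   emulated error: {err_emul:.1e}")
```

The printout shows the emulated result recovering several additional decimal digits over a plain float32 GEMM, which is the effect the workshop paper exploits at much higher performance on dedicated matrix engines.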
Tutorial
Livestreamed
Recorded
TUT
DescriptionHigh-performance networking technologies are generating a lot of excitement towards building next-generation high-end computing (HEC) systems for HPC and AI with GPUs, accelerators, data center processing units (DPUs), and a variety of application workloads. This tutorial provides an overview of these continuously evolving technologies, their architectural features, current market standing, and suitability for designing HEC systems. We present a bottom-up view of various major scale-out interconnects (IB, HSE, RoCE, Omni-Path, AWS-EFA, Cray/HPE Slingshot, and Fujitsu Tofu-D) as well as scale-up interconnects like NVLink/NVSwitch and AMD Infinity Fabric. Integration of these technologies into libraries such as UCX and Libfabric is also discussed. Emerging standards like Ultra Ethernet, UALink, and Scale-Up Ethernet (SUE) are also presented. Next, we provide an overview of GPU Direct RDMA technology, DPU/IPU technology (NVIDIA BlueField, AMD Pensando, Intel IPUs), and AI-specific hardware (Cerebras Wafer-Scale Engines and Intel/Habana-Gaudi processors). Finally, we provide an overview of sample performance numbers that can be harnessed from these networking technologies. The tutorial also includes a set of hands-on exercises to help attendees understand these technologies from the ground up, following the flow of the tutorial (networking technologies, MPI library integration, GPU-Awareness in MPI libraries, and DPU-Awareness in MPI libraries).
Paper
Algorithms
Applications
Architectures & Networks
Livestreamed
Recorded
TP
DescriptionWe present new branch-free algorithms for floating-point arithmetic at double, triple, or quadruple the native machine precision. These algorithms are the fastest known by at least an order of magnitude and are conjectured to be optimal, not only in an asymptotic sense, but in their exact FLOP count and circuit depth. Unlike previous algorithms, which either use complex branching logic or are only correct on specific classes of inputs, our algorithms have computer-verified proofs of correctness for all floating-point inputs within machine overflow and underflow thresholds. Compared to state-of-the-art multiprecision libraries, our algorithms achieve up to 11.7x the peak performance of QD, 34.4x over CAMPARY, 35.6x over MPFR, and 41.4x over FLINT.
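For background, the classical error-free transformations that multi-word arithmetic of this kind builds on; the snippet below is a textbook TwoSum and a simplified double-double addition in Python, not the paper's branch-free, computer-verified formulation:

```python
def two_sum(a, b):
    """Error-free transformation (Knuth): s + err is exactly a + b."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

def dd_add(x, y):
    """Add two double-double values represented as (hi, lo) pairs."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)

plain, dd = 0.0, (0.0, 0.0)
for _ in range(100_000):
    plain += 0.1
    dd = dd_add(dd, (0.1, 0.0))

# both errors are measured against the exact value 10000; the small double-double
# residual mostly reflects that 0.1 itself is not exactly representable in binary
print(abs(plain - 10_000), abs((dd[0] - 10_000) + dd[1]))
```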
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionThis work explores how HPC enhances self-adaptive machine learning for large-scale health and energy data, enabling efficient multimodal simulations and optimal trade-offs between computational cost, accuracy, and cloud deployment.
Invited Talk
AI, Machine Learning, & Deep Learning
Creativity
Livestreamed
Recorded
TP
DescriptionCreativity is often thought of as the pinnacle of human achievement, but now there is growing use of AI technologies for creative endeavors. We will talk about several examples of AI-supported creativity in applications such as culinary arts, music, and sustainable building materials, which have achieved large-scale industrial deployment and impact. Then we will discuss fundamental mathematical limit theorems for creativity, approaching those limits, and breaking those limits in moving from combinational creativity to transformational creativity. A key theme within the discussion is the high-performance computational requirements for creativity, including the large-scale computational group theory problems that must be solved in the information lattice learning approach to discovery and creativity.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe present a Python-native, GPU-accelerated LArTPC simulation (larnd-sim) built with Numba and CuPy and scaled on NERSC Perlmutter (AMD Milan + A100) and TACC Vista (Arm64 + GH200). Guided by Nsight Systems and Nsight Compute profiling, we reshape data (jagged arrays, sub-batching), reduce allocations and transfers via buffer reuse, and tune kernels (grid/block, register ceilings). A targeted refactor replaces Python loops with vectorized bulk operations and moves function evaluations out of kernels to precomputed lookups, cutting CPU overhead and GPU math. Runs show >50% peak-memory cuts and >1.5x speedups, retained at scale. These profiling techniques and optimization strategies generalize to other accelerated Python workloads.
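A toy illustration of the loop-to-bulk-operation refactor described above, written with NumPy so it runs anywhere; in GPU code the same pattern applies with CuPy arrays. The per-element function and data are invented stand-ins for larnd-sim's kernels:

```python
import numpy as np

rng = np.random.default_rng(4)
charge = rng.random(100_000)
drift_t = rng.random(100_000) * 10.0

def per_element(charge, drift_t, tau=3.0):
    # original style: Python loop applying a function element by element
    out = np.empty_like(charge)
    for i in range(charge.size):
        out[i] = charge[i] * np.exp(-drift_t[i] / tau)
    return out

def vectorized(charge, drift_t, tau=3.0):
    # refactor: one bulk array operation; on GPU, swap `np` for `cupy`
    return charge * np.exp(-drift_t / tau)

assert np.allclose(per_element(charge[:1000], drift_t[:1000]),
                   vectorized(charge[:1000], drift_t[:1000]))
```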
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionSparse attention is a core building block in many leading neural network models, from graph-structured learning to sparse sequence modeling. It can be decomposed into a sequence of three sparse matrix operations (3S): sampled dense-dense matrix multiplication (SDDMM), softmax normalization, and sparse matrix multiplication (SpMM). Efficiently executing the 3S computational pattern on modern GPUs remains challenging due to (a) the mismatch between unstructured sparsity and tensor cores optimized for dense operations, and (b) the high cost of data movement.
Previous works have optimized these sparse operations individually or addressed one of these challenges. This poster introduces Fused3S, the first fused 3S algorithm that jointly maximizes tensor core utilization and minimizes data movement. Across real-world graph datasets, Fused3S achieves significant speedup over state-of-the-art kernels on H100 and A30 GPUs.
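A hedged, unfused reference of the 3S sequence using scipy.sparse, which makes the three stages explicit; Fused3S performs them in a single kernel on tensor cores without materializing the intermediates. Matrix sizes and the sparsity pattern are illustrative assumptions:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(5)
n, d = 256, 32
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
mask = sp.random(n, n, density=0.05, format="csr", random_state=0)   # graph / sparsity pattern

# (1) SDDMM: compute Q K^T only at the mask's nonzero positions
rows, cols = mask.nonzero()
vals = np.einsum("ij,ij->i", Q[rows], K[cols])
S = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))

# (2) row-wise softmax over the sparse scores
for r in range(n):
    lo, hi = S.indptr[r], S.indptr[r + 1]
    if hi > lo:
        e = np.exp(S.data[lo:hi] - S.data[lo:hi].max())
        S.data[lo:hi] = e / e.sum()

# (3) SpMM: sparse attention weights times dense values
out = S @ V
print(out.shape)   # (256, 32)
```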
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionThis video visualizes data obtained from large-scale computations of an acoustic–gravity model implemented with the MFEM finite element library and performed on the El Capitan supercomputer. The 3D high-fidelity model computes the coupled ocean acoustic and surface gravity waves for a magnitude 8.7 earthquake scenario spanning the full margin of the Cascadia subduction zone that stretches 1,000 km from northern California to British Columbia. These computations are fundamental to enabling a newly developed digital twin methodology for real-time warning of tsunamis. Specifically, this Bayesian inversion-based digital twin employs acoustic pressure data from seafloor sensors, along with 3D coupled acoustic–gravity wave equations, to infer earthquake-induced spatiotemporal seafloor motion in real time and forecast tsunami propagation toward coastlines for early warning with quantified uncertainties. Details of this work are available in Henneking et al., 2025, to appear in the Proceedings of SC25.
Workshop
Performance Evaluation, Scalability, & Portability
Livestreamed
Recorded
TP
W
DescriptionHigher-level abstractions enable greater implementation freedom, which can be harnessed to yield greater portability and performance. At the same time, that higher level of abstraction can increase productivity through fewer lines of code, less learning, greater rigor, and fewer bugs.
This panel will feature representatives from both established and emerging high-level abstractions, to discuss their role in enabling high performance, portability, and productivity in scientific computing. We explore how the right level of abstraction can align with semantic intent while delegating performance, portability, and productivity concerns to implementation. Achieving this balance can be straightforward in some cases and challenging in others. This panel will examine when abstraction empowers P3 and when it falls short, offering insights drawn from practical experience and real-world examples.
Workshop
How effective is matrix reordering for improving performance of sparse matrix-vector multiplication?
12:10pm - 12:20pm CST Sunday, 16 November 2025 232
Livestreamed
Recorded
TP
W
DescriptionThis work evaluates the impact of matrix reordering on the performance of sparse matrix-vector multiplication across different multicore CPU platforms. Reordering can enhance performance by optimizing the non-zero element patterns to reduce total data movement and improve the load-balancing. We examine how these gains vary over different CPUs for different reordering strategies, focusing on both sequential and parallel execution. We address multiple aspects, including appropriate measurement methodology, comparison across different kinds of reordering strategies, consistency across machines, and impact of load imbalance.
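A small example of one such reordering strategy, reverse Cuthill-McKee via SciPy, together with a bandwidth check and a consistency test for the permuted SpMV; the matrix is synthetic, and the workshop paper compares several strategies beyond RCM across machines:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

rng = np.random.default_rng(6)
A = sp.random(2000, 2000, density=0.002, format="csr", random_state=1)
A = A + A.T                                   # symmetrize so RCM applies cleanly

perm = reverse_cuthill_mckee(A.tocsr(), symmetric_mode=True)
A_perm = A.tocsr()[perm][:, perm]             # apply the same permutation to rows and columns

def bandwidth(M):
    r, c = M.nonzero()
    return np.abs(r - c).max()

print("bandwidth before:", bandwidth(A), "| after RCM:", bandwidth(A_perm))

# SpMV on the reordered matrix; permute the input vector consistently
x = rng.standard_normal(2000)
y_perm = A_perm @ x[perm]
assert np.allclose(y_perm, (A @ x)[perm])
```

Reducing bandwidth clusters nonzeros near the diagonal, which improves cache reuse of the input vector during SpMV; whether that translates into wall-clock gains is exactly what the study measures across CPUs.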
Birds of a Feather
Artificial Intelligence & Machine Learning
Livestreamed
Recorded
TP
XO/EX
DescriptionAs AI and HPC workloads scale, traditional architecture faces critical memory and bandwidth limitations. Compute Express Link® (CXL®) offers a transformative solution, enabling low-latency, coherent communication across CPUs, GPUs, and memory devices. This session will explore how CXL 2.0 and 3.x support memory disaggregation and composable infrastructure to unlock scalable and flexible deployment of large models and simulations. Attendees will learn how memory pooling and sharing reduce overprovisioning, improve utilization, and lower costs. We invite system architects, researchers, hardware developers, and operators to discuss real-world CXL adoption, implementation challenges, and opportunities to reshape the next-generation AI and HPC systems.
Exhibitor Forum
Exascale
Livestreamed
Recorded
TP
XO/EX
DescriptionToday, HPC institutions are surpassing the once-theoretical milestone of storing an exabyte of data. With the growth of AI-driven workflows, more organizations now face the challenge of scaling storage sustainably, cost-effectively, and securely.
In this technical session, Matt Starr, Field CTO and Global HPC Lead at Spectra Logic, and Dan Stanzione, PhD, Executive Director at the Texas Advanced Computing Center (TACC), will explore the design and deployment of TACC’s new exascale tape system that uses two Spectra TFinity tape libraries, Versity ScoutAM software, and the latest LTO technology.
Attendees will gain insight into sectors driving exabyte-capacity needs, understand the unique considerations for scalable, energy-efficient storage, learn about a real-world case study of TACC’s tape-based storage solution, access free technical resources, and more.
Paper
Data Analytics, Visualization & Storage
Livestreamed
Recorded
TP
DescriptionScientific applications produce vast amounts of data, posing grand challenges for data management and analytics. Progressive compression is an approach to address this problem, as it allows for on-demand data retrieval with significantly reduced data movement cost. This work proposes HP-MDR, a high-performance and portable data refactoring and progressive retrieval framework for GPUs. Our contributions are fourfold: (1) we optimize the bit-plane encoding and lossless encoding to achieve high performance on GPUs; (2) we propose pipeline optimization to further enhance the performance for large data processing; (3) we leverage our framework to enable data retrieval with guaranteed error control for quantities-of-interest; (4) we evaluate HP-MDR using five datasets. Evaluations demonstrate HP-MDR achieves 13.68x and 6.31x average throughput improvement for refactoring and progressive retrieval, respectively. It also leads to 11.22x throughput for recomposing data under quantity-of-interest error control and 6.04x performance for the corresponding end-to-end data retrieval compared with state-of-the-art solutions.
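A hedged sketch of the bit-plane refactoring idea at the heart of progressive retrieval: quantized values are decomposed into planes from most to least significant bit, and reading more planes progressively tightens the error. The quantization step and data are toy assumptions; HP-MDR's GPU encoders and error control are considerably more involved:

```python
import numpy as np

data = np.random.default_rng(7).random(1024).astype(np.float32)
q = np.round(data / 2**-16).astype(np.uint32)          # uniform quantization, 16 fractional bits

# refactor into bit planes, most significant first
n_bits = 17
planes = [((q >> b) & 1).astype(np.uint8) for b in range(n_bits - 1, -1, -1)]

def retrieve(planes, k, n_bits):
    """Reconstruct using only the first k (most significant) bit planes."""
    q_hat = np.zeros_like(planes[0], dtype=np.uint32)
    for i in range(k):
        q_hat |= planes[i].astype(np.uint32) << (n_bits - 1 - i)
    return q_hat.astype(np.float32) * 2**-16

for k in (4, 8, 17):
    err = np.abs(retrieve(planes, k, n_bits) - data).max()
    print(f"{k:2d} planes -> max error {err:.2e}")
```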
Students@SC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Birds of a Feather
Applications
Livestreamed
Recorded
TP
XO/EX
DescriptionThis year continues the series of HPC and Cancer BoF sessions where the SC community gathers to share thoughts, insights, challenges, and opportunities in using HPC to ignite innovation and advance cancer research, improving outcomes of research and patients. Past BoFs have highlighted topics including data sharing, workforce development, digital twins, AI, drug discovery, and other areas where HPC plays a key role to impact cancer. This year's session will explore the topic of Cancer Team Data Science, taking the opportunity to discuss multiple collaborative topics while providing a key networking opportunity for those with an interest to impact cancer.
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
DescriptionThe HPC division at Los Alamos National Laboratory employs specialized consultants who function as a conduit between users and the computing environment. The division uses a legacy ticketing system that tracks user issues and interactions with consultants. With over 100,000 tickets and decades of interactions, there is a wealth of useful information within those tickets; however, the ticketing system makes it difficult to extract that information. The consultants want to know how this underutilized data can be used to better understand and serve their users. This project addresses their concerns by developing an AI-powered web application that provides a comprehensive analysis of consult tickets. The web application analyzes tickets with the inference provider SambaNova to deliver fast inference on open-source large language models. This approach readily identifies user sentiment and ticket trends, determines the most recurring issues users face, such as I/O bottlenecks, and provides support through multiple functions.
Workshop
Livestreamed
Recorded
TP
W
DescriptionSchedulers are critical for optimal resource utilization in high-performance computing. Traditional methods to evaluate schedulers are limited to post-deployment analysis or simulators, which do not model the associated infrastructure. In this work, we present a first-of-its-kind integration of scheduling and digital twins in HPC. This enables what-if studies to understand the impact of parameter configurations and scheduling decisions on the physical assets, even before deployment, or regarding changes not easily realizable in production. We (1) provide the first digital twin framework extended with scheduling capabilities, (2) integrate various top-tier HPC systems given their publicly available datasets, and (3) implement extensions to integrate external scheduling simulators. Finally, we show how to (4) implement and evaluate incentive structures, as well as (5) evaluate machine learning-based scheduling, in this novel digital-twin-based meta-framework for prototyping scheduling. Our work enables what-if scenarios of HPC systems to evaluate sustainability and the impact on the simulated system.
Birds of a Feather
Algorithms
Livestreamed
Recorded
TP
XO/EX
DescriptionGraph analytics is critical to scientific computing, artificial intelligence (AI), and national-scale data analysis. This BoF gathers the community developing high-performance systems for graph processing to discuss current capabilities, emerging challenges, and integration with graph databases, AI workflows, and scientific applications. We will explore both combinatorial and algebraic approaches, including updates from the GraphBLAS community. A key focus is identifying what capabilities—such as open, scalable graph toolchains and support for irregular workloads—require federal investment beyond what commercial vendors provide. The session will guide future research, software development, and funding priorities through expert discussion and broad community input.
HPC Ignites Plenary
Quantum & Other Post Moore Computing Technologies
Livestreamed
Recorded
TP
W
TUT
XO/EX
DescriptionJoin us for the SC25 HPC Ignites Plenary panel, “Why Should I Care About Quantum Computing?”, moderated by Dr. Charles Tahan, a visiting research professor at the University of Maryland, College Park, and former head of the U.S. National Quantum Coordination Office. This exciting session brings together thought leaders from universities, corporations — long established as well as startups — and government agencies to discuss the state of quantum computing and applications today. They will also look ahead to the short- and long-term impacts quantum computing will have on HPC and science in general.
Exhibitor Forum
Exascale
Livestreamed
Recorded
TP
XO/EX
DescriptionSupercomputing is hitting practical limits in power, cooling, and scale. Even hyperscale AI datacenters—often advertised as single, hundred-megawatt systems—are actually composed of multiple tightly coupled data centers. As a result, both the HPC and AI communities are increasingly focused on orchestrating complex, distributed workflows that span compute clusters, data systems, and edge facilities. While the workflows of HPC and hyperscale AI differ in implementation, they share many common demands on compute, data, and service infrastructure. Drawing on VAST Data’s work with leading HPC centers and AI providers, this talk explores concrete examples of how complex workflows are implemented in both domains. We will present key similarities between them, examine how hyperscale AI has built upon HPC-driven innovations, and suggest ways HPC can adjust its approach to complex workflows to leverage infrastructure advances from the AI ecosystem.
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionAt LabP2D/UDESC, we conduct HPC research in ML-based workflow scheduling, data center orchestration, and MPI optimization. This presentation highlights impactful results achieved through close collaboration between students and faculty.
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionThis talk highlights how an educational, training and research support HPC cluster supports postgraduate students from local institutions of higher learning, teaching of undergraduate and MSc courses, and researchers from local institutions. This HPC cluster was developed and provisioned through the Southern African Development Community (SADC) HPC Ecosystems Project. This talk will chronicle our journey, highlighting opportunities, regional collaborative efforts, and challenges.
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionWe share our experience launching the HPC Summer School in Colombia, from its start in 2017 to its current state, highlighting key lessons learned to support others aiming to create similar HPC education initiatives in developing regions.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe complexity of high performance computing (HPC) systems necessitates advanced techniques in system administration, configuration, and engineering and staff who are well versed on the best practices in this field. HPC systems professionals include system engineers, system administrators, network administrators, storage administrators, and operations staff who face problems unique to HPC systems. The ACM SIGHPC SYSPROS Virtual Chapter, the sponsor for this workshop, has been established to provide opportunities to develop and grow relationships focused specifically on the needs of HPC systems practitioners and to act as a support resource for them to help with the issues encountered in this specialized field.
This workshop is designed to share best practices for common HPC system deployment and maintenance, to provide a platform to discuss upcoming technologies, and to present the state-of-the-practice techniques that increase performance and reliability of systems, and in turn increase researcher and analyst productivity.
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionHigh performance computing is evolving faster than ever, driven by breakthroughs in GPU-accelerated software. Discover how NVIDIA’s top HPC CUDA Libraries are powering the world’s leading scientific and engineering solvers—from CFD simulations to AI-driven drug discovery. We’ll highlight the most downloaded and fastest-growing libraries enabling next-generation HPC performance, introduce the top Enterprise NIMs like Nemotron Nano VLM, and show how you can easily deploy and scale these workloads on Oracle Cloud Infrastructure (OCI) with NVIDIA AI Enterprise.
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionThis session delves into the critical challenges and advanced solutions for managing power demand in high performance computing (HPC) environments, with a specific focus on intelligent rack power distribution. We explore the dynamic and often unpredictable power consumption patterns of modern HPC workloads, characterized by rapid transients, burstiness, and significant power spikes during events like job checkpointing and execution.
Workshop
Education & Workforce Development
Livestreamed
Recorded
TP
W
DescriptionWe present a proof-of-concept system for automating quality assurance (QA) in the HPC-ED federated training catalog using large language models (LLMs). The HPC-ED project aggregates metadata for training resources from multiple partner catalogs, improving discoverability for the high-performance computing (HPC) and cyberinfrastructure (CI) communities. While metadata publication processes have matured, QA remains largely manual, making it difficult to maintain accuracy and relevance at scale.
We have created an agent that uses commercial AI API calls to evaluate user-submitted metadata and return a quality score, helping content providers improve their submissions. The agent bases its score on a combination of the submitted catalog metadata and an AI-generated summary of a low-level crawl of the content item, with automated extraction of embedded content such as YouTube video transcripts.
We evaluated this agent on the HPC-ED Beta Catalog using four OpenAI models—GPT-3.5 Turbo, GPT-4o Mini, GPT-4.1 Nano, and GPT-4.1 Mini.
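As a rough illustration of the scoring pattern described above, the sketch below issues a single chat-completion call; the prompt wording, rubric, and helper name score_metadata are placeholders rather than the project's actual agent, and the model name is only one of those the abstract mentions.
```python
# Minimal sketch of an LLM-based metadata quality check (illustrative, not the HPC-ED agent).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_metadata(metadata: dict, crawl_summary: str) -> str:
    # Combine the catalog metadata with a summary of the crawled content, as described above.
    prompt = (
        "Rate the quality of this training-catalog metadata from 0 to 100 and explain briefly.\n\n"
        "Metadata:\n" + repr(metadata) +
        "\n\nSummary of the crawled content:\n" + crawl_summary
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder choice; the project evaluated several models
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(score_metadata({"title": "Intro to MPI", "keywords": ["MPI"]},
                     "A 90-minute recorded lecture with slides and exercises."))
```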
Paper
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
Livestreamed
Recorded
TP
DescriptionLarge reasoning models (LRMs) are becoming increasingly popular as they offer advanced capabilities in logical inference, mathematical reasoning, and knowledge synthesis, even beyond those of standard language models. However, their complex training workflows present significant challenges in reproducibility, efficiency, and system-level optimization. This paper introduces HPC-R1, a comprehensive characterization of LRM training on a modern HPC cluster. We analyze all major stages, including supervised fine-tuning (SFT), Group Relative Policy Optimization (GRPO)-based reinforcement learning (RL), autoregressive generation, and distillation using customized state-of-the-art frameworks. Our detailed performance analysis reveals key system scaling behaviors. We find that GRPO-based reinforcement learning training is heavily communication-bound, with over 90% of GPU time spent in non-compute operations, and that SFT achieves stable GPU throughput near 9.8 TFLOPs. We also observe inference pipeline imbalance, where the performance gap between ranks can reach 64%. Based on these findings, we present recommendations to guide future AI-HPC system design.
Paper
Algorithms
Livestreamed
Recorded
TP
DescriptionStencil computations are fundamental to various HPC and intelligent computing applications, often consuming significant execution time. The emergence of specialized matrix units presents new opportunities to accelerate stencil computations. While scalable matrix compute units provide substantial computing horsepower, prior efforts fail to fully utilize the computing capabilities for stencils due to suboptimal matrix-unit utilization, limited instruction-level parallelism, and low cache hit rates. This paper introduces HStencil, a novel stencil computing framework utilizing matrix and vector units. HStencil addresses these challenges through three contributions: 1) microkernels that jointly leverage matrix and vector units to enhance hardware utilization; 2) fine-grained instruction scheduling with interleaved execution to enhance instruction-level parallelism; and 3) spatial prefetch to sustain high performance when working sets exceed cache capacity. Evaluations on representative benchmarks demonstrate that HStencil achieves maximum speedups of 1.81x–5.76x over auto-vectorization across different CPU platforms, and delivers 31%–91% higher performance versus state-of-the-art methods.
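For readers less familiar with the term, the toy sketch below shows a classic 5-point Jacobi stencil written with plain NumPy slicing; it illustrates the computational pattern being accelerated and is not HStencil's matrix-unit microkernels.
```python
# Illustrative 5-point stencil sweep (NumPy slicing for clarity, not performance).
import numpy as np

def jacobi_step(u):
    """Each interior cell becomes the average of its four neighbors."""
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:])
    return v

u = np.zeros((128, 128), dtype=np.float64)
u[0, :] = 1.0                 # fixed hot boundary on one edge
for _ in range(200):          # repeated sweeps: the loop stencil frameworks optimize
    u = jacobi_step(u)
print(u[1, 64])               # value just below the hot edge
```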
Workshop
Livestreamed
Recorded
TP
W
DescriptionAndrea will present her experiences working with Digital Twin technologies and the challenges faced and lessons learned. Dr. Townsend-Nicholson has an extensive track record of innovation in education and research, with a demonstrable and sustained ability to provide strategic vision and leadership. She delivers institutional, organisational and funder objectives with the support of colleagues and collaborators, whilst embracing fairness and equality, inclusion and opportunity.
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionParticipants perform independent tasks while contributing to a shared team objective in the Orchard scene during the Hummingbird VR performance.
Workshop
Livestreamed
Recorded
TP
W
DescriptionSupercomputing centers exist to drive scientific discovery by supporting researchers in computational science fields. To make users more productive in the complex HPC environment, HPC centers employ user support teams. These teams serve many roles, from setting up accounts, to consulting on math libraries and code optimization, to managing HPC software stacks. Often, support teams struggle to adequately support scientists. HPC environments are extremely complex, and combined with the complexity of multi-user installations, exotic hardware, and maintaining research software, supporting HPC users can be extremely demanding.
With the twelfth HPC User Support Tools (HUST) workshop, we continue to provide a necessary forum for system administrators, user support team members, tool developers, policy makers, and end users. We provide a forum to discuss support issues and we provide a publication venue for current support developments. Scope includes best practices, user support tools, and ideas to streamline user support at supercomputing centers.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe present a hybrid GPU programming curriculum that allows students to choose between Python (Numba) and C++ (CUDA), supporting language flexibility in a distance-learning context. The dual-language design aims to engage both students familiar with high-level languages and those experienced in C/C++. The course was evaluated with 19 computer science students at Bachelor’s and Master’s levels. A pre-course survey assessed prior knowledge in programming and related tools. Students generally preferred the language they were more familiar with, and performance correlated with this prior experience. Although C/C++ users achieved slightly higher scores, regression analysis indicates that differences were largely due to prior knowledge, not language choice. Finally, we analyze Python-specific pitfalls, including boundary errors, type mismatches in shared memory, and inefficient data transfers. These subtle issues often led to correctness or performance problems. We conclude with teaching recommendations to support Pythonic GPU learning and help students avoid common mistakes.
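To make the pitfalls above concrete, here is a minimal Numba CUDA sketch (kernel names, sizes, and data are illustrative, not taken from the course): the first kernel shows the boundary guard whose omission causes out-of-bounds writes, and the second shows an explicitly typed shared-memory tile.
```python
# Illustrative Numba CUDA kernels; requires a CUDA-capable GPU to run.
import numpy as np
from numba import cuda, float32

TPB = 128  # threads per block (illustrative)

@cuda.jit
def scale(out, x, alpha):
    i = cuda.grid(1)
    if i < x.size:              # boundary guard: extra threads would otherwise write past the end
        out[i] = alpha * x[i]

@cuda.jit
def block_sum(out, x):
    # Shared memory needs an explicit dtype; pairing float64 host data with a float32
    # tile is the kind of silent type mismatch the course calls out.
    tile = cuda.shared.array(TPB, dtype=float32)
    i = cuda.grid(1)
    t = cuda.threadIdx.x
    tile[t] = x[i] if i < x.size else 0.0
    cuda.syncthreads()
    if t == 0:
        s = 0.0
        for k in range(TPB):
            s += tile[k]
        out[cuda.blockIdx.x] = s

x = np.arange(1024, dtype=np.float32)
d_x = cuda.to_device(x)                   # one explicit transfer instead of repeated implicit copies
d_out = cuda.device_array_like(d_x)
blocks = (x.size + TPB - 1) // TPB
scale[blocks, TPB](d_out, d_x, np.float32(2.0))
d_partial = cuda.device_array(blocks, dtype=np.float32)
block_sum[blocks, TPB](d_partial, d_x)
print(d_out.copy_to_host()[:4], d_partial.copy_to_host().sum())
```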
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionThe prefill phase of large language model (LLM) inference, where the input prompt is processed to generate a key-value (KV) cache, is a critical latency bottleneck for input sequences. Existing serving architectures face a trade-off: data parallelism (DP) offers flexibility but cannot accelerate a single long prompt, while tensor parallelism (TP) parallelizes prefill but at the cost of rigid resource allocation and constant communication overhead at each layer. We introduce HydraCache, a system that resolves this problem by enabling a cluster of independent, data-parallel model replicas to collaborate on-demand to parallelize the prefill of a single long prompt. Our core contribution is DistBlendAttention, a lightweight mechanism that fuses distributed KV caches with minimal communication, avoiding the prohibitive overheads of both TP and traditional sequence parallelism. Our evaluation shows that HydraCache significantly reduces Time-to-First-Token (TTFT) up to 7x for requests and enables flexible, SLO-aware serving.
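For intuition, the sketch below shows a standard log-sum-exp way to combine attention computed over disjoint KV-cache partitions with only a few extra scalars of communication; it is illustrative only and is not the paper's DistBlendAttention mechanism.
```python
# Merging attention over disjoint KV partitions (single-head, NumPy, for intuition only).
import numpy as np

def partial_attention(q, K, V):
    """Attention over one KV partition: unnormalized output, normalizer, and max score."""
    s = K @ q
    m = s.max()
    w = np.exp(s - m)
    return w @ V, w.sum(), m

def merge(parts):
    """Log-sum-exp merge of per-partition results into the exact full-attention output."""
    m = max(p[2] for p in parts)
    num = sum(np.exp(pm - m) * o for o, _, pm in parts)
    den = sum(np.exp(pm - m) * l for _, l, pm in parts)
    return num / den

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((32, 8))
V = rng.standard_normal((32, 8))

o, l, _ = partial_attention(q, K, V)
ref = o / l                                            # attention over the full KV cache
split = merge([partial_attention(q, K[:16], V[:16]),   # two disjoint KV partitions,
               partial_attention(q, K[16:], V[16:])])  # e.g. held by different replicas
print(np.allclose(ref, split))                         # True
```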
Paper
HPC for Machine Learning
State of the Practice
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionLarge models are evolving towards massive scale, diverse model architectures (dense and sparse), and long-context processing, which makes it very challenging to efficiently scale large models on parallel machines. The current widely-used parallelization strategies are often sub-optimal due to their limited parallelization strategy space. Therefore, we propose Hypertron, a scalable parallel large-model training framework which incorporates an unprecedented high-dimensional (up to 7D) parallelization space, a holistic scheme for efficient dimension fusion, and a comprehensive performance model to guide the high-dimensional exploration. By exploiting the high-dimensional space to discover the optimal strategy not supported by existing frameworks, Hypertron significantly reduces memory and communication cost while improving parallel scalability. Extensive evaluations demonstrate that Hypertron achieves up to 56.7% Model FLOPs Utilization (MFU) on 2,048 new-generation Ascend NPU accelerators (with supernodes) for different large models (such as sparse 141B and dense 310B), with 1.33x speedup over the best configuration of the state-of-the-art frameworks.
Paper
Algorithms
HPC for Machine Learning
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionGeneral matrix-matrix multiplication (GEMM) is a core operation in both deep learning and scientific applications. However, as modern GPUs continue to scale in compute capability and adopt larger tile sizes, the wave quantization problem becomes increasingly unavoidable. Existing solutions either exhibit low execution efficiency or introduce additional synchronization overhead.
To address these challenges, we propose HyTiS, a hybrid tile scheduling framework that integrates two-level tile scheduling with adaptive tile layout selection. To enable this with minimal tuning overhead, throughput- and latency-oriented micro-kernels are identified during an offline profiling phase, forming an efficient runtime search space. Additionally, we investigate the impact of tile layouts on L2 cache and introduce an analytical model to select optimal layouts that minimize traffic from DRAM to the L2 cache at the wave granularity. Extensive evaluations on NVIDIA H100 and A100 demonstrate that HyTiS significantly outperforms cuBLAS, achieving speedups of up to 1.95x and 2.08x, respectively.
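To see why wave quantization matters, the back-of-the-envelope sketch below (SM count and tile sizes are illustrative, not the paper's) computes how a partially filled last wave of output tiles depresses utilization.
```python
# Wave quantization arithmetic for a tiled GEMM (illustrative numbers only).
import math

def wave_utilization(M, N, tile_m, tile_n, num_sms):
    """Fraction of SM slots doing useful work when output tiles are launched in waves."""
    tiles = math.ceil(M / tile_m) * math.ceil(N / tile_n)
    waves = math.ceil(tiles / num_sms)
    return tiles / (waves * num_sms)

# Hypothetical GEMM output shapes, 128x128 tiles, 132 SMs (H100-class)
for m, n in [(4096, 4096), (1152, 1152)]:
    print(f"{m}x{n}: wave utilization = {wave_utilization(m, n, 128, 128, 132):.0%}")
    # 4096x4096 fills its waves well (~97%); 1152x1152 leaves one wave mostly idle (~61%)
```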
Workshop
Livestreamed
Recorded
TP
W
Workshop
Livestreamed
Recorded
TP
W
Workshop
Livestreamed
Recorded
TP
W
DescriptionMotivated by the properties of (unending) real-world cybersecurity streams, we summarize/sketch an algorithm to maintain a streaming graph and its connected components at single-edge granularity: every edge insertion could be followed by a query related to connectivity. To the best of our knowledge, this is the first streaming graph system to update at that granularity. In cybersecurity graph applications, the input stream typically consists of edge insertions; individual deletions are not explicit. Analysts will maintain as much history as possible and will trigger bulk deletions when necessary. During a bulk deletion (called "aging") and the associated data-structure repairs, queries are disabled, but the system properly ingests any new edges that arrive. We briefly describe the (distributed parallel) algorithms. We present a (proved) relationship among four quantities for indefinite operation: the proportion of query downtime allowed, the proportion of edges that survive an aging event, the proportion of duplicated edges, and the bandwidth expansion factor. The latter is how much faster processors must communicate with each other than the stream arrival rate. We will also present some experimental results on Intel Skylake processors. This algorithm might be of increased interest now with the arrival of systems like Cerebras with extremely fast on-wafer networking.
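The single-machine sketch below shows the insert-then-query pattern at single-edge granularity using a plain union-find; the distributed, aging-capable data structures described in the talk are far more involved, and this sketch handles no deletions at all.
```python
# Single-edge-granularity connectivity over a stream of edge insertions (toy sketch).
class DSU:
    def __init__(self):
        self.parent = {}

    def find(self, v):
        self.parent.setdefault(v, v)
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def connected(self, a, b):
        return self.find(a) == self.find(b)

dsu = DSU()
stream = [("a", "b"), ("c", "d"), ("b", "c")]
for u, v in stream:                  # every insertion may be followed by a connectivity query
    dsu.union(u, v)
    print(u, v, "->", dsu.connected("a", "d"))   # becomes True once the components join
```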
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe mantra of parallel algorithms is "minimize communication." We counter this view by showing instances where communicating more, not less, saves time. While these examples come primarily from applications in graph analytics with fine-grained communication and irregular parallelism, there are also instances in dense matrix computation where a similar conclusion may hold, at least in theory. The key technique is asynchronous, aggressively overlapped communication. A question I will pose is whether overlapping is merely a performance engineering concern, or whether there is anything algorithmically deeper about it “under the hood.”
Workshop
Livestreamed
Recorded
TP
W
Workshop
Livestreamed
Recorded
TP
W
DescriptionDue to the heterogeneous datasets they process, data-intensive applications employ diverse methods and data structures, exhibiting irregular data accesses, control flows, and communication patterns. Modern data analytics applications additionally require supporting dynamic data structures, asynchronous control flows, and mixed parallel programming models. Supercomputing systems are organized around software and hardware optimized for data locality and bulk synchronous computations. Managing irregular behaviors requires a substantial programming effort and lacks integration, leading to poor performance. Holistic solutions to these challenges emerge only by considering the problem from multiple perspectives: from micro- to system architectures, from compilers to languages, from libraries to runtimes, and algorithm design to data characteristics. Only collaborative efforts among researchers with different expertise, including domain experts and end users, can lead to significant breakthroughs. This workshop brings together scientists from different backgrounds to discuss methods and technologies for efficiently supporting irregular applications in current and future architectures.
Students@SC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionAligned with SC25's "HPC Ignites" theme, this résumé course empowers students to showcase their contributions and potential in the dynamic field of high performance computing. Participants will learn strategies to highlight technical expertise, innovative projects, and collaborative achievements that spark interest from leading employers in HPC.
Workshop
Livestreamed
Recorded
TP
W
DescriptionModern HPC platforms increasingly adopt NUMA architectures, where the OpenMP task-based programming model is a standard for enabling dynamic parallelism. However, the default OpenMP runtime is topology-agnostic, and the existing affinity policies are insufficient to ensure optimal performance on modern NUMA architectures. This lack of topology awareness results in suboptimal data locality and performance degradation. Additionally, the current OpenMP standard lacks mechanisms for detecting and mitigating the interference between concurrently executing tasks, further exacerbating the performance degradation. To enhance the performance of OpenMP task-based applications on NUMA architectures, we propose the ILAN scheduler: an interference- and locality-aware scheduler that employs moldability to dynamically minimize interference, combined with hierarchical scheduling for improved data locality. We implement ILAN as an extension of the LLVM OpenMP runtime. The results on a 64-core AMD Zen 4 platform show that ILAN achieves an average speedup of 13.2%, and a maximum speedup of 45.8%, compared to the default scheduler.
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
XO/EX
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionASCRIBE-VR transforms complex scientific datasets into immersive, explorable worlds. Running AI-driven segmentation on HPC systems, it isolates regions of interest from X-ray, CT, MRI, and electron microscopy 3D imaging, reshaping them into tangible virtual forms. Accessible through Meta Quest, this VR laboratory turns once-hidden structures into navigable spaces, with objects to walk through, examine, and interpret. By merging computational precision with embodied exploration, ASCRIBE-VR invites viewers to step inside the architecture of science itself.
Workshop
Livestreamed
Recorded
TP
W
DescriptionHigh-performance computing (HPC) datacenters must simultaneously support real-time data streams with sub-millisecond latency and bulk transfers requiring sustained multi-gigabit throughput—demands that compete for the same network resources. End-to-end performance guarantees are therefore essential, typically delivered through Quality of Service (QoS) mechanisms that classify traffic, reserve bandwidth, and enforce priorities across all network hops. While backbone and wide-area network providers already implement QoS, the local Ethernet ingress “last-mile” inside HPC facilities generally remains best-effort, creating a critical blind spot where latency builds and time-sensitive workflows can suffer. We address this gap with a standards-based Differentiated Services Code Point (DSCP) QoS configuration on existing leaf–spine switches: packets are marked at the host, queued per traffic class, and shaped on every hop through to the high-speed network (HSN) gateway NIC. Experiments on both intra-domain and inter-domain traffic show up to 60 percent more stable throughput and 30 percent fewer retransmissions, without hardware upgrades.
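As a hedged illustration of host-side marking, the snippet below sets the Expedited Forwarding code point on a Linux TCP socket; the address, port, and choice of class are placeholders, and the paper's full configuration also covers per-class queuing and shaping on the switches.
```python
# Host-side DSCP marking on Linux (illustrative values; not the paper's exact scheme).
import socket

DSCP_EF = 46                 # Expedited Forwarding, typically used for latency-sensitive traffic
tos = DSCP_EF << 2           # DSCP occupies the upper 6 bits of the ToS byte

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
# Packets sent on this socket now carry the EF code point, so switches configured with
# per-class queues can prioritize them on every hop toward the HSN gateway.
sock.connect(("192.0.2.10", 5001))   # placeholder endpoint (TEST-NET-1 address)
```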
Workshop
Livestreamed
Recorded
TP
W
DescriptionModern-day supercomputers are massively parallel, heterogeneous systems, many of which employ graphics processing units (GPUs) to accelerate applications. While C/C++ and, increasingly, Python gain traction in the high-performance computing (HPC) domain, Fortran continues to have a large developer base, with new high-performance code written every day. The OpenMP application programming interface (API) is a key ingredient in providing multi-threading and support for offloading execution to GPUs in HPC applications. AMD is developing the AMD Next Generation Fortran compiler, which will eventually replace the existing AMD Fortran compiler based on the Classic Flang compiler. This paper describes the general compilation pipeline of the AMD Next Generation Fortran Compiler. It shows how the compiler generates code for OpenMP target directives and their map clauses. The paper closes with a discussion of transformations in the intermediate representation, such as implementing DO CONCURRENT using OpenMP intermediate code.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis paper presents an example technique for running interactive and AI workloads alongside traditional high-performance computing jobs. It explains how to use a short maximum runtime for all jobs and a resource-limited, high-priority QOS to allow interactive jobs to start quickly without impacting system capacity for traditional HPC jobs. It includes background on the history and culture that contributed to the implementation and technique. The reference implementation, technique, and paper are Slurm-centric in terminology, but the scheduling concepts and methods will translate to other implementations. The paper concludes with observations on the conditions that made these techniques effective and possible areas of future work to make them more broadly applicable.
Workshop
Livestreamed
Recorded
TP
W
DescriptionAs Moore's Law slows, superconducting electronics offer ultra-low-power, high-speed computation potential. This paper presents the first full-system superconducting architecture modeling in gem5, evaluating superconducting cores, caches, and interconnects under realistic workloads. We extend gem5 with cryogenic semiconductor (4 GHz) and superconducting (100 GHz) RISC-V cores and multi-level caches, evaluating RISC-V benchmarks and SPEC CPU2006 applications. We also integrate SRNoC, a superconducting interconnect, with the NOVA graph accelerator.
Results show superconducting cores and caches achieve up to 24x speedup for compute-intensive workloads, but memory-intensive applications are bottlenecked by room-temperature DRAM (1.2x improvement). High cache bandwidth requirements (800 GB/s) present design challenges. SRNoC provides 35-73x energy efficiency gains for narrow data paths but 1246x slowdown for wide data communication. Therefore, superconducting technology suits domain-specific accelerators better than general-purpose computing, with performance dependent on workload memory access patterns and data widths.
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
DescriptionHigh performance computing (HPC) applications generate massive volumes of data, placing sustained pressure on parallel file systems (PFS) that face limited bandwidth and resource contention. While file-per-process I/O allows lock-free access, reducing stripe contention, it creates excessive metadata overhead and poor manageability at scale. Aggregation—consolidating output from many processes into fewer shared files—helps mitigate these issues, but introduces new challenges related to concurrency, resource contention, complex I/O patterns, and their interactions with heterogeneous storage devices.
We identify and evaluate key I/O bottlenecks across these dimensions. To support system-level tuning, we introduce a lightweight OpenMP benchmark that helps users identify optimal aggregation parameters, and we find that interleaved, append-only I/O provides better performance when aggregating to a shared file. From this work, we present a novel, producer-consumer-based aggregation model designed to balance concurrency and resource usage efficiently. In microbenchmarks, our strategy achieved up to 2× higher write throughput than GIO and 1.6× higher than ADIOS2. In a real-world HPC application (HACC), it delivered 1.2× higher throughput with only 3% checkpoint overhead—compared to ~12% for GIO, which is optimized for HACC. Finally, we demonstrate the limitations of existing checkpointing approaches using DeepSpeed Megatron on the BLOOM 3B model, revealing significant inefficiencies during the restore phase due to excessive reads and seeks.
Future work will extend our aggregation framework for large language model (LLM) C/R, which introduces highly concurrent, small, and random I/O patterns that pose new challenges for traditional PFS architectures.
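The toy sketch below captures the producer-consumer shape of the aggregation model in a single process (thread counts, chunk contents, and file paths are invented); the actual framework targets MPI ranks writing to a parallel file system.
```python
# Producer-consumer aggregation to one append-only shared file (illustrative only).
import os
import queue
import tempfile
import threading

chunks = queue.Queue(maxsize=8)      # bounded queue throttles producers
N_PRODUCERS = 4
SENTINEL = None

def producer(rank):
    for step in range(3):
        chunks.put(f"rank{rank} step{step}\n".encode())
    chunks.put(SENTINEL)             # signal that this producer is done

def aggregator(path):
    done = 0
    with open(path, "ab") as f:      # single writer: interleaved, append-only writes
        while done < N_PRODUCERS:
            item = chunks.get()
            if item is SENTINEL:
                done += 1
                continue
            f.write(item)

path = os.path.join(tempfile.mkdtemp(), "aggregated.out")
producers = [threading.Thread(target=producer, args=(r,)) for r in range(N_PRODUCERS)]
agg = threading.Thread(target=aggregator, args=(path,))
for t in producers:
    t.start()
agg.start()
for t in producers:
    t.join()
agg.join()
print(open(path).read())
```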
Paper
Algorithms
Livestreamed
Recorded
TP
DescriptionSparse matrix-sparse matrix multiplication (SpGEMM) is a key kernel in many scientific applications and graph workloads. Significant work has been devoted to developing row reordering schemes towards improving locality in sparse operations, but prior studies mostly focus on the case of sparse-matrix vector multiplication (SpMV).
In this paper, we address these issues with hierarchical clustering for SpGEMM that leverages both row reordering and cluster-wise computation to improve reuse in the B matrix with a novel row-clustered matrix format and access pattern in the left-hand side matrix. We find that hierarchical clustering can speed up SpGEMM by 1.39× on average with low preprocessing cost.
Additionally, this paper sheds light on the role of both row re-ordering and clustering for SpGEMM with a comprehensive empirical study of the effect of 10 different reordering algorithms and three clustering schemes on SpGEMM performance on a suite of 110 matrices.
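For readers unfamiliar with row-wise (Gustavson-style) SpGEMM, the dictionary-based sketch below shows why rows of A that touch similar columns re-read the same rows of B, which is the reuse that reordering and clustering try to improve; it is purely illustrative and not the paper's clustered format.
```python
# Gustavson-style row-wise SpGEMM on dictionary-of-rows matrices (for intuition only).
def spgemm(A, B):
    # A, B: dict mapping row index -> {column index: value}
    C = {}
    for i, a_row in A.items():
        acc = {}
        for k, a_ik in a_row.items():               # each nonzero A[i, k]
            for j, b_kj in B.get(k, {}).items():    # streams row k of B
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        C[i] = acc
    return C

A = {0: {0: 1.0, 2: 2.0},      # rows 0 and 1 share columns {0, 2},
     1: {0: 3.0, 2: 1.0}}      # so they reuse the same two rows of B
B = {0: {1: 4.0},
     2: {1: 5.0, 3: 6.0}}
print(spgemm(A, B))            # {0: {1: 14.0, 3: 12.0}, 1: {1: 17.0, 3: 6.0}}
```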
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe lifetime of electronic devices has a critical impact on their environmental footprint. In addition, the high demand for GPUs from AI companies has greatly reduced their availability for supercomputing centers. Consequently, extending the lifetime of CPUs and GPUs is becoming a major issue in the high-performance computing (HPC) domain.
This paper investigates how to optimize a machine's usage before a fatal failure and the associated trade-offs with performance. The lifetime of computing devices is strongly connected to temperature and thus to the running frequency. We investigate node frequency reconfiguration to optimize HPC usage, and we estimate the benefit of a dedicated scheduling algorithm compared with running at a constant frequency.
We show that a correct decision can considerably increase the number of FLOPs a machine performs over its lifetime, with a trade-off in terms of performance. Because aging models are currently inaccurate, we consider different models and discuss the robustness of our algorithms to inaccuracy.
Tutorial
Livestreamed
Recorded
TUT
DescriptionThe past few years have witnessed an increased level of support for and deployment of programmable network adapters, known as "SmartNICs." These enhanced network devices offer standard packet processing capabilities as well as advanced "in-network" computing features built around programmable lightweight processing cores, FPGAs, and even CPU- and GPU-based platforms capable of running separate operating systems. SmartNICs have gained rapid adoption for data center tasks, including infrastructure management, packet filtering, and I/O acceleration. Increasingly, these devices are also being explored for high performance computing (HPC) and AI application acceleration. This tutorial offers an in-depth exploration of the state of the art for SmartNICs and the emerging software ecosystems supporting them. Attendees will engage in hands-on exercises to better understand how to take advantage of SmartNICs for accelerating HPC and AI applications. Specific topics include MPI and OpenMP offloading, algorithmic modifications to utilize SmartNIC processors, in-line packet processing frameworks like P4, security and containerization efforts, and I/O acceleration techniques. Participants will have the opportunity to execute these exercises using cutting-edge SmartNICs like NVIDIA’s BlueField-3 data processing unit (DPU) and a cloud-based netlab environment. Tutorial presenters will discuss additional techniques for optimizing applications to harness SmartNICs as communication accelerators in HPC systems.
SCinet Network Research Exhibition
Not Livestreamed
Not Recorded
DescriptionNo NRI.
Workshop
AI, Machine Learning, & Deep Learning
Clouds & Distributed Computing
Performance Evaluation, Scalability, & Portability
Scientific & Information Visualization
Partially Livestreamed
Partially Recorded
TP
W
DescriptionCoupled AI-simulation workflows are becoming major workloads for HPC facilities, and their increasing complexity necessitates new tools for performance analysis and for prototyping new in-situ workflows. We present SimAI-Bench, a tool designed to both prototype and evaluate these coupled workflows. In this paper, we use SimAI-Bench to benchmark the data transport performance of two common patterns on the Aurora supercomputer: a one-to-one workflow with co-located simulation and AI training instances, and a many-to-one workflow where a single AI model is trained from an ensemble of simulations. For the one-to-one pattern, our analysis shows that node-local and DragonHPC data staging strategies provide excellent performance compared to Redis and the Lustre file system. For the many-to-one pattern, we find that data transport becomes a dominant bottleneck as the ensemble size grows. Our evaluation reveals that the file system is the optimal solution among the tested strategies for the many-to-one pattern.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionSpiking neural networks (SNNs) are a promising alternative to conventional artificial neural networks (ANNs) due to their biological interpretability and capability to exploit sparse computation. Specialized hardware for SNNs has advantages over general-purpose devices in terms of power and performance. However, the computational requirements of modern spiking convolutional neural networks (SCNNs) render most SNN hardware inefficient for SCNN acceleration. Therefore, we present IncineRate, a flexible FPGA-based SCNN accelerator architecture. IncineRate has built-in support for many SCNNs, such as AlexNet, VGG16, and ResNets, and can be extended to support other network models. The number of simulation time steps, the network architecture, and other settings are specified at run time, allowing an already deployed device to execute multiple networks without reconfiguration. Our results show that IncineRate achieves state-of-the-art classification accuracy among FPGA-based SCNNs on CIFAR10 and CIFAR100.
Workshop
Livestreamed
Recorded
TP
W
DescriptionAnalyzing large-scale scientific datasets presents substantial challenges due to their sheer volume, structural complexity, and the need for specialized domain knowledge. Automation tools, such as PandasAI, typically require full data ingestion and lack context on the full data structure, making them impractical as intelligent data analysis assistants for datasets at the terabyte scale. To overcome these limitations, we propose InferA, a multi-agent system that leverages large language models to enable scalable and efficient scientific data analysis. At the core of the architecture is a supervisor agent that orchestrates a team of specialized agents responsible for distinct phases of data retrieval and analysis. The system engages interactively with users to elicit their analytical intent and confirm query objectives, ensuring alignment between user goals and system actions. To demonstrate the framework's usability, we evaluate the system using ensemble runs from the HACC cosmology simulation, which comprise several terabytes of data.
Workshop
Livestreamed
Recorded
TP
W
DescriptionIn this paper, we propose inferCT, an efficient framework that enables 3D deep learning for computed tomography (CT) during inference. Our baseline approach addresses the problem of volumes that exceed GPU memory by partitioning CT volumes into cubic sub-volumes that fit into GPU memory and distributing them across multiple GPUs. Building on this, we introduce further vendor-agnostic optimizations, including a lock-free shared-memory data structure to reduce synchronization overhead, pipelined execution to hide data prefetching and post-processing latency, and a parallel data loader to improve I/O efficiency. Results on both AMD and NVIDIA GPUs show that our optimized framework achieves speedups of 1.97× and 2.32× over the baseline for the 1024³ and 4096³ datasets, respectively. For the scalability tests, experiments demonstrate strong scaling efficiencies of 89.25% and 75.75% when scaling from 1 to 4 GPUs within a single NUMA node, and from 1 to 8 GPUs across two NUMA nodes, respectively, using the 4096³ dataset.
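A minimal sketch of the baseline partitioning step is shown below (array shape and cube size are invented); the real pipeline additionally distributes sub-volumes across GPUs and overlaps prefetching, inference, and post-processing as described above.
```python
# Partitioning a 3D volume into cubic sub-volumes that fit in GPU memory (toy sketch).
import numpy as np

def iter_subvolumes(volume, cube=128):
    """Yield the origin and data of each cubic sub-volume in scan order."""
    Z, Y, X = volume.shape
    for z in range(0, Z, cube):
        for y in range(0, Y, cube):
            for x in range(0, X, cube):
                yield (z, y, x), volume[z:z + cube, y:y + cube, x:x + cube]

vol = np.zeros((256, 256, 256), dtype=np.float32)    # stand-in for a CT volume
blocks = list(iter_subvolumes(vol))
print(len(blocks), blocks[0][1].shape)                # 8 cubes of 128^3 for this shape
```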
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionThe increasing scale and complexity of scientific experiments have led to a growing need for efficient and scalable machine learning model inference serving systems. High-energy physics experiments and simulations of complex climate models involve petabytes of data and massive amounts of computational resources to produce accurate results. Thus, scientists are increasingly turning to ML techniques to analyze and interpret the vast amounts of data generated by these experiments.
However, the deployment of ML models in scientific applications poses significant challenges. Traditional approaches to deploying ML models by individual users with local resources or small clusters often suffer from long startup costs and inefficient resource utilization. To address this challenge, we present a prototyped system that provides on-demand inference serving capabilities for multiple scientific ML models. Our system is deployed across the NERSC Perlmutter supercomputer and the NERSC K8s cluster, enabling on-demand scalability.
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionThe image shows a fluid dynamics visualization of a periodic channel flow performed by x3d2, a newly developed high-fidelity CFD solver with CPU and GPU backends.
A channel flow is a ubiquitous test case for analyzing turbulence development and scales. The flow is bounded between two infinitely large planes and all the other directions are periodic, meaning the flow is recirculated once it leaves the computational domain. After an initial perturbation, all the scales of turbulence develop and are visible. In this image we highlight one of the infinitely long planes and the velocity field is displayed.
Paper
State of the Practice
Livestreamed
Recorded
TP
DescriptionHigh Performance LINPACK (HPL) remains the primary benchmark for evaluating supercomputing performance. It includes many parts with substantial internal complexity, and its performance is affected by a large number of parameters that interact in ways that are difficult to predict on large-scale heterogeneous supercomputer systems.
We present a comprehensive performance analysis of HPL on Frontier, the world's first exascale supercomputer, which achieved HPL performance of 1.35 exaflops. Through empirical parameter tuning, detailed modeling, and comparative evaluation, we uncover critical performance insights, share lessons learned, and outline best practices for effective parameter tuning on exascale systems.
We introduce and evaluate two novel PDFACT strategies: a dedicated-thread (DT) variant and a GPU-based variant (GPUPDFACT) implementation using HIP cooperative groups, demonstrating that GPU-based factorization outperforms conventional CPU-based PDFACT on Frontier's architecture.
Our findings establish key performance factors for HPL on exascale systems and offer valuable guidance for future high-performance computing and benchmarking efforts.
Birds of a Feather
Community Meetings
Livestreamed
Recorded
TP
XO/EX
DescriptionScience and computing are increasingly integrated, as large-scale scientific and computing challenges take on a national, and even international, scale. The need for resources at multiple facilities may be driven by access to site-specific hardware, security policy, or to ensure resilient operations. Deeper integration between facilities can create efficiencies for scientists, funding agencies, and the facilities themselves, but also exposes site incompatibilities in both technology and culture.
In this BoF we bring together seasoned experts in integrating supercomputing resources across institutions to discuss the challenges and opportunities of creating and managing the frameworks (political and technical) needed to integrate HPC.
Workshop
Livestreamed
Recorded
TP
W
DescriptionIn this Integrated Research Infrastructure panel, we will explore the network innovations needed for emerging artificial intelligence, super-, and quantum computing facilities in the context of the mission infrastructures and the American Science Cloud. Moreover, we will explore several testbed initiatives around the world that will support research and innovation on network infrastructure.
Workshop
Livestreamed
Recorded
TP
W
DescriptionScientific workflows increasingly involve both HPC and machine-learning tasks, combining MPI-based simulations, training, and inference in a single execution. Launchers such as Slurm’s srun constrain concurrency and throughput, making them unsuitable for dynamic and heterogeneous workloads. We present a performance study of RADICAL-Pilot (RP) integrated with Flux and Dragon, two complementary runtime systems that enable hierarchical resource management and high-throughput function execution. Using synthetic and production-scale workloads on Frontier, we characterize the task execution properties of RP across runtime configurations. RP+Flux sustains up to 930 tasks/s, and RP+Flux+Dragon exceeds 1,500 tasks/s with over 99.6% utilization. In contrast, srun peaks at 152 tasks/s and degrades with scale, with utilization below 50%. For IMPECCABLE.v2 drug discovery campaign, RP+Flux reduces makespan by 30--60% relative to srun/Slurm and increases throughput more than four times on up to 1,024. These results demonstrate hybrid runtime integration in RP as a scalable approach for hybrid AI-HPC workloads.
Workshop
Livestreamed
Recorded
TP
W
DescriptionExisting object storage systems like AWS S3 and MinIO offer only limited in-storage compute capabilities, typically restricted to simple SQL WHERE-clause filtering. Consequently, high-impact operators—such as aggregation and top-N—are still executed entirely at the compute layer. Recent advances in Object-based Computational Storage (OCS) enable these complex operators to run natively within storage, creating opportunities for substantial reductions in data movement and query time. To demonstrate these benefits in distributed SQL engines, we used Presto as a case study and developed the Presto-OCS connector, which analyzes execution plans to identify pushdown-eligible operators and offloads them to OCS for efficient in-storage execution. Evaluations with real-world HPC analytics queries and the TPC-H benchmark show that our approach achieves up to 4.07× speedup and 99% data movement reduction compared to filter-only pushdown. When combined with compression techniques, our approach delivers 1.39× speedup over compressed filter-only pushdown, demonstrating that advanced query pushdown complements existing optimizations.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWith the end of Moore's law and Dennard scaling, efficient training increasingly requires rethinking data volume. Can we train better models with significantly less data via intelligent subsampling? To explore this, we develop SICKLE, a sparse intelligent curation framework for efficient learning, featuring a novel maximum entropy (MaxEnt) sampling approach, scalable training, and energy benchmarking. We compare MaxEnt with random and phase-space sampling on large direct numerical simulation (DNS) datasets of turbulence. Evaluating SICKLE at scale on Frontier, we show that subsampling as a preprocessing step can, in many cases, improve model accuracy and substantially lower energy consumption, with observed reductions of up to 38x.
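One simple, entropy-motivated flavor of subsampling is sketched below for intuition (the feature, bin count, and sample sizes are invented, and this is not SICKLE's MaxEnt scheme): keeping points from rare bins with higher probability flattens the feature histogram, that is, raises its entropy.
```python
# Inverse-frequency subsampling as a crude entropy-raising preprocessor (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
feature = rng.standard_normal(100_000)          # stand-in for a per-sample flow quantity

counts, edges = np.histogram(feature, bins=64)
bin_of = np.clip(np.digitize(feature, edges[1:-1]), 0, 63)
w = 1.0 / np.maximum(counts[bin_of], 1)         # rare bins get higher keep probability
w /= w.sum()
keep = rng.choice(feature.size, size=5_000, replace=False, p=w)

def entropy(values):
    """Shannon entropy of the feature histogram on the shared bin edges."""
    h, _ = np.histogram(values, bins=edges, density=True)
    p = h[h > 0] / h[h > 0].sum()
    return float(-(p * np.log(p)).sum())

print(entropy(feature), "->", entropy(feature[keep]))   # subsample has a flatter, higher-entropy histogram
```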
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionHigh performance computing (HPC) schedulers must balance runtime and power. We present a surrogate-assisted multi-objective Bayesian optimization (MOBO) framework using TabNet regressors and models trained on attention-based embeddings, coupled with active-learning sample selection. The surrogates predict runtime and power, enabling MOBO to efficiently discover Pareto-optimal node allocations. We quantify trade-offs with Pareto fronts, hypervolume (HV), and Spread across PM100 and Adastra production traces. MOBO improves HV over single-objective baselines by 24% (PM100) and 37% (Adastra) and attains lower Spread in 75% of surrogate families. Active learning reduces evaluations by ~53%–70%. To our knowledge, this is the first demonstration of embedding-informed surrogates for MOBO applied to HPC job scheduling traces, optimizing runtime–power trade-offs on production datasets.
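As a small illustration of the runtime-power trade-off analysis, the sketch below extracts the Pareto-optimal (non-dominated) points from a synthetic set of candidate allocations; the poster's surrogates, hypervolume computation, and MOBO loop are not reproduced here.
```python
# Pareto-front extraction for two minimization objectives (synthetic data, for intuition).
import numpy as np

def pareto_front(points):
    """Return the non-dominated rows of an (n, 2) array of (runtime, power)."""
    keep = []
    for i, p in enumerate(points):
        dominated = any((q <= p).all() and (q < p).any()
                        for j, q in enumerate(points) if j != i)
        if not dominated:
            keep.append(i)
    return points[keep]

rng = np.random.default_rng(0)
cand = rng.uniform([10, 200], [60, 800], size=(50, 2))   # hypothetical (runtime s, power W) per allocation
front = pareto_front(cand)
print(len(front), "non-dominated configurations out of", len(cand))
```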
Workshop
Livestreamed
Recorded
TP
W
DescriptionDisaggregation is an emerging compute paradigm that splits existing monolithic servers into a number of consolidated single-resource pools that communicate over a fast interconnect. This model decouples individual hardware resources and enables the creation of logical compute platforms with flexible and dynamic hardware configurations. The concept of disaggregation is driven by various recent trends in computation. From an application perspective, the increasing importance of data analytics and machine learning workloads brings unprecedented need for memory capacity, which is in stark contrast with the growing imbalance in the peak compute-to-memory capacity ratio of traditional system board based servers. At the hardware front, the proliferation of heterogeneous, special-purpose computing elements promotes the need for composable platforms, while the increasing maturity of optical interconnects elevates the prospects of distance independence in networking infrastructure. The workshop intends to explore various aspects of resource disaggregation, composability, and their implications for future HPC platforms.
Workshop
Livestreamed
Recorded
TP
W
DescriptionIntroduction to the session, including outline and goals.
Workshop
Partially Livestreamed
Partially Recorded
TP
W
Students@SC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionThe afternoon session of the HPC and AI Crash course (12:30 PM–3:30 PM) explores the fundamentals of artificial intelligence, starting with an overview of machine learning and deep learning. Students will then complete guided, hands-on AI challenges using HPC resources, including Anvil and optionally ORNL’s Frontier system. This is a great opportunity for students to build practical AI skills with real supercomputing access.
**Register here (limited to 100 students): https://forms.gle/c5n89rbJXCGR19E3A
Students@SC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionStart your SC25 experience on Sunday, November 16, with an engaging introduction to high performance computing (HPC). The morning session (8:30 AM–11:30 AM) covers programming environments, job schedulers, and key parallel programming models for CPUs and GPUs. After the overview, students will dive into hands-on HPC challenges using Purdue’s Anvil supercomputer. This session is ideal for those new to HPC or looking to sharpen their foundational skills.
**Register here (limited to 100 students): https://forms.gle/c5n89rbJXCGR19E3A
Tutorial
Not Livestreamed
Not Recorded
TUT
DescriptionQuantum computing offers the potential to revolutionize high performance computing by providing a means to solve certain computational problems faster than any classical computer. Relatively recently, quantum computing has advanced from a theoretical possibility to engineered reality, with commercial entities offering early prototype quantum processors representing a variety of qubit technologies and computational paradigms. The media have been showcasing each new development and implicitly conveying the message that quantum computing ubiquity is nigh. Here, we will respond to this hype and provide an overview of the exciting but still early state of the field. We introduce participants to the computational models underlying quantum computing. We work through examples of its immense computational power while highlighting what the quantum computing community still does not know in terms of quantum algorithms and where the power of quantum computing comes from. We examine the thought processes that programmers use to map problems to circuit-model quantum computers, quantum annealers, measurement-based quantum systems, analog Rydberg atom arrays, and other recent inventions in the quantum computing space. We conclude with an overview of the hardware and algorithmic challenges that must be overcome before quantum computing becomes a component of the HPC developer's repertoire.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe propose a conditional normalizing flow (CNF) surrogate model to solve generative, many-to-one inverse problems in scientific simulations governed by partial differential equations (PDEs) with time-evolving interactions between heterogeneous materials. We present two case studies: electrostatic potential and heat diffusion, which serve as proxy simulations for generating diverse sets of initial conditions that can reproduce an observed output state (transient or steady). Finally, we provide a comprehensive overview of the synthetic datasets, the model specification, each stage of the experimental workflow, evaluation of training performance, and uncertainty quantification for the generated samples.
Workshop
Education & Workforce Development
Livestreamed
Recorded
TP
W
DescriptionHigh-Performance Computing clusters for research computing, hosted by universities, are essential for the institution's ongoing teaching, learning, and research. The learners have a range of experience and comfort with such platforms and require support regularly. To assist users on the Unity Research Computing Platform, the support team provides a Slack channel to get help, find relevant documentation, learn new information, and troubleshoot. However, this mode of support requires significant staff time. This study explores the design and implementation of an AI assistant aimed at augmenting existing support and helping users in their Self-Regulated Learning process, directing them to relevant learning resources, and answering simple questions. We discuss the Human-Centered AI Design and testing process and its significance for large-scale interventions.
Workshop
Partially Livestreamed
Partially Recorded
TP
W
DescriptionMixture of Experts (MoE) allows one to increase the model capacity with minimal training/inference cost. Recent LLMs such as Qwen3-235B-A22B, gpt-oss-120B, Kimi-K2, GLM-4.5, DeepSeek-R1 are very sparse MoEs, though there are some subtle differences in the details of the architecture. The first part of this talk will focus on our recent efforts to measure the effect of sparsity on memorization tasks and reasoning tasks. We initially find that increasing the total parameters without increasing the active parameters increases the performance on memorization tasks but shows an inverse scaling on reasoning tasks. However, when the dataset is carefully constructed we show that the inverse scaling on reasoning tasks disappears. The second part of this talk will describe a tool to estimate the memory consumption during distributed training, and its effectiveness when trying to maximize per GPU Flop/s on a given system.
Workshop
Not Livestreamed
Not Recorded
Partially Livestreamed
Partially Recorded
TP
W
DescriptionThe future of sustainable HPC depends on reducing energy consumption without sacrificing performance, especially with today’s diversification of accelerators. While advancements in hardware, workload management, and system implementation will dramatically influence energy savings, software must also play a key role. Recent advances with large language models (LLMs) show promise for automated code generation. However, most efforts prioritize functionality over energy efficiency, portability, and architecture-specific optimization. We discuss our latest progress in LASSI, an automated LLM-driven refactoring framework that generates translations and energy-efficient code on target parallel systems for given parallel code as input. Through multi-stage iterative refinement incorporating self-prompting, domain-specific context, and self-correcting feedback loops, LASSI demonstrates effectiveness across multiple device architectures as evaluated through functional equivalence metrics and expected energy reductions in generated codes. Such capabilities are paving the way toward AI-driven automation in heterogeneous HPC code development with a focus on performant and portable parallelized scientific software.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis talk will focus on high-performance communication software for GPU supercomputers. It will explain NCCL and NVSHMEM, including their historical context from MPI and SHMEM. The functionality and performance will be demonstrated through an example from linear algebra. Real-world results from both scientific and commercial AI use cases will be described.
Workshop
Livestreamed
Recorded
TP
W
DescriptionSoftware for heterogeneous systems: integration of (S+D+L) by h3-Open-BDEC enables a significant reduction in computation and power consumption compared to conventional simulations. In January 2025, we began operating the Miyabi system together with the University of Tsukuba. Miyabi consists of a GPU cluster with 1,120 NVIDIA GH200 nodes (Miyabi-G) and 380 sockets of Intel Max 9480 with HBM2e. In this talk, we will introduce activities related to AI-for-Science through integration on heterogeneous systems such as Wisteria/BDEC-01 and Miyabi. Recently, RIKEN launched an international initiative with Fujitsu and NVIDIA to develop "FugakuNEXT", which is based on heterogeneous compute nodes consisting of CPUs by Fujitsu and GPUs by NVIDIA. This presentation will also introduce the prospects for developing AI-for-Science applications on the FugakuNEXT system, based on our experiences with Wisteria/BDEC-01 and Miyabi.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe convergence of AI and large-scale scientific workflows presents a transformative opportunity to accelerate discovery across domains. However, this integration is challenged by fragmented data lifecycles, heterogeneous infrastructure, and the need for scalable orchestration frameworks that are both AI- and HPC-aware. In this talk, I will present an end-to-end vision and practical strategies for enabling AI-ready scientific workflows at scale. This includes integrating domain-specific foundation models, automating data staging across distributed resources, and leveraging adaptive workflow systems to optimize performance, cost, and energy usage. Drawing from real-world use cases within DOE science domains, I will outline a community roadmap for building interoperable, scalable, and FAIR-aligned ecosystems that support both traditional simulations and next-generation AI models in federated environments.
Workshop
Livestreamed
Recorded
TP
W
DescriptionIn this talk, I will describe my perspective as a researcher for 40 years on the topic of parallel programming models. Given the time constraints, I will only present a small sample of programming models, including my group's, and focus on broader themes. Before MPI, there were parallel logic and functional languages, and then precursors to MPI such as nxlib. These, along with the process of developing the MPI standard and its consequences, will be discussed. Among the many alternative models that were developed, some were meant as competitors, some raised the level of abstraction, and some supported specialization. We will review a few models that survive today, including Charm++, Chapel, HPX, and a few more. We will examine what the future programming model landscape may look like and what issues will shape it, including but not focusing solely on the needs of machine learning, data analytics, and accelerators.
Workshop
Livestreamed
Recorded
TP
W
DescriptionSandia National Laboratories is charting a full-stack path that takes experimental accelerator hardware from prototype testbeds into the high-performance production systems running our mission codes. In the first part of this talk, I’ll survey our hardware prototyping journey—from the Advanced Architecture Testbed lineage, through the Kingfisher wafer-scale AI/HPC engine and our El Dorado deployment, to the Vanguard systems Astra and Spectra (in partnership with NextSilicon). In the second part, I’ll dive into two software initiatives I co-lead, each designed to fuse these novel accelerators into production workflows with performance portability:
• CommBench/HiCCL, a unified micro-benchmarking and hierarchical collective-communication framework for multi-GPU, multi-NIC nodes
• A Kokkos execution space for the NextSilicon Maverick-2 dataflow accelerator
Workshop
Livestreamed
Recorded
TP
W
DescriptionHPC, AI, and Quantum Computing have emerged as essential capabilities for future computing. The impact of integration of these capabilities will be revolutionary for the whole spectrum of research and enterprise computing. In this presentation we will discuss some of the potential impacts of such integration and describe approaches to the ideal deployment of these heterogeneous computing platforms.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe RISC-V ISA, with the RVA23 compatibility profile, is enabling innovation to meet the power-performance-area (PPA) needs of the most demanding computing workloads. Condor Computing presents Cuzco, a novel RISC-V processor IP design for inclusion in customer System-on-Chips (SoCs), which will compete with current top-of-the-line processors in high-end data center and high-performance computing (HPC) applications while delivering better PPA results. This paper describes how Cuzco’s design demonstrates that a rapidly maturing RISC-V ecosystem provides the Instruction Set Architecture (ISA), tools, and platforms which, when enhanced with internal efforts, can achieve high PPA efficiencies without sacrificing performance. Cuzco is a new class of processor design built around novel compiler-like instruction scheduling for runtime execution, enabling high performance while optimizing power and area through dynamic and physical scaling, including dynamic and/or physical optimization of resources for dispatch and retire width. The baseline 8-way design achieves a performance of 15–20 SpecInt2K6/GHz.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe field of AI (including Machine Learning (ML), Deep Learning (DL), Big Data, and Data Science) is rapidly evolving. The effective development and usage of many AI models and the associated inference schemes depend on a good understanding of the underlying HPC hardware and software technologies. Thus, it is becoming a challenge for students and professionals to have a holistic understanding of this new field. In this context, I will share experiences from the following four initiatives in which I am engaged: 1) a semester-long course on 'High-Performance Deep/Machine Learning' for combined undergraduate and graduate students at the Ohio State University; 2) a two-series, 14-week course (developed through NSF funding) on 'AI Bootcamp for Cyberinfrastructure Professionals' working in many different HPC centers; 3) a half-day/full-day conference tutorial on "Principles and Practice of High-Performance Deep/Machine Learning"; and 4) nurturing next-generation students for democratizing AI through the NSF-funded ICICLE (icicle.ai) Institute. An overview of these initiatives and the associated approaches will be presented.
Workshop
Livestreamed
Recorded
TP
W
DescriptionScientific applications have long been a driving force in parallel and distributed computing, as their ever-growing demand for performance, scalability, and data handling has consistently pushed the boundaries of computational technologies. Traditionally, these applications have been executed within a single high-performance computing (HPC) cluster, relying on tightly coupled parallelism and localized resource management. In recent years, this landscape has changed dramatically. Increasing data volumes, the emergence of edge and IoT devices, and the growing reliance on cloud services have collectively transformed the execution model of scientific workloads. Today, applications are no longer confined to isolated clusters but increasingly span large-scale, geo-distributed infrastructures that extend from sensors at the edge to powerful HPC systems in the cloud. This emerging environment—often referred to as the Continuum—presents unique challenges in heterogeneity, dynamism, and coordination, while also offering opportunities for pervasive, resilient, and efficient scientific computing.
In this talk, we present COLMENA, a programming model tailored for swarm computing within the Continuum. At its core, the COLMENA runtime establishes a collaborative, peer-to-peer environment where each device operates as an Autonomous Agent (ANT), fully aware of both its computational resources and its contextual circumstances. The programming model offers abstractions to define the roles and functionalities that compose an application, as well as the mechanisms through which these roles interact—whether by exchanging messages, sharing data, or distributing computational workload. During execution, ANTs can make autonomous or consensus-based decisions on which roles to assume, dynamically adapting to the application’s requirements. This decentralized approach allows COLMENA to unlock the full potential of the Continuum while maintaining ease of programmability, flexibility, and scalability.
Workshop
Livestreamed
Recorded
TP
W
DescriptionScientific research increasingly depends on the movement, management, and analysis of massive data volumes. Globus, a widely used research IT platform, addresses these needs by providing secure, reliable, and high-performance capabilities for data management, computation, and workflows across global research cyberinfrastructure. Serving more than 700,000 researchers and applications across 60,000 active data collections in over 80 countries, Globus has become a critical enabler of data-intensive science. In this talk, I will highlight two ways in which Globus supports innovation and sustainability in research computing. First, I will describe a framework built on Globus that integrates error-bounded lossy compression into data transfers, using machine learning-based quality estimation and optimized transfer strategies to achieve performance improvements while maintaining user-specified quality. Second, I will discuss how Globus itself provides a model for sustainable research software, via hybrid cloud and "freemium" subscription approaches that balance accessibility with long-term viability.
Workshop
Livestreamed
Recorded
TP
W
DescriptionOne of the biggest challenges in scientific supercomputing today is the increasing complexity of workflows, driven by the rapid increase in AI adoption in all aspects of scientific computing. These trends place new demands on traditional HPC infrastructure, which now needs to support workflows combining data preparation and movement, AI model training and inference, analysis, and large-scale simulations. In this talk, I will describe the complex AI-HPC workflows that are driving the design of Doudna, NERSC’s next supercomputer, and the design of a broader architecture across DOE to support an Integrated Research Infrastructure (IRI). Through case studies and real-life challenges, I will describe how AI-driven requirements translate to technical innovations for Doudna and IRI.
Workshop
Livestreamed
Recorded
TP
W
DescriptionUnderstanding the evolution of the Universe from its earliest moments to the present is one of the central goals of modern physics. Achieving this requires not only next-generation observational instruments but also the ability to simulate the Universe at extreme scale. These large-scale cosmological simulations, run on leadership-class supercomputers, generate massive datasets and require complex, multi-step workflows for analysis and interpretation. In this talk, I will present a recent suite of simulations performed with the Hardware/Hybrid Accelerated Cosmology Code (HACC), a highly scalable code designed for performance on heterogeneous architectures. To support analysis at scale, we developed OpenCosmo, a cross-platform workflow system that orchestrates simulation data processing across multiple supercomputing environments. Building on this foundation, we are now integrating an agentic AI system that simplifies and automates intricate workflows, supports adaptive exploration, and enables more intuitive human–machine interaction with the data. While still in early stages, this integration of HPC, workflow technologies, and AI illustrates the emerging paradigm of intelligent scientific computing and its potential to accelerate discovery in complex domains like cosmology.
Workshop
Livestreamed
Recorded
TP
W
DescriptionModern HPC and AI workloads expose fundamental limitations in traditional operating system designs. Shared kernels introduce unpredictable performance interference, virtualization imposes unacceptable overhead, and monolithic architectures waste resources on unnecessary features.
Multikernel architectures address these challenges by providing each application with a dedicated, customized kernel instance. Combined with elastic resource management, this approach delivers predictable performance through kernel-level isolation, near-native performance without hypervisor overhead, and automatic workload-specific optimization for both HPC and AI.
This talk examines why current hardware capabilities and workload demands create the right conditions to fundamentally rethink OS design, and presents the architectural principles and system design of our multikernel proposal.
Workshop
Livestreamed
Recorded
TP
W
DescriptionFrontier AI models have crossed a threshold. They no longer merely assist scientists, but now co-design not only which questions to pursue, but how to pursue them. This keynote examines how we can accelerate scientific discovery using these advanced models. Drawing an analogy to Amdahl’s Law, we’ll see how extraordinary speed-ups in hypothesis generation, simulation, and data interpretation collide with bottlenecks in chemistry, fabrication, and field observation, forcing a strategic rebalancing of the entire research pipeline. We’ll explore embedding human values in autonomous goal setting, preserving trust and reproducibility amid synthetic data, and redesigning the workforce to align automated cognition with irreplaceable human judgment. Last, we’ll introduce high-level considerations and concrete actions to collectively explore how we can navigate this rapidly changing landscape.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe will start the discussion with a landmark achievement: the first global simulation of the full Earth system at a 1.25 km grid spacing. Our talk will focus on how we used the Alps supercomputer to model the intricate flow of energy, water, and carbon across the atmosphere, ocean, and land. We will detail the heterogeneous setup and optimization techniques that enabled us to achieve an exceptional time compression of 82.5 simulated days per day, allowing for extensive studies of the Earth system. Throughout the talk, we focus on programmability and performance of the languages used and the frameworks employed. Specifically, we will describe the use of the Data-Centric Parallel Programming (DaCe) framework developed in Switzerland. We show how this reduced code complexity by half while increasing both performance and portability. Finally, we will put this all into context of emerging AI methods and systems and provide an outlook into how to combine physics-based simulation and data-driven AI methods.
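For readers unfamiliar with the framework, a minimal example of DaCe's Python frontend is sketched below; the axpy kernel and array sizes are illustrative only and are unrelated to the Earth-system model code itself.

    # Minimal DaCe example (illustrative only; requires the `dace` package).
    import dace
    import numpy as np

    N = dace.symbol("N")  # symbolic size: one program serves any array length

    @dace.program
    def axpy(a: dace.float64, x: dace.float64[N], y: dace.float64[N]):
        y[:] = a * x + y  # lowered to a data-centric parallel dataflow graph

    x = np.random.rand(1024)
    y = np.random.rand(1024)
    axpy(2.0, x, y)  # JIT-compiles for the local target and executes in place
    print(y[:4])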
Workshop
Livestreamed
Recorded
TP
W
DescriptionEarly demonstrations of operations relevant to quantum error correction and fault-tolerant quantum computation signal that the noisy intermediate-scale quantum (NISQ) era might be coming to an end. Vendor roadmaps indicate that gigaquop machines with hundreds of logical qubits capable of executing circuits with billions of operations might exist before the end of the decade. But many of the most detailed quantum resource estimates suggest that achieving certain notions of quantum utility might require teraquop computers. I will argue that there is a surprising dearth of prospective applications, even for quantum computers at this scale. Worse yet, I will suggest that there is an application gap and that there appear to be even fewer "useful" things that we can do with mega and gigaquop machines. Nevertheless, there is cause for optimism and I hope to make a compelling case for why the answer to the question in the title is actually a cautious and qualified "yes".
Birds of a Feather
Storage
Livestreamed
Recorded
TP
XO/EX
DescriptionThe IO500 is the de facto benchmarking standard for HPC storage. We have released official lists at ISC and SC events since SC17 and now have over 300 entries. The purpose is to foster the IO500 community and ensure forward progress towards the common goals of creating, sharing, and benefiting from a large corpus of shared storage data. IO500 also serves as the largest repository of detailed HPC storage information for researchers and system designers to analyze and evaluate over time. A key highlight is the presentation of the latest Research and Production IO500 lists.
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionIntroducing IQM Halocene: the open and transparent system designed to unlock quantum error correction.
We will only reach fault-tolerant quantum computing if we build ecosystems together. That means you need full control, so you can develop your own research and tools with and for the hardware you are using as a testbed—with the support of IQM's expertise. Today we introduce our next product line for the quantum error correction era, the first step towards fault tolerance. IQM Halocene is a modular and versatile platform that allows you to develop and own your research results, especially in the field of quantum error correction.
Unlike black-box systems, IQM Halocene enables full transparency, pulse-level access, and integration with open-source FTQC stacks. It lets users innovate, publish, and commercialize new quantum technologies directly on a high-performance system.
Birds of a Feather
State of the Practice
Livestreamed
Recorded
TP
XO/EX
DescriptionDOE has recently launched the Integrated Research Infrastructure (IRI) program, which is designed to enable new modes of integrated science across DOE user facilities. Common or unified interfaces are needed for these workflows to seamlessly orchestrate resources across high performance computing, data, and network providers.
The IRI Interfaces working group, composed of members of ASCR user facilities, has spent the last year designing and implementing APIs for compute facilities. Here we’d like to present our current development status, deployed prototypes, and future roadmap for discussion. We are seeking feedback from the user community to help guide these IRI efforts.
Exhibitor Forum
Data Analytics
Livestreamed
Recorded
TP
XO/EX
DescriptionWith the bones of the software turning 30 this year, the iRODS Consortium is pleased to present iRODS 5. iRODS 5 provides a new process model with standard service manager communication and faster startup, a unified server configuration file, modernized TLS support, access time tracking, delay rule locking, and a new GenQuery parser. These changes, along with the use of distribution-provided packages for many underlying libraries, show a continued commitment to open-source, production-ready, backwards-compatible, policy-based data management.
Alongside the server updates, multiple updates to client libraries and client applications provide new ways to interact with the existing iRODS ecosystem. These include updates to the HTTP API, OpenID Connect support, the S3 API, Cyberduck, an MCP server, Metalnx, the Zone Management Tool, and more.
The iRODS platform is ready to provide the foundation of your enterprise analytics and AI integration efforts. Build on what is already proven.
Workshop
Livestreamed
Recorded
TP
W
DescriptionHPC resources are becoming increasingly complex, while HPC itself is becoming more popular among novice researchers across a wide range of research domains. These novice researchers often lack typical HPC skills, which results in a steep learning curve that leads to frustration and inefficient use of HPC resources. To address this, we developed the Drona Workflow Engine. Drona offers an intuitive Graphical User Interface that assists researchers in running their scientific workflows. The researcher provides the required information for their specific scientific workflow, and Drona generates all the scripts needed to run that workflow. For transparency, Drona displays all generated scripts in a fully editable preview window, allowing the researcher to make any final adjustments as needed. Drona also provides a flexible framework for importing, creating, adapting, and sharing custom scientific workflows. Drona significantly enhances researcher productivity by abstracting the underlying HPC complexities while letting researchers retain full control over their workflows.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe pace of RISC-V adoption continues to grow rapidly, yet despite the successes enjoyed in areas such as embedded computing, RISC-V has yet to gain ubiquity in High Performance Computing (HPC). The Sophon SG2044 is SOPHGO's next-generation 64-core high-performance CPU designed for workstation- and server-grade workloads. Building upon the SG2042, subsystems that were a bottleneck in the previous generation have been upgraded.
In this paper we undertake the first performance study of the SG2044 for HPC. Comparing against the SG2042 and other architectures, we find that the SG2044 is most advantageous when running at higher core counts, delivering up to 4.91x greater performance than the SG2042 across 64 cores. Two of the most important upgrades in the SG2044 are support for RVV v1.0 and an enhanced memory subsystem. These result in the SG2044 significantly closing the performance gap with other architectures, especially for compute-bound workloads.
Birds of a Feather
Standards
Livestreamed
Recorded
TP
XO/EX
DescriptionC++ was named the second most popular programming language in 2025, according to the TIOBE Index of language popularity. C and C++ together account for 79.4% of parallel programming language usage, based on the Hyperion Research HPC Briefing at ISC 2021.
C++26 is an exciting release for the HPC C++ developer community, with new features including reflection, contracts, erroneous behavior, linear algebra, SIMD, and structured concurrency, amongst others.
This BoF will pull together important leaders and contributors within the ISO C++ Standards Committee who are responsible for key features such as ML, executors, mdspan, inplace_vector, library, concurrency, parallelism, and GPU support.
Workshop
Livestreamed
Recorded
TP
W
DescriptioniSTaRT - in Silico Targeted Radionuclide Therapy: Designing Inhibitor-Chelator Conjugates
Targeted radiopharmaceutical therapy (TRT) offers a precise and potent cancer treatment modality by delivering radioactive payloads directly to tumor cells. However, designing effective TRT agents - those that are cell-permeable, stable, and tumor-specific - remains a multifactorial challenge involving molecular, cellular, and tissue-level considerations. To address this, we developed iSTaRT (in Silico Targeted Radionuclide Therapy): an HPC-enabled framework that unites generative AI, multiscale simulation, and multicellular agent-based modeling to accelerate the design and optimization of TRT candidates.
Our pipeline begins with GEMMINI, a GenAI platform that generates linker molecules optimized for key physicochemical and ADMET properties and customized property predictors. Generated molecules are filtered using toxicity screens and permeability heuristics based on target membrane permeability ranges to ensure selectivity between oncogenic and normal cells.
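A schematic of this screening step is sketched below; the property names, scores, and thresholds are placeholders invented for illustration rather than the pipeline's actual predictors or cutoffs.

    # Schematic post-generation filter; names and thresholds are illustrative only.
    candidates = [
        {"id": "linker-001", "tox_score": 0.12, "perm_oncogenic": 3.1e-6, "perm_normal": 4.0e-8},
        {"id": "linker-002", "tox_score": 0.71, "perm_oncogenic": 1.0e-6, "perm_normal": 9.0e-7},
    ]

    def passes_screen(mol, tox_max=0.3, selectivity_min=10.0):
        # keep low-toxicity molecules that preferentially permeate oncogenic membranes
        selective = mol["perm_oncogenic"] / mol["perm_normal"] >= selectivity_min
        return mol["tox_score"] <= tox_max and selective

    shortlist = [m["id"] for m in candidates if passes_screen(m)]
    print(shortlist)  # -> ['linker-001']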
We then use LipidLure, an HPC-based multiscale molecular dynamics pipeline, to compute the membrane permeability of selected complex TRT constructs across asymmetric bilayers reflective of normal versus oncogenic cells. These simulations, comprising roughly 15 µs of sampling on Frontier, leverage the Inhomogeneous Solubility-Diffusion (ISD) model to compute permeability coefficients and membrane transport energetics.
For radionuclide-chelator stability, iSTaRT incorporates membrane-aware QM/MM simulations to evaluate actinium-DOTA complex binding across distinct bilayer environments. This approach captures relativistic and electronic effects unique to Ac3+ that surrogates like La3+ fail to model, and quantifies environmental free energy penalties (ΔΔG) that affect stability in low-dielectric regions such as lipid cores.
Crucially, our framework links molecular and physical modeling with multicellular agent-based modeling (ABM) to simulate therapeutic impact within a digital twin of the tumor microenvironment. ABM enables spatial modeling of TRT diffusion, cellular uptake, and radiation-induced damage across heterogeneous cancer cell populations. This allows us to assess compound efficacy, selectivity, and synergy with radiosensitizers under realistic biological scenarios.
Our proof-of-principle centers on Ac-225–labeled sotorasib analogs targeting oncogenic protein KRAS G12C. By combining GenAI molecule generation with HPC-scale physical modeling and multicellular simulation, we can downselect high-performing candidates in days - substantially reducing the timeline for early-stage radiopharmaceutical design. These prioritized constructs are now advancing to experimental validation.
iSTaRT exemplifies how HPC can ignite innovation in cancer care by bridging molecular design, physical modeling, and systems-level prediction into a cohesive framework. The approach is modular and extensible, enabling application to other cancer targets, payloads, and patient-specific digital twins, with a long-term vision of guiding individualized therapy design through simulation.
Workshop
Livestreamed
Recorded
TP
W
DescriptionSpiking Neural Network processing promises to provide high energy efficiency due to the sparsity of spiking events. However, when realized on general-purpose hardware -- such as a RISC-V processor -- this promise can be undermined by inefficient code, stemming from the repeated use of basic instructions to update all the neurons in the network. One possible solution to this issue is the introduction of a custom ISA extension with neuromorphic instructions for spiking neuron updates, realized as a bespoke hardware expansion of the existing ALU. In this paper, we present the first step towards realizing a large-scale system based on the RISC-V-compliant processor called IzhiRISC-V, supporting the custom neuromorphic ISA extension.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionOur JACC poster for SC25 presents a completely updated version of the Best Poster finalist at SC24, showcasing the latest added features in the JACC library and ecosystem for productive scientific computing. First, we describe the new and stable JACC API components: (i) a portable memory model, (ii) kernel launching, and (iii) CPU/GPU backend selection without code changes. Second, we present two added features: (i) JACC’s shared, for exploiting cached shared memory among threads, and (ii) JACC’s Multi module to program nodes with an increasing number of GPUs. Third, we present JACC ports of five science applications: XSBench, miniBUDE, LULESH, BabelStream, and Hartree–Fock, showing performance comparisons against C++ programming models for the first three on NVIDIA’s A100 and H100, and AMD’s MI100 and MI250X (Frontier’s) GPUs. Our work shows that as JACC and Julia continue to mature they allow developing performance-portable science codes at a fraction of the cost.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionIn current large-scale computing systems, users from various scientific backgrounds submit batch jobs with a set of requested resources. Manual resource selection in HPC facilities leads to early job terminations and out-of-memory errors due to underestimation of resources, or to compute and memory resources sitting idle because of overallocation. In this work, we provide a recommendation framework based on job grouping and intelligent prediction methods to provision HPC application resource needs before jobs are submitted to the system. Our work achieves less than 2% of cases experiencing underpredicted resource requests, and results in fewer overestimations compared to the baseline methods. We also implement a module to deploy the framework on a real HPC system, which forms the basis of our future plans for this work.
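A toy sketch of the underlying idea (learn a predictor from historical job features, then pad its estimate so that underprediction stays rare) is shown below; the features, model choice, and 10% margin are assumptions made for illustration, not the framework's actual design.

    # Toy illustration of resource-request prediction with a safety margin;
    # features, model, and the 10% pad are assumptions, not the poster's setup.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.random((500, 3))                              # e.g., input size, core count, job-group id
    y = 4.0 + 20.0 * X[:, 0] + rng.normal(0.0, 0.5, 500)  # peak memory in GB

    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[:400], y[:400])

    predicted = model.predict(X[400:])
    recommended = predicted * 1.10                        # pad to keep out-of-memory errors rare
    under = float(np.mean(recommended < y[400:]))
    print(f"underpredicted fraction: {under:.3f}")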
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe present a scheme and script for combining system-independent scripts with system-specific localization information, making it easy to write job scripts that run on any system. The basic structure of such scripts is discussed, with templates showing how to set up new job scripts and localizations, and examples of use. The paper discusses an implementation of a script that automates the combining process.
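The combining mechanism can be pictured as simple template substitution; the sketch below uses Python's string.Template with invented directives, launchers, and module names, and is not the script described in the paper.

    # Schematic of merging a system-independent job body with per-system
    # localization; directives, launchers, and module names are invented examples.
    from string import Template

    generic_lines = [
        "#!/bin/bash",
        "$directive -N $nodes",
        "module load $modules",
        "$launcher ./my_app input.dat",
    ]
    generic_job = Template("\n".join(generic_lines))

    localizations = {
        "clusterA": {"directive": "#SBATCH", "launcher": "srun", "modules": "gcc cray-mpich"},
        "clusterB": {"directive": "#PBS", "launcher": "mpirun", "modules": "intel openmpi"},
    }

    print(generic_job.substitute(nodes=4, **localizations["clusterA"]))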
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionThe video shows a computational fluid dynamics (CFD) simulation of the evolution of Taylor-Green vortices over time. The Taylor-Green vortex test case is a common setup for benchmarking and validating CFD solvers. The flow is initialized with sine wave velocities. Vortices decay over time, showing the different scales of turbulence.
The simulation at the center of the visualization was performed by ASiMoV-ccs (https://github.com/asimovpp/asimov-ccs), a CFD and combustion code designed for large-scale simulations. The simulation was run on the UK's national supercomputer ARCHER2 (https://www.archer2.ac.uk). The visualization sheds new light on the ubiquitous Taylor-Green vortex test case and offers a new perspective on it, putting the complexity of fluid flow on full display.
Tutorial
Livestreamed
Recorded
TUT
DescriptionThis hands-on tutorial introduces participants to the Julia language and demonstrates its growing role in scientific computing and high performance computing (HPC). We will give a brief introduction to the Julia language, targeting a beginner audience. Participants will gain familiarity with Julia’s syntax, multiple dispatch, package environment, and code parallelization. Following the introduction, we will dive into Julia for HPC, showcasing its ability to express high-performance workloads with minimal effort. We will explore Julia’s support for shared-memory parallelism using multithreading, NVIDIA and AMD GPUs, distributed-memory parallelism via MPI.jl, and performance portability layers. Julia combines the productivity of high-level languages with low-level performance, thanks to its LLVM-based JIT compilation. Attendees will have access to NERSC’s systems to explore hands-on examples involving computation, communication, parallel I/O, and data analysis. All materials will be made publicly available, and we will maintain a Slack channel to offer continued support and answer participants’ questions after the event.
Birds of a Feather
Programming Frameworks
Livestreamed
Recorded
TP
XO/EX
DescriptionThe Julia for HPC BoF provides a place for the HPC community with interest in the Julia programming language as an LLVM front-end for science to close the gap between high-productivity languages and the performance of compiled languages. We invite participants from industry, government, and academia to discuss their experiences, identify and learn about opportunities and gaps. Topics include: community, adoption and support in HPC facilities, and new areas like quantum computing. The proposed fourth consecutive BoF continues the Julia for HPC working group’s engagement with the SC community, and complements the accepted tutorials on Julia for HPC at SC25.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionJulia, a high-performance, high-level language, harnesses dynamic typing and LLVM’s Just-in-Time compiler to match the speed of C and Fortran in production. Meanwhile, IRIS serves as a heterogeneous runtime that discovers devices dynamically and schedules concurrent work on CPUs, GPUs, FPGAs, and DSPs today. Integrating Julia with IRIS unlocks high-performance, portable, and productive computing for workloads. This synergy simplifies kernel APIs for both data-parallel and task-parallel execution, and it also builds task graphs with intelligent flow-dependency detection via kernel analysis to optimize performance across multiple device types. We report early results of AXPY executing on CUDA GPUs today. A tiled heterogeneous math library for DGEMM uses vendor kernels, demonstrating the system’s versatility.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis talk focuses on the quantum computing integration strategy at the JUNIQ user facility at JSC, FZJ. The outline includes the main philosophy, current progress in HPC QC integration, and future plans. It provides an overview of user services offered, including visualization features such as continuous benchmark dashboards and job reporting tools.
Paper
Algorithms
HPC for Machine Learning
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionEfficient general matrix multiplication (GEMM) has attracted significant research attention in HPC and AI workloads. While large-scale GEMM has nearly achieved the peak floating-point performance of GPUs, substantial opportunities for optimization remain in small and batched GEMM operations.
In this paper we propose KAMI, a set of 1D, 2D, and 3D GEMM algorithms that extend the theory of communication-avoiding (CA) techniques within a single GPU. KAMI optimizes thread block-level GEMM by utilizing tensor cores as computational units, low-latency thread registers as local memory, and high-latency on-chip shared memory as a communication medium. We provide a theoretical analysis of CA performance from the perspective of GPU clock cycles, rather than the traditional execution time. Also, we implement SpMM and SpGEMM with this compute-communication pattern. Experimental results for general, low-rank, batched and sparse multiplication operations on the latest NVIDIA, AMD, and Intel GPUs show significant performance improvements over existing libraries.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe present KDRSolvers, a novel framework for representing sparse linear systems and implementing Krylov subspace methods on modern heterogeneous supercomputers. KDRSolvers uses dependent partitioning to uniformly represent sparse matrix storage formats as abstract maps between a matrix's domain, range, and set of nonzero entries. This enables KDRSolvers to define universal co-partitioning operators for matrices and vectors independent of underlying storage formats, allowing changes in data partitioning strategies to automatically propagate through an application with no code modification. KDRSolvers also introduces multi-operator systems in which matrix and vector data can be ingested and processed in multiple non-contiguous pieces without data movement. Our implementation of KDRSolvers, targeting the Legion runtime system, achieves greater flexibility and competitive performance compared to PETSc and Trilinos. In experiments with up to 1,024 GPUs on the Lassen supercomputer, our implementation achieves up to a 9.6% reduction in execution time per iteration.
Birds of a Feather
Programming Frameworks
Livestreamed
Recorded
TP
XO/EX
DescriptionThe OpenMP Architecture Review Board (ARB) releases new versions of OpenMP every two to three years. This year, between releases, we will share headline features planned for the OpenMP API version 6.1 ahead of its release before SC26. Feature champions from the OpenMP ARB and Language Committee will highlight OpenMP features in development, and attendees can provide their input on the standard ahead of the release. The BoF will have short lightning talks, and discussion rounds will give participants ample opportunity to interact with OpenMP experts, ask questions, and provide feedback. Vendor representatives will discuss support and timelines for OpenMP features.
Paper
Applications
GBC
Livestreamed
Recorded
TP
DescriptionKilometer-scale Earth system models (ESMs) necessitate exascale supercomputers to facilitate realistic simulations of weather phenomena and climate variability over a time span ranging from days to decades. We present AP3ESM, an ultra-high-resolution, AI-Powered, Performance-Portable ESM coupling atmosphere, land surface, ocean, and sea ice components. By leveraging the performance portability features of Kokkos and OpenMP, the AP3ESM operates efficiently on two heterogeneous systems while incurring minimal development overhead. Advanced optimization techniques, such as adaptive parallel algorithms, AI-enhanced physical parameterizations, and mixed-precision computations, have been implemented to further boost the computational efficiency. The standalone atmospheric and oceanic components of AP3ESM each reach 1-km resolution, attaining 0.60 and 1.98 simulated-years-per-day (SYPD) using 34.1 million cores and 16,085 GPUs, respectively. The holistic AP3ESM (3-km atmosphere, 2-km ocean) sustains 1.01 SYPD on 36.6 million cores. Notably, the forecast experiment successfully captures Super Typhoon Doksuri in 2023 and its associated extreme rainfall across China.
Workshop
Livestreamed
Recorded
TP
W
DescriptionProgramming irregular graph applications is challenging on today's scalable supercomputers.
We describe a novel programming model, KVMSR+UDWeave, that supports extreme scaling by exposing fine-grained parallelism. By enabling the expression of maximum parallelism, it opens the door to extreme scaling on both small and large graph problems.
KVMSR+UDWeave cleanly separates the three key dimensions of parallel programming: parallelism, computation binding, and data placement. This decomposition reduces the effort required to achieve scalable, high performance for graph algorithms on real-world, highly skewed graphs. Key features of the UpDown supercomputer (computation location naming and a shared global address space) enable this decomposition and scalable, high performance.
In the IARPA AGILE program, we built numerous graph benchmarks and workflows, and use them to illustrate the programming model. Simulation results for UpDown show excellent strong-scaling to million-fold hardware parallelism and high absolute performance. Results suggest KVMSR+UDWeave enables reduced programming effort for scaling the most demanding irregular applications.
Workshop
Livestreamed
Recorded
TP
W
DescriptionLarge Language Models (LLMs) are unreliable for making decisions due to their potential to hallucinate, and they are unable to perform complex tasks like running simulations that are essential to fields like Materials Science. We introduce LABMATE (LAnguage model Based Multi-agent system to Accelerate caTalysis Experiments), a human-in-the-loop copilot framework that utilizes LLM agents to make catalysis research faster. LABMATE allows human experts to run simulations, track particle sizes, run data analysis, conduct literature review, and generate potential hypotheses all in one framework, thereby expediting the research process. When evaluated on the major benchmarks, LABMATE performs comparably to or better than most frontier LLMs, showing that in addition to accelerating the experimental process, our framework is also on par in domain knowledge with using a simple LLM. Furthermore, since the core architecture of the system is domain-agnostic, it can easily be adapted to other domains.
Workshop
Performance Evaluation, Scalability, & Portability
Livestreamed
Recorded
TP
W
DescriptionSince its inception in 1995, LAMMPS has grown to be a world-class molecular dynamics code, with thousands of users, over one million lines of code, and multi-scale simulation capabilities. We discuss how LAMMPS has adapted to the modern heterogeneous computing landscape by integrating the Kokkos performance portability library into the existing C++ code. We investigate performance portability of simple pairwise, many-body reactive, and machine-learned force-field interatomic potentials. We present results on GPUs across different vendors and generations, and analyze performance trends, probing FLOPS throughput, memory bandwidths, cache capabilities, and thread-atomic operation performance. Finally, we demonstrate strong scaling on three exascale machines -- OLCF Frontier, ALCF Aurora, and NNSA El Capitan -- as well as on the CSCS Alps supercomputer, for the three potentials.
Workshop
Livestreamed
Recorded
TP
W
DescriptionA Large Language Model (LLM) can improve its performance in answering questions beyond its contextual understanding by running external tools, such as online queries for real-time weather. For scientific applications, this enables the LLM to perform and analyze simulation runs for more accurate answers. However, the increasing scale of scientific computing requires high-performance computing (HPC) systems, which are managed by job schedulers. In this work, we integrated Parsl with LangChain tool calling to bridge the gap between LLM agents and HPC resources. Two implementations were set up and tested on a local NVIDIA GPU workstation and the Polaris/ALCF HPC system. The LLM agent workflow was prompted to run molecular dynamics simulations with different protein structures and simulation conditions. The results show that our Parsl implementations enabled parallel execution of scientific tools invoked by LLM agents on both local workstations and HPC platforms.
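A minimal sketch of the bridging idea (exposing a simulation as a Parsl app so that calls issued by an agent execute in parallel) is given below; it uses Parsl's local-threads configuration and a placeholder function body, whereas the actual work targeted a GPU workstation and Polaris.

    # Minimal sketch: wrap a simulation as a Parsl app so agent-issued calls run
    # in parallel. Uses Parsl's local-threads config; the MD body is a placeholder.
    import parsl
    from parsl import python_app
    from parsl.configs.local_threads import config

    parsl.load(config)

    @python_app
    def run_md(structure: str, temperature_k: float) -> str:
        # placeholder for launching a real MD engine on the given protein structure
        return f"{structure}: trajectory computed at {temperature_k} K"

    # An agent that decides to simulate several structures simply calls the app:
    futures = [run_md(pdb, 300.0) for pdb in ["1UBQ", "2LYZ", "6VXX"]]
    print([f.result() for f in futures])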
Workshop
Livestreamed
Recorded
TP
W
DescriptionNear the full scale of exascale supercomputers, latency can dominate the cost of all-to-all communication even for very large message sizes. We describe GPU-aware all-to-all implementations designed to reduce latency for large message sizes at extreme scales, and we present their performance using 65536 tasks (8192 nodes) on the Frontier supercomputer at the Oak Ridge Leadership Computing Facility. Two implementations perform best for different ranges of message size, and all outperform the vendor-provided MPI_Alltoall. Our results show promising options for improving implementations of MPI_Alltoall_init.
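For reference, the baseline collective being optimized can be exercised from Python with mpi4py as in the sketch below; this shows only the standard MPI_Alltoall call on host buffers, not the GPU-aware implementations described here.

    # Baseline MPI_Alltoall via mpi4py (run with: mpiexec -n 4 python alltoall.py).
    # Host-buffer example only; not the GPU-aware variants described above.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    size = comm.Get_size()
    rank = comm.Get_rank()

    block = 4                                    # elements sent to each peer
    sendbuf = np.full(size * block, rank, dtype=np.float64)
    recvbuf = np.empty(size * block, dtype=np.float64)

    comm.Alltoall(sendbuf, recvbuf)              # each rank exchanges one block with every other
    print(rank, recvbuf[::block])                # first element received from each peer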
Panel
AI, Machine Learning, & Deep Learning
Power Use Monitoring & Optimization
SC Community Hot Topics
Livestreamed
Recorded
TP
DescriptionToday's AI supercomputers push power boundaries, often exceeding 200MW, far beyond traditional HPC. While HPC has pioneered operational efficiency for large-scale workloads, AI's explosive growth now accelerates innovations like photonics and small modular reactors (SMRs). This panel will discuss the profound impact of these high-density AI/HPC supercomputers on energy, CO2 emissions, and water usage. We'll explore AI's carbon footprint (currently 1.2%–1.5% of global electricity), the shift toward co-located energy generation, and water-efficient cooling strategies. Crucially, we'll examine AI as an HPC workload, the 100x greater AI hardware investment, and evolving benchmarking like "tokens per kilowatt-hour." The discussion will highlight changing roles for HPC in an AI-driven future, emphasizing collaboration and leveraging HPC expertise for sustainable, scalable AI growth.
Workshop
Partially Livestreamed
Partially Recorded
TP
W
DescriptionHigh-performance computing (HPC) is vital for advancing AI research in computer vision, where training on high-resolution datasets requires significant computational power. Using the NSF-funded Accelerating Computing for Emerging Sciences (ACES) testbed at Texas A&M University, we leveraged Graphcore IPUs, NVIDIA H100 GPUs, and A30 GPUs to conduct a large-scale empirical study of 170 model configurations spanning CNN-, GAN-, Transformer-, and Diffusion-based architectures. This enabled us to address an open question: Is underwater image enhancement (UIE) truly beneficial for underwater object detection? We trained five object detectors across 17 enhancement domains and two datasets. Results show that most UIE methods degrade detection accuracy, while select diffusion-based approaches that preserve key features can mitigate this drop. HPC resources also allowed us to compare GPU and IPU performance. These findings guide the practical use of UIE in marine vision and highlight the importance of equitable HPC access for large-scale AI research.
Paper
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionThe evolution of architectures, programming models, and algorithms is driving communication towards greater asynchrony and concurrency, usually in multithreaded environments. We present LCI, a communication library designed for efficient asynchronous multithreaded communication. LCI provides a concise interface that supports common point-to-point primitives and diverse completion mechanisms, along with flexible controls for incrementally fine-tuning communication resources and runtime behavior. It features a threading-efficient runtime built on atomic data structures, fine-grained non-blocking locks, and low-level network insights. We evaluate LCI on both Infiniband and Slingshot-11 clusters with microbenchmarks and two application-level benchmarks. Experimental results show that LCI significantly outperforms existing communication libraries in various multithreaded scenarios, achieving performance that exceeds the traditional multi-process execution mode and unlocking new possibilities for emerging programming models and applications. LCI is open-source and available at .
Early Career
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionThis session will introduce key principles of effective leadership for early career professionals, highlighting different leadership styles. Participants will learn practical tools to lead with confidence and create positive impact early in their careers.
Workshop
short paper
Livestreamed
Recorded
TP
W
DescriptionDistributed cloud environments hosting data-intensive applications often experience slowdowns from network congestion, asymmetric bandwidth, and inter-node data shuffling. These factors are typically not captured by host-level metrics such as CPU or memory. Scheduling without considering them can cause poor placement, longer transfers, and degraded job performance. We present a network-aware scheduler that uses supervised learning to predict job completion time. Our system collects real-time telemetry from all nodes, applies a trained model to estimate job duration per node, and ranks them to select the best placement. We evaluate the scheduler on a geo-distributed Kubernetes cluster deployed on the FABRIC testbed using network-intensive Spark workloads. Compared to the default Kubernetes scheduler, which relies on current resource availability alone, our supervised scheduler achieved 34–54% higher accuracy in selecting optimal nodes. The novelty of our work lies in demonstrating supervised learning for real-time, network-aware scheduling on a multi-site cluster.
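The placement step can be summarized as "predict completion time per candidate node from telemetry, then pick the minimum"; the features and model in the sketch below are illustrative assumptions, not the deployed system.

    # Sketch of network-aware placement: predict completion time per candidate
    # node and choose the fastest. Features and model are illustrative only.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(1)
    # training rows: [bandwidth_mbps, rtt_ms, cpu_free_fraction] -> runtime_s
    X = rng.random((400, 3)) * np.array([1000.0, 50.0, 1.0])
    y = 120.0 + 0.5 * X[:, 1] - 0.05 * X[:, 0] + rng.normal(0.0, 5.0, 400)
    model = GradientBoostingRegressor().fit(X, y)

    candidate_nodes = {
        "site-a": [900.0, 4.0, 0.6],
        "site-b": [150.0, 38.0, 0.9],
    }
    predicted = {n: float(model.predict(np.array([f]))[0]) for n, f in candidate_nodes.items()}
    best = min(predicted, key=predicted.get)
    print(best, predicted)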
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionScientific and data science applications demand increasing computational performance, requiring effective scheduling and load balancing on high performance computing (HPC) systems. While OpenMP libraries such as LB4OMP provide several scheduling algorithms, selecting the best one for a given application-system pair remains an open challenge. This work addresses the scheduling algorithm selection problem by investigating automated approaches that can adapt to diverse workloads and architectures.
We propose and evaluate two automated selection strategies: expert-based and reinforcement learning (RL)-based (sketched below). We use six applications and three systems to conduct the performance evaluation, revealing trade-offs between the methods' exploration overhead and the optimality of their selections. We further demonstrate that combining expert knowledge with RL improves overall performance.
With the poster, we will present the methodology, results, and insights of the expert- versus RL-based approaches. We highlight implications for future heterogeneous and multi-level systems and advertise the open-source library (LB4OMP) in which the methods were implemented.
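As a rough illustration of the RL-based strategy referenced above, the sketch below uses a toy epsilon-greedy selector over a list of scheduling algorithms; the algorithm names and timing model are invented, and this is not LB4OMP's interface.

    # Toy epsilon-greedy selection among scheduling algorithms; names and timings
    # are invented for illustration and do not reflect LB4OMP's interface.
    import random

    algorithms = ["static", "dynamic", "guided", "trapezoid", "factoring"]
    avg_time = {a: 0.0 for a in algorithms}
    counts = {a: 0 for a in algorithms}

    def choose(epsilon=0.2):
        untried = [a for a in algorithms if counts[a] == 0]
        if untried:
            return untried[0]                            # try every algorithm once
        if random.random() < epsilon:
            return random.choice(algorithms)             # explore
        return min(avg_time, key=avg_time.get)           # exploit the fastest so far

    def update(algo, measured_time):
        counts[algo] += 1
        avg_time[algo] += (measured_time - avg_time[algo]) / counts[algo]

    true_time = {"static": 1.3, "dynamic": 1.0, "guided": 0.9, "trapezoid": 1.1, "factoring": 0.95}
    for _ in range(30):                                  # one loop execution per time step
        algo = choose()
        update(algo, true_time[algo] + random.gauss(0.0, 0.05))

    print("selected:", min(avg_time, key=avg_time.get))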
Workshop
Livestreamed
Recorded
TP
W
DescriptionModern HPC runtime systems increasingly rely on sophisticated C++ template metaprogramming to achieve zero-cost abstractions and type safety. This paper presents lessons learned from implementing the AllScale runtime system, a production HPC framework that makes extensive use of variadic templates, SFINAE, and template specialization for distributed data management. Through analysis of our template-heavy architecture for data items, automatic serialization, and type-safe requirement systems, we identify key challenges, performance implications, and best practices for template metaprogramming in HPC contexts. Our findings show that careful template design can maintain compile-time performance while enabling zero-cost abstractions that achieve 92–105% of hand-tuned MPI performance. We provide concrete recommendations for managing template complexity, compilation overhead, and debugging challenges in production HPC systems.
Panel
AI, Machine Learning, & Deep Learning
Applications & Application Frameworks
Parallel Programming Methods, Models, Languages, & Environments
Livestreamed
Recorded
TP
DescriptionHPC and AI are now partners—but come from different cultures and traditions. Where can they align? How will scientists, algorithms, and AI models be trained? The International Post-Exascale Project (inpex.science) brings together experts from Europe, the United States, and Japan to tackle challenges in HPC beyond the exascale era. At a recent global workshop, participants explored AI-HPC convergence, software sustainability, digital continuum strategies, generative AI, and how scientists will be trained in AI—and by AI. In a world of AI coders, massive science-tuned LLMs, and vast stores of untapped data, how can our global community collaborate to move both HPC and AI forward? Join our panelists as they debate opposing viewpoints on the future of these two fields. Will GenAI turn today’s HPC developers into the COBOL coders of tomorrow—or will HPC scientists harness AI to accelerate discovery?
Tutorial
Livestreamed
Recorded
TUT
DescriptionLarge language models (LLMs) are progressing at an impressive pace. They are becoming capable of solving complex problems while presenting the opportunity to leverage their capabilities for scientific computing. Despite their progress, even the most sophisticated models can struggle with simple reasoning tasks and make mistakes, necessitating careful verification of their outputs. This tutorial focuses on these two important aspects: (1) leveraging LLMs to assist and advance scientific computing code translation, and (2) presenting best practices for evaluating and comparing LLMs within the scientific computing context. Designed specifically for students, researchers, and engineers at beginner and intermediate levels, this half-day tutorial features presentations and demos. Attendees learn the fundamentals of LLM design, development, and use cases for scientific computing. The tutorial deep-dives into one key topic: code translation (Fortran to C++) with the CodeScribe tool. Attendees also learn various complementary methods to test and evaluate LLM responses rigorously. At the end of the tutorial, attendees are equipped with solid foundations, knowledge, and practical experience to leverage and evaluate LLMs for scientific computing and to transform theoretical insights into actionable solutions.
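As a flavor of the translate-then-verify workflow the tutorial covers, the following is a hypothetical harness (not CodeScribe itself): the `llm_client.complete` call stands in for any chat/completion API, and the reference cases are assumed to be simple stdin/stdout pairs.

    import subprocess

    def translate_and_check(llm_client, fortran_source, reference_cases):
        """Hypothetical harness: translate a Fortran kernel, compile it, and check outputs."""
        prompt = ("Translate this Fortran kernel to standard C++17. "
                  "Preserve argument order and numerical behavior.\n\n" + fortran_source)
        cpp_source = llm_client.complete(prompt)
        with open("candidate.cpp", "w") as f:
            f.write(cpp_source)
        subprocess.run(["g++", "-O2", "-o", "candidate", "candidate.cpp"], check=True)
        for stdin_text, expected in reference_cases:
            out = subprocess.run(["./candidate"], input=stdin_text,
                                 capture_output=True, text=True).stdout
            if out.strip() != expected.strip():
                return False, "mismatch on reference case"
        return True, cpp_source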
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionOrganic semiconductors (OSCs) are promising for next-generation electronics, but polymorphism complicates accurate property prediction and makes traditional methods costly. We investigate transformer-based large language models (LLMs) for predicting energy gaps in polymorphic OSC crystals. A Pegasus-managed workflow is deployed across heterogeneous hardware (PSC Bridges-2 and Neocortex Cerebras CS-2) to evaluate three crystal text encodings: Materials String, SLICES, and SLICES-PLUS against a baseline XGBoost Regressor model. The results show that the LLM-analyzed Materials String achieves the highest accuracy, particularly in polymorph-rich datasets, outperforming other representations in both pretraining efficiency and downstream tasks, as well as the baseline XGBoost results. These findings highlight the potential of LLM-driven crystal encodings to accelerate materials discovery and enable the scalable, data-driven design of organic semiconductors.
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
DescriptionAccurately interpreting observations from the Event Horizon Telescope (EHT) requires general relativistic magnetohydrodynamics (GRMHD) simulations that can model increasingly complex physics. To address this need within an evolving and heterogeneous computing landscape, we present some results from KHARMA, a performance-portable GRMHD code built upon the Kokkos library.
We leverage KHARMA to implement two high-fidelity scientific models. First, by incorporating an extended GRMHD (EGRMHD) model for weakly-collisional plasma, we produce synthetic observables that provide a better fit to EHT observations of the Galactic Center. Second, we simulate black hole accretion in alternate theories of gravity, using the resulting electromagnetic signatures to place new constraints on deviations from General Relativity. The computational demands of these advanced physical models are made tractable by KHARMA's efficient, performance-portable, modular implementation.
Our work demonstrates how an extensible, performance-portable framework enables the generation of high-fidelity models that directly address key questions in black hole accretion physics. The resulting library of synthetic data not only constrains fundamental physics but also enables a more direct and robust comparison between complex theoretical models and observational data.
Workshop
AI, Machine Learning, & Deep Learning
Clouds & Distributed Computing
Performance Evaluation, Scalability, & Portability
Scientific & Information Visualization
Partially Livestreamed
Partially Recorded
TP
W
DescriptionHigh-performance computing (HPC) systems at the Exascale generate monitoring and simulation data at rates that exceed the capacity of traditional offline analysis, especially for change point detection (CPD) in large-scale scientific workflows. We present an adaptive in-situ sampling framework that combines truncated singular value decomposition (SVD)-based manifold learning with the Kernel Cumulative Sum (KCUSUM) method for statistical CPD. Implemented with MPI4py for scalable interprocess communication and ADIOS2 for high-throughput streaming I/O, the framework dynamically adjusts sampling rates in real time based on anomaly likelihood, enabling selective data retention without sacrificing scientific information.
Evaluations on synthetic datasets and large-scale molecular dynamics (MD) simulations from NWChem demonstrate up to 59% memory reduction, sub-second detection latency for critical events, and near-perfect detection accuracy, all while eliminating the need for storing full trajectories. These preliminary results highlight the framework's potential to deliver resource-efficient, real-time anomaly detection in data-intensive Exascale scientific computing environments.
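The control loop at the heart of this approach can be pictured with a small sketch; the threshold, strides, and the `anomaly_score` callable below are placeholders for illustration, not the KCUSUM statistics used in the paper.

    # Illustrative adaptive-sampling loop: retain more frames when the change-point
    # score is high, fewer when the stream looks stationary.
    def adaptive_sample(frames, anomaly_score, base_stride=16, hot_stride=1, threshold=2.0):
        retained = []
        for i, frame in enumerate(frames):
            score = anomaly_score(frame)
            stride = hot_stride if score > threshold else base_stride
            if i % stride == 0:
                retained.append((i, frame))
        return retained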
Workshop
AI, Machine Learning, & Deep Learning
Clouds & Distributed Computing
Performance Evaluation, Scalability, & Portability
Scientific & Information Visualization
Partially Livestreamed
Partially Recorded
TP
W
DescriptionSimulation codes are extensively used to design and operate particle accelerators.
Significant effort is invested towards digital twins that closely mirror a physical system and inform tuning of particle accelerators in real time, during operation. Some aspects of these digital twins benefit from using differentiable simulation codes that can automatically compute the gradient of their output with respect to certain input parameters. Gradients can also be used to find the optimal operating point of an accelerator (e.g. maximizing beam energy) and thereby inform tuning of the physical accelerator.
This talk will point out the crucial need for in-situ diagnostics in this type of gradient-based workflow. The modeling must reproduce the accelerator diagnostics in situ rather than in post-processing, so that gradients can propagate throughout the simulation pipeline. We demonstrate a real-world accelerator beamline model in PyTorch and describe how to incorporate differentiable diagnostics in a C++ code.
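To illustrate why the diagnostic must live inside the differentiable pipeline, here is a minimal PyTorch sketch; the thin-lens quadrupole and drift model, the beam-size objective, and the parameter values are simplified assumptions, not the talk's actual beamline.

    import torch

    # Toy differentiable "beamline": one quadrupole strength k acts on particle
    # coordinates; the diagnostic (beam size at a screen) is computed in the same
    # graph, so d(beam_size)/dk is available for gradient-based tuning.
    torch.manual_seed(0)
    particles = torch.randn(1000, 2)              # columns: position x, angle x'
    k = torch.tensor(0.5, requires_grad=True)

    def beamline(p, k, drift=1.0):
        x, xp = p[:, 0], p[:, 1]
        xp = xp - k * x                            # thin-lens quadrupole kick
        x = x + drift * xp                         # drift to the diagnostic screen
        return x

    opt = torch.optim.Adam([k], lr=0.05)
    for _ in range(200):
        opt.zero_grad()
        beam_size = beamline(particles, k).std()   # in-graph (in-situ) diagnostic
        beam_size.backward()
        opt.step()
    print(float(k), float(beam_size))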
Workshop
AI, Machine Learning, & Deep Learning
Clouds & Distributed Computing
Performance Evaluation, Scalability, & Portability
Scientific & Information Visualization
Partially Livestreamed
Partially Recorded
TP
W
DescriptionHigh Performance Computing (HPC) simulations often run on GPUs while in situ rendering tasks are offloaded to CPUs. Deciding how frequently to perform these in situ renderings is a challenge: render too frequently, and the simulation (the producer) oversaturates the rendering pipeline (the consumer); render too sparsely, and the CPU resources remain idle. This project seeks to develop machine learning (ML) models that can predict rendering times based on available system resources as well as simulation parameters. By leveraging ML-driven insight, the goal of this project is to analyze the tradeoffs, determine an optimal rendering interval for simulations (in this instance, nekRS instrumented with Ascent), and ensure balanced workloads between the simulation and rendering tasks, ultimately improving overall computational efficiency.
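A minimal sketch of the interval-selection idea follows; the predictor callable and the timing values are placeholders, not the project's trained models or measured costs.

    # Pick the smallest rendering interval (in simulation steps) such that the
    # predicted CPU render time fits inside the producer's compute window.
    def choose_interval(predict_render_seconds, sim_step_seconds, features, max_interval=64):
        render = predict_render_seconds(features)       # ML-predicted render cost
        for interval in range(1, max_interval + 1):
            if render <= interval * sim_step_seconds:   # consumer keeps up with producer
                return interval
        return max_interval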
Workshop
Livestreamed
Recorded
TP
W
DescriptionLossy compression is widely used to reduce storage and transmission costs in large-scale scientific data, but it inevitably introduces artifacts that may compromise subsequent analysis. To address this issue, we propose a lightweight 3D convolutional architecture with a fixed-scale batch normalization strategy, ensuring stable training and fast inference. We further analyze the trade-offs related to network size and highlight an empirical relationship between the minimum achievable MSE loss and the corresponding training cost. We also validate the generalizability of the network.
Experimental results on five representative scientific lossy compressors and datasets from four diverse scientific domains demonstrate that our method consistently improves reconstruction quality: MSE is reduced by one to four orders of magnitude, while keeping the inference time comparable to the compression runtime. A network trained on a single file generalizes well to other files within the same data set.
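For readers who want a concrete picture of what a lightweight 3D convolutional architecture can look like, here is a small PyTorch sketch; the layer sizes, the residual structure, and the fixed-scale normalization stand-in are assumptions for illustration, not the authors' exact network.

    import torch
    import torch.nn as nn

    class DecompressArtifactNet(nn.Module):
        """Illustrative residual 3D CNN: input is a decompressed block, output a corrected block."""
        def __init__(self, channels=16):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv3d(1, channels, kernel_size=3, padding=1),
                nn.BatchNorm3d(channels, affine=False),   # fixed-scale stand-in
                nn.ReLU(inplace=True),
                nn.Conv3d(channels, channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv3d(channels, 1, kernel_size=3, padding=1),
            )

        def forward(self, x):
            return x + self.body(x)                       # predict a residual correction

    block = torch.randn(1, 1, 32, 32, 32)                 # one decompressed 32^3 block
    print(DecompressArtifactNet()(block).shape)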
Paper
Algorithms
HPC for Machine Learning
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionQuantization is a critical technique for accelerating LLM inference by reducing memory footprint and improving computational efficiency. Among various schemes, 4-bit weight and 8-bit activation quantization (W4A8) offers a strong balance between accuracy and performance. However, existing W4A8 GEMM kernels fall short in practice due to inefficient dequantization on CUDA Cores, which cannot keep pace with the high throughput of Tensor Cores. In this paper, we present LiquidGEMM, a hardware-efficient W4A8 GEMM kernel for efficient LLM serving. LiquidGEMM designs two key techniques: LiquidQuant, a hardware-efficient quantization method that enables fast, overflow-safe dequantization using just two arithmetic instructions per four elements; and an implicit fine-grained pipeline that fully overlaps weight loading, dequantization, and MMA across warp groups without software synchronization or redundant memory traffic. Experimental results show that LiquidGEMM achieves up to 2.90x speedup over state-of-the-art W4A8 kernels and up to 4.94x end-to-end system-level speedup.
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionSodalite cages are structural units found in the mineral sodalite and related materials like zeolites. Machine learning models are used to understand growth processes and atomic-level dynamical transitions in silica. Naturally occurring and synthetic zeolites have numerous applications, including water purification, catalysis in oil refining, and as components in detergents.
Workshop
Livestreamed
Recorded
TP
W
DescriptionModern scientific discovery increasingly relies on workflows that process data across the Edge, Cloud, and High Performance Computing (HPC) continuum. Comprehensive and in-depth analyses of these data are critical for hypothesis validation, anomaly detection, reproducibility, and impactful findings. Although workflow provenance techniques support such analyses, at large scale, the provenance data become complex and difficult to analyze. Existing systems depend on custom scripts, structured queries, or static dashboards, limiting data interaction. In this work, we introduce an evaluation methodology, reference architecture, and open-source implementation that leverages interactive Large Language Model (LLM) agents for runtime data analysis. Our approach uses a lightweight, metadata-driven design that translates natural language into structured provenance queries. Evaluations across LLaMA, GPT, Gemini, and Claude, covering diverse query classes and a real-world chemistry workflow, show that modular design, prompt tuning, and Retrieval-Augmented Generation (RAG) enable accurate and insightful LLM agent responses beyond recorded provenance.
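A stripped-down sketch of the natural-language-to-query step is shown below; the provenance schema, prompt, and `llm` callable are illustrative assumptions, not the reference architecture's actual interfaces.

    import json

    SCHEMA = "tasks(task_id, name, status, started, ended), files(file_id, task_id, path, size_bytes)"

    def provenance_query(llm, question):
        """`llm` is any callable that returns text for a prompt (e.g., a chat API wrapper)."""
        prompt = (f"Provenance schema: {SCHEMA}\n"
                  f"Write a single SQL query answering: {question}\n"
                  "Return JSON: {\"sql\": \"...\"}")
        reply = llm(prompt)
        return json.loads(reply)["sql"]

    # Example runtime intent a user might type:
    # provenance_query(llm, "Which tasks produced files larger than 1 GB in the last hour?")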
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
DescriptionTraining large language models (LLMs) at scale generates significant I/O, and vendor guidance typically recommends provisioning performance based on the supply side: peak bandwidth required to keep GPUs busy. These recommendations often overstate requirements though, since they assume ideal GPU utilization. The demand side, or the I/O performance that training jobs actually drive, is not as well characterized. Drawing on telemetry from production VAST systems underpinning some of the world’s largest AI training supercomputers, we analyzed over 85,000 checkpoints from 40 production LLM training jobs and found that even trillion-parameter models require only a few hundred GB/s for efficient checkpointing. From these observations, we derive a simple, demand-side model that relates LLM size and checkpoint interval to the global bandwidth needed. This model offers a way to avoid overprovisioning I/O and to maximize the resources (power, cooling) that can go towards compute.
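The demand-side relationship can be summarized with simple arithmetic; the 16 bytes per parameter (weights plus optimizer state under mixed precision) and the tolerable-stall fraction below are assumptions for illustration, not the calibrated model derived from the VAST telemetry.

    def checkpoint_bandwidth_gbps(params_billion, interval_seconds,
                                  bytes_per_param=16, tolerable_stall_fraction=0.05):
        """Rough demand-side estimate: the checkpoint must drain within the fraction
        of each interval we are willing to spend on I/O."""
        checkpoint_gb = params_billion * bytes_per_param           # GB (1e9 params * bytes)
        write_window = interval_seconds * tolerable_stall_fraction  # seconds available for I/O
        return checkpoint_gb / write_window                         # GB/s of global bandwidth

    # e.g., a 1-trillion-parameter model checkpointed every 30 minutes:
    print(round(checkpoint_bandwidth_gbps(1000, 30 * 60), 1), "GB/s")

Under these illustrative assumptions the estimate lands in the low hundreds of GB/s, consistent with the scale of requirement described in the abstract.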
SCinet
Not Livestreamed
Not Recorded
DescriptionIn this work, we propose a large language model (LLM)-based framework for optimization algorithm selection to address the growing complexity of high-performance, multi-domain network orchestration. The proposed solution introduces a context-aware abstraction layer where LLMs analyze network logs, service requests, and algorithm metadata to dynamically select the most suitable optimization strategy. We validate our framework through a prototype deployed on the FABRIC FAB international testbed and through simulations across diverse scenarios (eMBB, URLLC, mMTC, V2X, AR/VR), showing that LLM-driven selection achieves higher success rates and SLA compliance while balancing efficiency, accuracy, and inference latency. Our preliminary results demonstrate the feasibility of this method and highlight its potential to enable scalable, adaptive, and privacy-preserving orchestration.
Workshop
short paper
Livestreamed
Recorded
TP
W
DescriptionThe rapid growth of Artificial Intelligence (AI) applications, particularly through the widespread adoption of Large Language Models (LLMs), has caused unprecedented growth in computing and network infrastructures. Current infrastructure expansion cannot keep pace, resulting in suboptimal performance. This creates an urgent need for network automation capable of dynamically orchestrating services and exploiting all available resources. Manual optimization processes are slow, error-prone, and unable to meet the requirements of complex, multi-domain, and data-intensive networks. A fundamental challenge is the absence of a universal optimization algorithm that performs effectively across all scenarios. In this paper, we present preliminary work on an LLM-based optimization algorithm selection framework for multi-domain, high-performance network orchestration. The proposed framework uses LLM-generated descriptive embeddings of algorithms, network state logs, and service requests to identify the most suitable optimization method from a pool of algorithms, tailoring the optimization to the current scenario.
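One simple way to picture embedding-based selection is a cosine-similarity match between the request context and each algorithm's description; the sketch below assumes a hypothetical `embed` callable and illustrative algorithm "cards", and is not the framework's actual selection logic.

    import numpy as np

    def select_algorithm(embed, algorithm_cards, request_text, state_log):
        """`embed` is any text-embedding callable returning a vector; `algorithm_cards`
        maps algorithm names to short descriptions (names and text are illustrative)."""
        query = np.asarray(embed(request_text + "\n" + state_log))
        best, best_score = None, -1.0
        for name, description in algorithm_cards.items():
            vec = np.asarray(embed(description))
            score = float(query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec)))
            if score > best_score:
                best, best_score = name, score
        return best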
Workshop
Debugging & Correctness Tools
HPC Software & Runtime Systems
Livestreamed
Recorded
TP
W
DescriptionFloating-point inconsistencies across compilers can undermine the reliability of numerical software. We present LLM4FP, the first framework that uses Large Language Models (LLMs) to generate floating-point programs specifically designed to trigger such inconsistencies. LLM4FP combines Grammar-Based Generation and Feedback-Based Mutation to produce diverse and valid programs. We evaluate LLM4FP across multiple compilers and optimization levels, measuring inconsistency rate, time cost, and program diversity. LLM4FP detects over twice as many inconsistencies compared to the state-of-the-art tool, Varity. Notably, most of the inconsistencies involve real-valued differences, rather than extreme values like NaN or infinities. LLM4FP also uncovers inconsistencies across a wider range of optimization levels, and finds the most mismatches between host and device compilers. These results show that LLM-guided program generation improves the detection of numerical inconsistencies.
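The inconsistency check itself is conceptually simple; a hypothetical harness (not LLM4FP's actual pipeline) might compile a generated program at two optimization levels and compare the printed values, as sketched below with assumed gcc flags and a whitespace-separated float output format.

    import subprocess

    def differs_across_opt_levels(c_source_path, flags=("-O0", "-O3"), rel_tol=1e-12):
        """Compile the same generated program under two flags and compare numeric outputs."""
        outputs = []
        for flag in flags:
            exe = f"prog_{flag.strip('-')}"
            subprocess.run(["gcc", flag, "-o", exe, c_source_path, "-lm"], check=True)
            out = subprocess.run([f"./{exe}"], capture_output=True, text=True).stdout
            outputs.append([float(tok) for tok in out.split()])
        a, b = outputs
        if len(a) != len(b):
            return True
        return any(abs(x - y) > rel_tol * max(abs(x), abs(y), 1.0) for x, y in zip(a, b))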
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
DescriptionCheckpointing is essential for fault tolerance in training large language models (LLMs). However, existing methods, regardless of I/O strategies, periodically store the entire model and optimizer states, incurring substantial storage overhead and contention. Recent studies reveal that updates across LLM layers are highly non-uniform. During training, some layers may undergo more significant changes, while others remain stable or even unchanged. This suggests that selectively checkpointing only layers with significant updates could reduce overhead without harming training. Implementing such strategies requires fine-grained control over both weights and optimizer states, which no current tool provides. To address this gap, we propose LLMTailor, a checkpoint-merging framework that filters and assembles layers from different checkpoints to form a composite checkpoint. Our evaluation indicates that LLMTailor can work with different checkpointing strategies and effectively reduce checkpoint size (e.g., 4.3 times smaller for Llama3.1-8B) and checkpoint time (e.g., 2.8 times faster for Qwen2.5-7B) while maintaining model quality.
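The layer-selection idea can be pictured in a few lines; the relative-drift metric and threshold below are illustrative assumptions, not LLMTailor's actual policy or on-disk format.

    import torch

    def tailor_checkpoint(prev_state, curr_state, threshold=1e-3):
        """Keep only layers whose relative change since the last saved checkpoint is significant;
        the composite checkpoint is the previous state overlaid with these deltas."""
        delta = {}
        for name, curr in curr_state.items():
            prev = prev_state[name]
            drift = torch.norm((curr - prev).float()) / (torch.norm(prev.float()) + 1e-12)
            if drift > threshold:
                delta[name] = curr.clone()
        return delta                          # persist this instead of the full state

    def restore(prev_state, delta):
        merged = dict(prev_state)
        merged.update(delta)
        return merged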
Workshop
Livestreamed
Recorded
TP
W
DescriptionLLVM, winner of the 2012 ACM Software System Award, has become an integral part of the software development ecosystem for optimizing compilers, dynamic language execution engines, source code analysis and transformation tools, debuggers, linking, and a whole host of programming language and toolchain-related components. The recent surge in AI development has further proven the efficacy of the LLVM infrastructure, as many predominant AI/ML compilation systems deployed in practice leverage the MLIR framework to exploit high-level semantics provided by their frontends, while maintaining a production grade and high-performance software stack. Research in, and implementation of, program analysis, compilation, optimization and profiling have clearly benefited from the availability of a high-quality, freely available infrastructure on which to build. This workshop will focus on recent developments, from both academia and industry, that build on the LLVM ecosystem to advance the state of the art in high-performance computing.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionSimulating wave propagation with the Fourier collocation method is computationally intensive due to its reliance on discrete Fourier transforms (DFTs). While DFTs enable near-minimal spatial discretization, they scale poorly on modern high performance computing systems. This work evaluates two multi-GPU strategies for three-dimensional simulations: a Global FFT approach using distributed transforms, and a Local FFT approach based on domain decomposition with halo exchanges. Experiments were performed on a system with eight NVIDIA A100 GPUs connected via NVSwitch. Precision tests show that the Local FFT approach maintains errors around 0.1% when the halo covers the local PML region. Performance results demonstrate that the Local FFT approach achieves lower runtimes and significantly reduced communication overhead compared to the Global FFT approach, particularly for larger domains. These findings indicate that Local FFT decomposition is a promising strategy for scalable, large-scale multi-node ultrasound simulations.
Paper
Algorithms
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
State of the Practice
Livestreamed
Recorded
TP
DescriptionDistributed training of large deep-learning models often leads to failures, so checkpointing is commonly employed for recovery. State-of-the-art studies focus on frequent checkpointing for fast recovery from failures. However, it generates numerous checkpoints, incurring substantial costs and thus degrading training performance. Recently, differential checkpointing has been proposed to reduce costs, but it is limited to recommendation systems, so its application to general distributed training systems remains unexplored.
This paper proposes LowDiff, an efficient frequent-checkpointing framework that reuses compressed gradients (commonly used in distributed training) as differential checkpoints to reduce cost. Furthermore, LowDiff incorporates a batched gradient write optimization to efficiently persist these differentials to storage. It also dynamically tunes both the checkpoint frequency and the batching size to maximize performance. Experiments on various workloads show that LowDiff can achieve a checkpointing frequency of up to once per iteration with less than 3.1% overhead on training time.
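The reuse idea can be pictured as follows; the top-k sparsification and the SGD-style replay are simplified assumptions for illustration, not LowDiff's exact compression or recovery mechanism.

    import torch

    def topk_gradient(grad, k_fraction=0.01):
        """Sparsify a gradient the way many distributed-training compressors do."""
        flat = grad.flatten()
        k = max(1, int(k_fraction * flat.numel()))
        _, indices = torch.topk(flat.abs(), k)
        return indices, flat[indices]

    # Differential checkpoint: persist only (indices, values) per step; recovery
    # replays the sparse updates on top of the last full checkpoint.
    def recover(full_weights, sparse_updates, lr):
        w = full_weights.clone().flatten()
        for indices, values in sparse_updates:
            w[indices] -= lr * values
        return w.view_as(full_weights)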
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis paper provides an overview of the multi-image parallel features in Fortran 2023 and their implementation in the LLVM flang compiler and the Caffeine parallel runtime library. The features of interest support a Single-Program, Multiple-Data (SPMD) programming model based on executing multiple “images”, each of which is a program instance. The features also support a Partitioned Global Address Space (PGAS) in the form of “coarray” distributed data structures. The paper discusses the lowering of multi-image features to the Parallel Runtime Interface for Fortran (PRIF) and the implementation of PRIF in the Caffeine parallel runtime library. This paper also provides an early view into the design of a new multi-image dialect of the LLVM Multi-Level Intermediate Representation (MLIR). We describe validation and testing of the resulting software stack, and demonstrate that performance compares favorably to another open-source compiler and runtime library: GNU Compiler Collection (GCC) gfortran and OpenCoarrays, respectively.
Paper
Algorithms
Applications
State of the Practice
Livestreamed
Recorded
TP
DescriptionLight source facilities, which generate X-rays for probing microstructures and dynamic processes, produce intense data streams, reaching up to 250 GB/s and projected to exceed 1 TB/s by the end of this decade. Managing such massive data poses critical challenges due to limited local processing capacity and bandwidth constraints when offloading data to HPC systems. To address these challenges, we propose lsCOMP, a GPU compressor that operates within a single kernel. lsCOMP supports both lossless and configurable lossy compression, ensuring high compression ratios and preserved data quality across diverse light source applications. On one NVIDIA A100 GPU, lsCOMP achieves compression throughputs of 380.89 to 509.21 GB/s in lossless mode, delivering up to 20 times higher performance than industry-leading GPU compressors while achieving superior compression ratios. In lossy modes, lsCOMP further improves throughput and ratios significantly. Additionally, lsCOMP demonstrates versatile performance across various integer datasets and supports TB/s-level random access throughput.
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionThis presentation introduces two advanced cooling solutions designed to meet the demands of modern, high-density data centers: the LTA Sidecar and the Rear Door Heat Exchanger (RDHx).
We’ll explore how the LTA Sidecar offers modular, rack-attached liquid-to-air cooling—ideal for scalable deployments and retrofits—while the RDHx uses liquid cooling at the rack rear to efficiently remove heat before it enters the room.
Key takeaways include:
• Improved energy efficiency and thermal performance
• Reduced reliance on room-level HVAC
• Deployment flexibility and sustainability benefits
Early Career
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Workshop
Livestreamed
Recorded
TP
W
Workshop
Partially Livestreamed
Partially Recorded
TP
W
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
Workshop
Performance Evaluation, Scalability, & Portability
Livestreamed
Recorded
TP
W
Workshop
Livestreamed
Recorded
TP
W
Birds of a Feather
Storage
Livestreamed
Recorded
TP
XO/EX
DescriptionLustre is the leading open-source and open-development file system for HPC. Eight of the top 10 and 64% of the top 100 systems on the most recent Top500 list use Lustre. It is a community-developed technology with contributors from around the world. Lustre supports many HPC infrastructures such as research, finance, energy, and manufacturing. Lustre clients are available for instruction set architectures such as x86, POWER, and ARM.
At this BoF, Lustre users, developers, administrators, and solution providers will gather to ask questions and discuss recent Lustre developments and challenges, including the role of Lustre in AI and its use in cloud environments. People new to Lustre will get a feel for the power of this HPC shared file system.
Best Poster Presentations (Research, ACM SRC Grad/Undergrad)
Research & ACM SRC Posters
TP
DescriptionIn this poster we present Luthier, the first open-source dynamic binary instrumentation framework targeting AMD GPUs. We highlight key features of our framework, including example use cases and runtime overhead comparison with NVIDIA’s NVBit. We also go over some major enhancements under development in the latest version of Luthier that support more of the growing family of AMD GPUs and additional instrumentation scenarios.
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionMachine Dreaming is an interactive installation that explores the perceptual and social dynamic between humans and AI, highlighting the tensions and intimacies that surface in the experience of being “seen” by these systems. The video depicts a viewer interacting with the installation, watching as their image gradually transforms, shifting into AI’s “perception” of them, inviting reflection on what it means to be seen and interpreted by an LLM.
By using real-time video and AI-generated visuals, Machine Dreaming initially reveals a recognizable reflection of the viewer themselves. As the viewer moves within the space, they notice their image gradually morph into an alien-like intermediary state, then into plant-like forms that move and transform in real time. Through semi-ambiguous forms, the work allows for both abstraction and coherence—as plants can appear both alien and organic, yet beautiful simultaneously—lending itself to exploring unfamiliar yet resonant representations of the self.
The work simultaneously allows the viewer to feel in control while also being subtly guided by the system as the projection shifts back and forth from them into alien plant-like forms based on their movements, gestures, and interaction with the installation.
By foregrounding this interplay between human presence and algorithmic interpretation, Machine Dreaming highlights the conditions of being seen and mediated by intelligent systems. The work prompts viewers to reflect on their relationship to AI and how perception, agency, and representation are negotiated with these systems.
Workshop
Livestreamed
Recorded
TP
W
DescriptionDeep learning recommendation models (DLRMs) rely on massive embedding tables that often exceed GPU memory capacity. Tiered memory offers a cost-effective solution but creates challenges for managing irregular access patterns. We introduce RecMG, an ML-guided caching and prefetching system tailored for DLRM inference. RecMG uses separate models for short-term reuse and long-range prediction, with a novel differentiable loss to improve accuracy. In large-scale deployments, RecMG reduces on-demand fetches by up to 2.8× and cuts inference time by up to 43%.
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionThis star formation simulation data shows how the rotating gas winds up the magnetic field around a forming protostar. All protostellar systems have angular momentum/rotation—this is how accretion disks/planetary systems form. Various details of the magnetic field geometry visualized here in response to this rotation are interesting in protostar research.
Paper
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionErasure coding is widely adopted to maintain data reliability, yet it introduces a significant update penalty. We analyze real-world traces and observe several challenges that are not addressed by existing studies, which thereby restrict the performance gains. We propose FastUpdate, an efficient multi-stripe updates framework that assists existing update schemes for fast updates. FastUpdate comprises three key designs: (1) it perceives the update locality and carefully merges multiple update requests accessing the same stripe to reduce the incurred network traffic; (2) it abstracts the existing update schemes into collector selection and tree construction, greedily generates the update solution for each stripe to balance the transmission load across nodes; (3) it dynamically schedules appropriate stripes to update in heterogeneous and dynamic networks to fully saturate the bandwidth resources. Comprehensive evaluations verify the effectiveness of FastUpdate on Alibaba ECS. It can increase the update throughput by 16.15%-88.71% for various update schemes.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe vision of self-driving networks (AIOps) hinges on the ability to develop production-ready machine learning models—models that are not only performant but also generalizable, robust, and trustworthy. Yet, most ML artifacts in networking today remain underspecified, suffering from shortcut learning, spurious correlations, and out-of-distribution failures rooted in data deficiencies. This talk traces our journey toward addressing these challenges by closing the loop between model analysis and data generation. I will present a closed-loop ML pipeline—composed of Trustee for model analysis and NetUnicorn, NetReplica, and NetGent for programmable data generation—that iteratively fixes underspecification by generating "better" data. Building on this foundation, I will discuss our efforts toward developing network foundation models (NFMs) that leverage self-supervised learning on large-scale network telemetry to unify diverse tasks, and toward reasoning about the generalizability of these NFMs. Finally, I will highlight emerging opportunities for using these programmable substrates to reimagine network operations and network measurements—solving unexplored learning problems in networking and revisiting previously explored ones with a fresher perspective.
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
DescriptionTo solve increasingly complex problems more efficiently, modern HPC systems feature highly heterogeneous components: CPUs, GPUs, and recently QPUs (quantum processing units), each with a unique, complex compute topology. The massive parallelism of GPUs, combined with emerging memory technologies on CPUs and GPUs, makes the memory topologies increasingly heterogeneous, complex, and dynamically configurable. Understanding these topological details, especially regarding available memory and its usage, is essential to operating the systems and applications efficiently.
This thesis presents a framework targeting several fundamental gaps in the currently available research and tooling: sys-sage, MT4G, GPUscout, and Mitos modeling. At the core, the sys-sage library offers a unified approach to maintaining static and dynamic topological information from different sources and APIs. Its universal architecture handles CPUs, GPUs, and QPUs alike. MT4G provides an otherwise unavailable, vendor-agnostic, and complete report on GPU memory topologies, integrable with sys-sage. GPUs' massive parallelism amplifies the potential performance penalties of improper cache and memory usage. Therefore, GPUscout identifies root causes of frequently occurring memory-related bottlenecks, helping users efficiently utilize the complex memory subsystem of GPUs. Finally, to address emerging memory technologies, such as CXL.mem, this thesis presents a novel data access modeling workflow as an extension of Mitos. The model predicts the performance impact of CXL.mem-based cross-node shared-buffer data exchange as an alternative to point-to-point MPI communication. Altogether, these tools capture topologies of HPC systems and provide missing insights into application data transfer behavior.
Tutorial
Livestreamed
Recorded
TUT
DescriptionModern scientific software stacks rely on thousands of packages, from low-level libraries in C, C++, and Fortran to higher-level tools in Python and R. Scientists must deploy these stacks across diverse environments, from personal laptops to supercomputers, while tailoring workflows to specific tasks. Development workflows often require frequent rebuilds, debugging, and small-scale testing for rapid iteration. In contrast, preparing applications for large-scale HPC production involves performance-critical libraries (e.g., MPI, BLAS, LAPACK) and machine-specific optimizations to maximize efficiency. Managing these varied requirements is challenging. Configuring software, resolving dependencies, and ensuring compatibility can hinder both development and deployment. Spack is an open-source package manager that simplifies building, installing, and customizing HPC software stacks. It offers a flexible dependency model, Python-based syntax for package recipes, and a repository of over 8,500 packages maintained by more than 1,500 contributors. Spack is widely adopted by researchers, developers, cloud platforms, and HPC centers worldwide. This tutorial introduces Spack’s core capabilities, including installing and authoring packages, configuring environments, and deploying optimized software on HPC systems. Attendees will gain foundational skills for automating routine tasks and acquire advanced knowledge to address complex use cases with Spack.
Paper
Data Analytics, Visualization & Storage
Livestreamed
Recorded
TP
DescriptionLossless compression is a classic technique for reducing data storage and transmission requirements. Asymmetric numeral systems (ANS) is a high-throughput, high-ratio lossless compression algorithm, but it lacks effective support for multi-byte data and cross-platform compatibility.
To address this issue, we propose an adaptive data mapping (ADM) scheme, which maps multi-byte integer data into single-byte space based on the data's characteristics, improving the compression ratio of ANS while maintaining low encoding redundancy. We also optimize the ADM algorithm and the ANS encoder for GPU and CPU architectures, respectively, and combine them to create an efficient and portable ANS encoding method for multi-byte integer data, called MANS.
Experimental results show that MANS improves compression ratios by an average of 1.24×, achieves 870.27 MB/s throughput on CPUs, and delivers up to 288.45× and 135.86× speedups on an NVIDIA A100 and an AMD MI210 GPU compared to the CPU version—demonstrating its efficiency and portability.
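To make the idea of mapping multi-byte integers into single-byte space concrete, here is a generic stand-in (not the paper's ADM scheme): use a dictionary when the value alphabet is small, otherwise split values into byte planes. The choice of heuristic is purely illustrative.

    import numpy as np

    def map_to_bytes(values_u16):
        """Illustrative mapping of uint16 data into byte-valued streams for a byte-oriented encoder."""
        uniq = np.unique(values_u16)
        if uniq.size <= 256:
            lookup = {v: i for i, v in enumerate(uniq.tolist())}
            mapped = np.array([lookup[v] for v in values_u16.tolist()], dtype=np.uint8)
            return "dictionary", mapped, uniq
        low = (values_u16 & 0xFF).astype(np.uint8)
        high = (values_u16 >> 8).astype(np.uint8)
        return "byte-planes", np.concatenate([low, high]), None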
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionWe present a massively parallel Bayesian inference framework for GPU supercomputers, demonstrated in coseismic fault slip estimation. Bayesian inference, a robust method for inverse analysis, often relies on Monte Carlo sampling with over 100,000 forward simulations, making large-scale applications computationally intensive. A previous state-of-the-art implementation for the CPU-based supercomputer Fugaku was unsuitable for GPUs due to numerous small, imbalanced computations. We redesigned the algorithm to enforce uniform, dense computation and employed Multi-Process Service (MPS) to maximize GPU utilization. On a single node of the GPU-based supercomputer Miyabi with an NVIDIA GH200 Grace Hopper Superchip, the method achieved 13.40 TFLOPS (20% of Tensor Cores FP64 peak) and scaled to 128 nodes with 92.3% efficiency. Compared with the original CPU implementation on Fugaku, it achieved a 42.1-fold speedup per node and reduced energy-to-solution to 18.8%. The methodology provides a general guide for porting Bayesian inference and similar applications to GPU-based environments.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionIn electronic design automation (EDA), traditional rasterization algorithms suffer from poor speedup and accuracy when managing large and complex semiconductor designs, limiting efficiency in optical proximity correction (OPC) processes. To overcome these challenges, we developed a GPU-based rasterization algorithm that employs floating-point precision and tile-based, warp-cooperative strategies. This approach significantly boosts performance, achieving up to 290x speedup for Manhattan shapes and 45x for curvilinear shapes over conventional CPU methods, while maintaining errors below 1% against CPU results. Our solution enhances both computational efficiency and geometric accuracy in nanometer-scale tasks. During the poster session, we will present our methodology, showcase performance results, and illustrate how advanced GPU optimization effectively addresses the limitations of traditional rasterization workflows in EDA.
Tutorial
Livestreamed
Recorded
TUT
DescriptionThis tutorial is designed for both users and facilitators who want to deepen their understanding of modeling AI pipelines in a portable, reproducible way using scientific workflows and application containers. Scientific workflows are essential for managing complex computations: they define the dependencies between steps in data analysis and simulation pipelines, automate execution, and capture provenance information critical for verifying results and ensuring reproducibility. Workflows also promote sharing and reuse. Participants will learn to use Pegasus, a leading scientific workflow management system now integrated into the ACCESS Support offerings (https://support.access-ci.org/pegasus). ACCESS Pegasus provides a fully hosted environment built on Open OnDemand and Jupyter, enabling users to develop and run workflows directly from a web browser. Workflow execution is powered by HTCondor Annex, allowing jobs to run across multiple ACCESS resources, including PSC Bridges-2, SDSC Expanse, Purdue Anvil, NCSA Delta, and IU Jetstream2. Through hands-on exercises in a hosted Jupyter Notebook, participants will work through an example LLM-RAG (large language model retrieval-augmented generation) workflow that leverages GPUs across ACCESS resources. Along the way, the tutorial will address key challenges and best practices across the entire workflow life cycle.
Tutorial
Livestreamed
Recorded
TUT
DescriptionOpenMP is the leading, portable, and widely supported directive-based programming model. Already in 2008, OpenMP introduced tasking to support the creation of composable parallel software blocks and the parallelization of irregular algorithms. Developers usually find OpenMP easy to learn. However, mastering the tasking concept requires a change in the way developers reason about the structure of their code and how they expose its parallelism. Our tutorial has been designed for the SC audience to learn about the tasking concept in detail and to understand code patterns as solutions to many common problems. Throughout all topics, we showcase the additions brought with OpenMP 5.x and OpenMP 6.0 and explain how to adopt codes. For this tutorial, we assume attendees understand basic parallelization concepts and know the fundamentals of OpenMP. First, we introduce the OpenMP tasking language features in detail and then focus on performance aspects, such as introducing cutoff mechanisms, exploiting task dependencies, and preserving locality. The new free-agent tasks introduced with OpenMP 6.0 are covered in detail. All topics are accompanied by extensive case studies. If accepted as a full-day tutorial, we will include hands-on sessions.
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionJoin Jeff Berger, an expert with over 35 years in product design, applications, and product management specializing in rubber hose and fittings, as he shares critical best practices for thermal cooling hose and tubing assemblies in data centers. This is perfect for engineers, IT administrators, and decision-makers who want to learn from Parker Hannifin’s extensive industry leadership and innovation in fluid conveyance solutions, guiding them through the complexities of material selection, component integration, and vendor evaluation.
This concise yet impactful session will explain how selecting the right hose materials and fittings can significantly enhance thermal management efficiency, reduce downtime, and extend equipment lifespan. Attendees will gain actionable insights on balancing performance, durability, and cost-effectiveness in hose assembly design tailored for demanding data center conditions.
Leveraging Parker Hannifin’s proven expertise and global presence, Jeff will provide real-world examples and technical considerations that empower your team to optimize cooling infrastructure reliability. Whether you’re a CIO, engineer, or policy maker focused on sustainability and operational excellence, this session offers valuable knowledge to support critical infrastructure decisions in today’s data-driven world.
Paper
Algorithms
Livestreamed
Recorded
TP
DescriptionScientific computing remains misaligned with the execution paradigm of modern AI accelerators, which favor structured, low-precision matrix operations. Quantum chemistry exemplifies this gap, with irregular computations, fragmented utilization, and limited support for high-complexity systems.
We present Mako, a matrix-centric system that rearchitects quantum chemistry to scale on AI accelerators. Mako comprises three components: KernelMako reformulates ERI evaluation into composable MatMul pipelines using CUTLASS; QuantMako introduces physics-informed quantization to exploit low-precision potential; and CompilerMako automates kernel fusion and architecture-tuned specialization.
Mako achieves up to ~20× speedup on high-angular-momentum basis sets. It sustains over 90% parallel efficiency on a single node and 70% across 64 GPUs, reducing an accurate simulation of ubiquitin (1,231 atoms, def2-TZVP) from days to just 58 minutes. Mako demonstrates how scientific workloads can be restructured to inherit the scalability of deep learning—repurposing AI accelerators and their ecosystems to scale quantum chemistry beyond traditional limits.
Paper
HPC for Machine Learning
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionLarge language models (LLMs) are becoming ubiquitous across industries, where applications demand diverse user intents. To meet those intents, developers must manually explore combinations of parallelism and compression techniques that affect resource usage, latency, cost, and accuracy. Prior works automate this process but incur high profiling costs, inefficient GPU use, or ignore diverse user-intents. We build MaverIQ, an automated intent-based LLM inference serving system that translates user-expressed intents into LLM deployment configurations and deploys the chosen configurations to improve operational cost for the provider. To reduce profiling costs, MaverIQ introduces and observes LLM fingerprint—a compact proxy of the LLM—under a few configurations, and uses novel analytical models to extrapolate the observed fingerprint data to the full LLM. To cut provider costs, we exploit our key observation that uneven LLM layer distribution minimally affects inference latency. MaverIQ cuts profiling costs by 7-15× and provider costs by 3.8-8.3× while best fulfilling user-intents.
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionIn today’s technology landscape, high-density electronic systems demand efficient, reliable cooling solutions that don’t compromise on space or energy consumption. The ebm-papst AxiEco 200 fan is engineered to meet these challenges head-on, delivering exceptional airflow performance and energy efficiency in a compact design. This session will explore how the AxiEco 200 optimizes cooling for high-density environments (like a rear door heat exchanger), reducing operational costs while enhancing system reliability. Attendees will gain insights into the fan’s innovative features, real-world application benefits, and how it supports sustainable, high-performance cooling in demanding scenarios.
Birds of a Feather
Community Meetings
Livestreamed
Recorded
TP
XO/EX
DescriptionCome and learn from the leaders of the professional societies focused on HPC from ACM, IEEE, and SIAM! Your SIGHPC, TCPP, TCHPC, SIAG-SC, and SIAG-CSE representatives invite SC25 participants to join this cross-society BoF to learn about the opportunities these societies provide. Each organization recognizes outstanding achievements in HPC with society awards, offers travel grants to students and early-career professionals, supports initiatives focused on education and outreach, and promotes diversity, equity, and inclusion. These representatives are also seeking feedback from the community to help improve their initiatives and to learn from each other.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe increasing disparity between computing speed and memory speed, commonly referred to as the memory wall, remains a critical and enduring challenge in the high performance computing and analytics community. This workshop aims to bring together computer science and computational science researchers, from industry, government labs, and academia, concerned with the challenges of efficiently using existing and emerging memory systems. The term "performance" for memory systems is broad, encompassing latency, bandwidth, power consumption, and reliability, from the underlying hardware memory technologies to how they manifest in application performance.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionWe investigate matrix product states (MPS), a tensor-network compression method, as a memory-efficient representation of flow variables. A three-dimensional incompressible Navier-Stokes solver is implemented entirely in MPS form and is applied to canonical flow problems. Results show substantial memory savings and the ability to perform a 1024³ simulation on a single GPU. Performance analysis revealed new bottlenecks, particularly bond-dimension growth during nonlinear operations, suggesting novel optimization strategies are needed to fully realize MPS-based CFD at extreme scales.
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionThis presentation showcases work from the Viva Bem Hub on using wearables for health and well-being, with a focus on mental health. We explore how AI can leverage physiological data to monitor and support an individual's mental state.
Paper
Applications
Architectures & Networks
BSP
Livestreamed
Recorded
TP
DescriptionPersistent memory (PMem) brings new design considerations in realizing high-performance and scalable hashing indexes. We uncover that existing hashing indexes for PMem still suffer from traffic amplification and memory inefficiency. We present MetoHash, a memory-efficient and traffic-optimized hashing index on hybrid PMem-DRAM memories. MetoHash proposes a three-layer index structure spanning CPU caches, DRAM, and PMem for data management. It aggregates the incoming key-value items in CPU caches for fast inserts, which are then arranged in DRAM and flushed to PMem, to eliminate traffic amplification. MetoHash also uses fingerprinting to reduce unnecessary probes over PMem and removes duplicate items during bucket relocations. We implement MetoHash on PMem with persistent and volatile CPU caches, and show that compared to state-of-the-art hashing indexes for PMem, MetoHash improves the throughput by 86.1%–257.6% under various workloads.
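As a generic illustration of why fingerprinting cuts probe traffic (not MetoHash's actual bucket layout or persistence logic), the toy Python sketch below keeps a 1-byte fingerprint per slot in fast memory and only reads the slow tier when a fingerprint matches.

```python
# Toy illustration of fingerprint-filtered probing: a 1-byte fingerprint per
# slot lives in fast memory; the full key-value pair is only read from the
# slow tier when the fingerprint matches. All names here are hypothetical.

import hashlib

BUCKET_SLOTS = 8

def fp(key: str) -> int:
    """1-byte fingerprint derived from the key."""
    return hashlib.blake2b(key.encode(), digest_size=1).digest()[0]

class Bucket:
    def __init__(self):
        self.fingerprints = [None] * BUCKET_SLOTS   # resident in fast memory
        self.slots = [None] * BUCKET_SLOTS          # stands in for the slow tier

    def insert(self, key, value):
        for i in range(BUCKET_SLOTS):
            if self.slots[i] is None:
                self.fingerprints[i] = fp(key)
                self.slots[i] = (key, value)        # write to the slow tier
                return True
        return False                                # bucket full: relocate

    def lookup(self, key, stats):
        f = fp(key)
        for i in range(BUCKET_SLOTS):
            if self.fingerprints[i] != f:
                continue                            # skipped: no slow-tier read
            stats["slow_reads"] += 1
            if self.slots[i] and self.slots[i][0] == key:
                return self.slots[i][1]
        return None

if __name__ == "__main__":
    b, stats = Bucket(), {"slow_reads": 0}
    for k in ("alpha", "beta", "gamma"):
        b.insert(k, k.upper())
    # Typically one slow-tier read instead of probing all occupied slots.
    print(b.lookup("beta", stats), stats)
```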
SCinet
Not Livestreamed
Not Recorded
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe semiconductor and high-performance computing (HPC) sectors face a parallel and urgent challenge: a widening skills gap amid rapid technological evolution, particularly with the rise of open hardware platforms such as RISC-V. As global initiatives like the EU Chips Act and national sovereignty strategies emphasize workforce development, traditional academic programs remain too rigid and slow to adapt. This paper presents a microcredential-based strategy for agile, modular, and industry-validated training focused on open hardware and full-stack HPC system design. Through the lens of Openchip’s approach, we explore how co-designed curricula—especially around vector-based RISC-V architectures—can modernize education to reflect emerging HPC paradigms. We also examine curricular gaps, the potential alignment with the TCPP curriculum initiative, and propose a roadmap for embedding microcredentials into scalable, open, and sovereign HPC education ecosystems.
Paper
Applications
Livestreamed
Recorded
TP
DescriptionUnderstanding the brain remains a major scientific challenge due to its complex structure and function. Unlike artificial neural networks, the biological brain features diverse biophysical properties essential to its function. Building an accurate digital replica using extensive anatomical and physiological data, which are available in standardized public databases, has emerged as a promising approach. In this study, a lightweight biophysical neuron simulator was developed and optimized for the supercomputer Fugaku. Good strong scaling was demonstrated in a benchmark model up to 152,064 compute nodes with 7.13 petaflops performance. In a more realistic scenario, the whole cerebral cortex of a mouse, consisting of 9 million biophysical neurons and 26 billion synapses, was simulated on the full-scale Fugaku with 145,728 nodes. These results suggest that present high-performance computing technology is ready to support the construction of a digital replica of the whole mammalian brain.
Paper
Applications
GBC
Livestreamed
Recorded
TP
DescriptionOver the past decades, first-principles real-time time-dependent density functional theory (RT-TDDFT) simulations have been limited to systems with only thousands of atoms. We propose a novel method based on the discontinuous Galerkin adaptive local basis, significantly reducing global communication in RT-TDDFT. We further introduce a tensor compression technique that leverages basis locality to avoid repeated evaluation of multi-center integrals in hybrid functionals, greatly reducing computational cost. To overcome the projection bottleneck in our basis sets, we design a fused GEMM-Reduce operation that achieves several times higher floating-point efficiency than a standard BLAS combination. Our implementation reaches 34.8% of theoretical peak performance on 524,288 CGs of the New Sunway supercomputer and simulates electronic dynamics of systems with over one million atoms for both local/semi-local and hybrid functionals. This work improves computational scale by two orders of magnitude, opening new possibilities for exploring ultrafast dynamics in large-scale materials and nanophotonic devices.
Paper
Architectures & Networks
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionHigh performance computing (HPC) systems are essential for scientific discovery and engineering innovation. However, their growing power demands pose significant challenges, particularly as systems scale to the exascale level. Prior uncore frequency tuning studies primarily focused on conventional HPC workloads on homogeneous systems. As HPC advances toward heterogeneous computing, integrating diverse GPU workloads on heterogeneous systems, it is crucial to revisit and enhance uncore scaling. Our investigation reveals that uncore frequency decreases only when CPU power approaches its TDP (thermal design power)—an uncommon scenario in GPU-dominant applications—resulting in power waste. To address this, we present MAGUS, a user-transparent uncore scaling runtime for heterogeneous computing. Effective uncore tuning is complex, requiring dynamic detection of application execution phases that affect uncore utilization. Moreover, an efficient runtime should introduce minimal overhead. MAGUS employs key techniques such as memory throughput monitoring and prediction, and handling frequent phase transitions to tackle these challenges.
Paper
Applications
Livestreamed
Recorded
TP
DescriptionThe Atomic Kinetic Monte Carlo (AKMC) method provides insights into the macroscopic behavior of materials through atomistic-level simulations and finds broad applications in materials science innovation. Improving simulation scale and performance remains a consistent focus in the development of parallel AKMC software. We port the AKMC software to GPU clusters. To alleviate the memory pressure in large-scale complex system simulations, we redesign the data layout and propose the lattice data compression and vacancy data decompression algorithms. Additionally, we propose a multi-level pipeline scheme combined with an on-demand communication forwarding and merging strategy to reduce data transfer and communication overhead. Compared to state-of-the-art KMC software, MISA-AKMC achieves a 10.41-fold improvement in computational throughput and a 52.07-fold expansion in simulation scale. We implement the first true micrometer-scale AKMC simulation involving 20 quadrillion atoms on GPU clusters. MISA-AKMC achieves 96.03% parallel efficiency in weak scaling and 85.29% in strong scaling on 16,000 GPUs.
Birds of a Feather
State of the Practice
Livestreamed
Recorded
TP
XO/EX
DescriptionOver a decade ago, Arm was known for co-design, innovative architectures, and disruptive change in an x86-dominated software landscape. Today, Arm-based supercomputers operate globally, supporting advanced research across diverse fields. Has Arm really become “boring”? While accelerated computing is now essential, CPU-only systems remain important. The future of Arm in HPC raises questions: What new breakthroughs can Arm technologies enable? Is Arm still exciting for technologists and developers? This BoF gathers leaders who have helped advance the Arm HPC ecosystem to address what else is yet to be done to achieve the next maturity level.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionThe increasing volume of high-resolution LiDAR data poses a significant I/O bottleneck in large-scale analysis and high-performance computing pipelines due to costly intermediary data storage and retrieval. We introduce a novel, end-to-end framework that addresses this issue by proposing the first unified RENO-based neural autoencoder with a Point Transformer v3 (PTV3) segmentation backbone. This integrated architecture directly feeds the high rank feature tensors of the RENO decoder into the segmentation backbone, completely bypassing the need for costly intermediary file storage and I/O operations. Evaluated on the German Outdoor and Offroad (GOOSE) dataset, this approach enables direct semantic analysis on compressed data. Our results demonstrate that this method significantly reduces storage overhead, saving 29.9 GB per 13,076 point clouds and 2.7 GB per minute of LiDAR operation, all while maintaining the accuracy of semantic segmentation. This unified framework represents a major step towards efficient, real-time processing of large-scale point cloud datasets.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionThere is a growing need for workloads that don’t follow a traditional HPC workflow. Many of these workloads are developed with Kubernetes as the workload manager rather than an HPC-focused one such as Slurm. Mixing different workloads presents a challenge for a few reasons: the demand for either type of resource may fluctuate, so statically assigning Kubernetes or Slurm as the WLM may leave resources idle; and demand for one WLM or the other may grow, so extra resources will need to be assigned and moved.
To address this demand, we utilized OpenCHAMI, an open-source system management platform for deploying, managing, and scaling HPC clusters. With OpenCHAMI, we created “spread”: a command line tool that configures nodes’ workload environments across the cluster. We support fast node booting using kexec and a dynamic base of workload environments to swap between, including Slurm and Kubernetes.
Birds of a Feather
Algorithms
Livestreamed
Recorded
TP
XO/EX
DescriptionWhat if we have been oversolving in computational science and engineering for decades? Are low-precision arithmetic formats only for AI workloads? How can HPC applications exploit mixed-precision hardware features? This BoF invites the HPC community at large interested in applying mixed precision in their workflows and discussing the impact on time-to-solution, memory footprint, storage, data motion, and energy consumption. Experts from scientific applications/software libraries/hardware architectures will briefly provide the context on this trendy topic, share their own perspectives, and mostly engage with the audience via a set of questions, while gathering feedback to define a roadmap moving forward.
Workshop
Partially Livestreamed
Partially Recorded
TP
W
DescriptionAs simulations become more realistic, the pursuit of higher accuracy results in extended computation times and substantial energy consumption. This study explores mixed-precision computing as a promising strategy to address these challenges, leveraging computer arithmetic tools to optimize performance. To do so, we used the Reactor Simulator and LULESH benchmarks as case studies to evaluate the potential of mixed-precision strategies to reduce both time-to-solution and energy-to-solution. For Reactor Simulator, we achieved more than a 30% reduction in both metrics without compromising accuracy. Similarly, results for LULESH demonstrated improvements of up to 31.5% in time-to-solution and 25.6% savings in energy-to-solution.
Workshop
Performance Evaluation, Scalability, & Portability
Livestreamed
Recorded
TP
W
DescriptionThe hardware diversity in leadership-class computing facilities, alongside the immense performance boosts from today's GPUs when computing in lower precision, incentivizes scientific HPC workflows to adopt mixed-precision algorithms and performance portability models. We present an on-the-fly framework using hipify for performance portability and apply it to FFTMatvec - an HPC application that computes matrix-vector products with block-triangular Toeplitz matrices. Our approach enables FFTMatvec, initially a CUDA-only application, to run seamlessly on AMD GPUs with excellent performance. Performance optimizations for AMD GPUs are integrated into the open-source rocBLAS library, keeping the application code unchanged. We then present a dynamic mixed-precision framework for FFTMatvec; a Pareto front analysis determines the optimal mixed-precision configuration for a desired error tolerance. Results are shown for AMD Instinct MI250X, MI300X, and the newly launched MI355X GPUs. The performance-portable, mixed-precision FFTMatvec is scaled to 4,096 GPUs on the OLCF Frontier supercomputer.
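The Pareto front analysis can be illustrated generically: given (error, runtime) pairs for candidate precision configurations, keep the non-dominated ones and pick the fastest within the error tolerance. The Python sketch below uses invented configuration names and numbers, not FFTMatvec measurements.

```python
# Generic Pareto-front selection over (error, runtime) pairs; the configuration
# names and numbers below are illustrative, not FFTMatvec results.

def pareto_front(configs):
    """Keep configurations that are not dominated in both error and runtime."""
    front = []
    for name, err, t in configs:
        dominated = any(e2 <= err and t2 <= t and (e2 < err or t2 < t)
                        for _, e2, t2 in configs)
        if not dominated:
            front.append((name, err, t))
    return sorted(front, key=lambda c: c[1])

def pick(front, error_tolerance):
    """Fastest Pareto-optimal configuration within the error tolerance."""
    ok = [c for c in front if c[1] <= error_tolerance]
    return min(ok, key=lambda c: c[2]) if ok else None

if __name__ == "__main__":
    candidates = [
        ("all-fp64",        1e-15, 10.0),
        ("fp32-matvec",     1e-7,   4.2),
        ("fp32-everything", 1e-5,   5.0),   # dominated by fp32-matvec
        ("fp16-fft",        1e-4,   2.9),
        ("fp16-everything", 1e-2,   2.5),
    ]
    front = pareto_front(candidates)
    print(front)
    print(pick(front, error_tolerance=1e-6))  # -> ("fp32-matvec", 1e-07, 4.2)
```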
Paper
HPC for Machine Learning
Livestreamed
Recorded
TP
DescriptionLarge language models (LLMs) have been rapidly adopted across all domains, supporting divergent use cases with remarkable accuracy. However, training these massive models requires scaling across multiple GPUs. Given the expensive and limited GPU resources, advanced redundancy elimination and parallelization techniques are employed to maximize training throughput. Furthermore, to run LLMs larger than the aggregated memory of multiple GPUs, host memory or disk offloading techniques are leveraged. Despite advanced asynchronous multi-tier read/write strategies, such offloading strategies result in significant I/O overheads in the critical path of training. To this end, we propose MLP-Offload, a novel multi-level, multi-path offloading engine specifically designed for optimizing LLM training on resource-constrained setups by mitigating I/O bottlenecks. We design and implement MLP-Offload to offload the optimizer states across multiple tiers in a cache-efficient and concurrency-controlled fashion to mitigate I/O bottlenecks. Evaluations on models up to 280B parameters show that MLP-Offload achieves 2.5x faster iterations.
Paper
Performance Measurement, Modeling, & Tools
Livestreamed
Recorded
TP
DescriptionADMM-FFT is an iterative method with high reconstruction accuracy for laminography but suffers from excessive computation time and large memory consumption. We introduce mLR, which employs memoization to replace the time-consuming Fast Fourier Transform (FFT) operations based on the unique observation that similar FFT operations appear in iterations of ADMM-FFT. We introduce a series of techniques to make the application of memoization to ADMM-FFT performance-beneficial and scalable. We also introduce variable offloading to save CPU memory and scale ADMM-FFT across GPUs within and across nodes. Using mLR, we are able to scale ADMM-FFT to an input problem of 2K × 2K × 2K, the largest problem laminography reconstruction has ever handled with the ADMM-FFT solution under limited memory; mLR brings a 52.8% performance improvement on average (up to 65.4%) compared to the original ADMM-FFT.
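The core observation, that similar FFT operations recur across ADMM iterations, can be illustrated with a simple memoization wrapper. The NumPy sketch below caches FFT results keyed by a hash of the rounded operand; mLR's actual similarity detection, eviction, and offloading are considerably more involved.

```python
# Simplified illustration of memoizing repeated FFTs across solver iterations
# (mLR's actual similarity detection and memory management are more involved).

import numpy as np

class MemoizedFFT:
    def __init__(self, decimals=6):
        self.decimals = decimals   # quantize keys so near-identical inputs match
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def _key(self, x):
        # Hash a rounded copy so operands equal up to `decimals` share a result.
        return hash(np.round(x, self.decimals).tobytes())

    def fftn(self, x):
        k = self._key(x)
        if k in self.cache:
            self.hits += 1
            return self.cache[k]
        self.misses += 1
        y = np.fft.fftn(x)
        self.cache[k] = y
        return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f = MemoizedFFT()
    x = rng.standard_normal((64, 64, 64))
    for _ in range(5):        # the same operand recurs across "iterations"
        _ = f.fftn(x)
    print("hits:", f.hits, "misses:", f.misses)  # expected: hits: 4 misses: 1
```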
Workshop
Livestreamed
Recorded
TP
W
DescriptionMulti-wavelength observation of gamma-ray bursts (GRBs) requires real-time interaction among multiple telescopes. A gamma-ray telescope detects and localizes a GRB in the sky and must then communicate with an optical telescope to direct the latter toward the GRB as quickly as possible. We previously developed software for ADAPT, a suborbital gamma-ray telescope, to localize GRBs in real time, on a timescale shorter than that of the GRB itself. This work therefore studies progressive localization, in which ADAPT computes a series of increasingly accurate location estimates during a GRB to enable a partner instrument to more rapidly find it. We describe a modeling and optimization framework to decide when ADAPT should compute estimated GRB locations to minimize the time for the partner to find the GRB. Our framework can design progressive strategies that allow a partner telescope to find a GRB up to 42% faster than strategies using a single alert.
Workshop
Livestreamed
Recorded
TP
W
DescriptionClimate change is a critical concern for HPC systems, but GHG-protocol carbon-emission accounting methodologies are difficult to apply to a single system and effectively infeasible for a collection of systems.
As a result, there is no HPC-wide carbon reporting, and even the largest HPC sites do not report their emissions.
We assess the carbon footprint of HPC, focusing on the Top 500 systems. The key challenge is modeling the carbon footprint with limited data availability.
Using the data disclosed on Top500.org and the EasyC tool, we model the operational carbon of 391 HPC systems and the embodied carbon of 283 systems. We further enhance this coverage with public information and use interpolation to produce the first carbon footprint estimates of the Top 500 HPC systems (1.4 million MT CO2e operational carbon and 1.9 million MT CO2e embodied carbon). We also project how the Top 500's carbon footprint will increase through 2030.
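For readers unfamiliar with carbon accounting, the Python sketch below shows the kind of first-order arithmetic such an estimate involves: operational carbon from power draw, utilization, PUE, and grid intensity, plus amortized embodied carbon. The formulas and every number are illustrative assumptions, not EasyC's model or the paper's data.

```python
# First-order carbon estimate for one system (illustrative only; EasyC's actual
# model and the numbers below are not taken from the paper).

def operational_carbon_mt(power_kw, pue, hours, grid_kgco2e_per_kwh, utilization=1.0):
    """Operational emissions in metric tons of CO2e."""
    energy_kwh = power_kw * utilization * pue * hours
    return energy_kwh * grid_kgco2e_per_kwh / 1000.0   # kg -> metric tons

def embodied_carbon_mt(node_count, kgco2e_per_node):
    """Embodied emissions of the hardware in metric tons of CO2e."""
    return node_count * kgco2e_per_node / 1000.0

if __name__ == "__main__":
    # Hypothetical ~20 MW system with Top500-style power data, over 5 years.
    op = operational_carbon_mt(power_kw=20_000, pue=1.2, hours=5 * 8760,
                               grid_kgco2e_per_kwh=0.4, utilization=0.8)
    emb = embodied_carbon_mt(node_count=9_000, kgco2e_per_node=2_500)
    print(f"operational ~{op:,.0f} t CO2e, embodied ~{emb:,.0f} t CO2e")
```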
Workshop
Livestreamed
Recorded
TP
W
DescriptionMemory bandwidth has become the primary limiting factor of performance in many modern HPC applications, and it limits scalability because the achievable memory bandwidth grows linearly with only a small number of CPU cores. When the number of cores concurrently using the memory system exceeds a threshold, the aggregate memory bandwidth quickly saturates. To estimate the time usage of a computation dominated by memory traffic, the mainstream strategy is to divide the expected total memory traffic volume by the maximum memory bandwidth. However, this implicitly assumes homogeneous memory traffic, which is often not the case, leading to inaccurate time estimates. In this paper, we present a new performance model that specifically targets inhomogeneity in per-core memory traffic. The new model requires only three hardware parameters. Using several cases of uneven per-core memory traffic, we demonstrate its advantage over the mainstream strategy.
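The paper's three-parameter model is not reproduced here, but the underlying idea, bounding time by both the most loaded core and the saturated aggregate bandwidth, can be sketched with a simplified two-parameter stand-in (single-core bandwidth and saturated aggregate bandwidth). All numbers below are made up.

```python
# Sketch of a bandwidth-based time estimate that accounts for uneven per-core
# memory traffic. This is an assumed simplification, not the paper's model:
# a core can draw at most its single-core bandwidth, and all cores together
# at most the saturated aggregate bandwidth.

def mainstream_estimate(traffic_per_core_gb, bw_max_gbs):
    """Classic estimate: total traffic divided by maximum aggregate bandwidth."""
    return sum(traffic_per_core_gb) / bw_max_gbs

def inhomogeneous_estimate(traffic_per_core_gb, bw_single_core_gbs, bw_max_gbs):
    """Time is bounded both by the most loaded core and by the total traffic."""
    per_core_bound = max(traffic_per_core_gb) / bw_single_core_gbs
    aggregate_bound = sum(traffic_per_core_gb) / bw_max_gbs
    return max(per_core_bound, aggregate_bound)

if __name__ == "__main__":
    # 16 cores; one core moves far more data than the rest (made-up numbers).
    traffic = [1.0] * 15 + [20.0]                                   # GB per core
    print(mainstream_estimate(traffic, bw_max_gbs=200.0))           # ~0.175 s
    print(inhomogeneous_estimate(traffic, 15.0, 200.0))             # ~1.33 s
```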
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis moderated discussion will engage all symposium speakers and participants in a comprehensive conversation about the future directions of AI workflows. The session will address both technical and community-driven priorities, fostering new connections and ideas among workshop attendees.
Workshop
Livestreamed
Recorded
TP
W
DescriptionOur invited speakers address this year's charge question, then our audience & panelists will dig deeper in a moderated discussion.
Tutorial
Livestreamed
Recorded
TUT
DescriptionAs more diverse applications move to high performance computing (HPC), the I/O workload has also become more varied. The community has spent three decades improving large-scale parallel file systems, but object stores provide new approaches and novel interfaces. At the same time, traditional parallel I/O approaches and abstractions have evolved under the covers to make use of both novel object stores and classic parallel file systems. In this full-day tutorial we will explore several object storage technologies and discuss both old and new approaches to getting maximum performance from them. With a mix of lectures and hands-on exercises, attendees will learn about object storage design and usage. We will also tackle how the classic I/O software stack has evolved to make use of object stores, as well as provide attendees with the knowledge to know when file systems and object stores are the most appropriate storage approach for their applications. By the end of the day, attendees will learn a bit more about what these new storage systems are doing, as well as how to use libraries and tools to hide the details.
Workshop
Livestreamed
Recorded
TP
W
DescriptionHigh-performance computing (HPC) environments require configuration management systems to support diverse infrastructure and operational needs. At the National Center for Supercomputing Applications (NCSA), we initiated a multi-year transition from Puppet to Ansible to modernize our configuration management across our active HPC clusters. This paper presents the motivations behind the migration, including limitations encountered with Puppet and the advantages of Ansible’s agentless architecture and human-readable YAML-based configuration model. We detail our transition methodology, emphasizing cross-team collaboration, configuration parity, and low operational impact to production systems. Comparative insights highlight key differences in compliance enforcement, inventory visibility, automation workflows, secrets management, and custom module development. Additionally, we share implementation insights regarding community resource gaps, provisioning integration, access constraints, and organizational buy-in. Our experience underscores the importance of deliberate planning and collaborative toolsets in infrastructure modernization.
Workshop
Livestreamed
Recorded
TP
W
DescriptionManaging Python environments on high-performance computing (HPC) systems presents unique challenges due to complex toolchains, file system constraints, and diverse user needs. We present ModuLair, a modular, metadata-driven Python virtual environment framework designed to simplify environment creation, activation, and management in HPC contexts. ModuLair supports both EasyBuild and non-EasyBuild module systems, automatically detecting explicit specification of toolchains to ensure reproducibility and compatibility across workflows. The framework integrates seamlessly with command-line and graphical interfaces, including the improved User Dashboard, Job Composer, and JupyterLab, enabling visual, intuitive environment management for both novice and experienced users.
We validate ModuLair through usage metrics collected over five months across three HPC clusters, demonstrating sustained adoption by active Python users and integration into ongoing research workflows in both CLI and GUI contexts. These results show that ModuLair reduces setup complexity, lowers the barrier to entry, and promotes best practices in environment configuration and job submission.
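ModuLair's on-disk format is not described in this abstract. The Python sketch below is a hypothetical illustration of a metadata-driven environment workflow: create a venv and record the toolchain and modules it was built against, so a later activation step can flag mismatches. File names and fields are assumptions.

```python
# Minimal sketch of a metadata-driven virtual environment (illustrative; the
# file names, fields, and checks are assumptions, not ModuLair's actual format).

import json
import subprocess
import sys
from pathlib import Path

def create_env(root: Path, name: str, toolchain: str, modules: list[str]) -> Path:
    """Create a venv and record the toolchain/modules it was built against."""
    env_dir = root / name
    subprocess.run([sys.executable, "-m", "venv", str(env_dir)], check=True)
    metadata = {
        "name": name,
        "python": sys.version.split()[0],
        "toolchain": toolchain,        # e.g. an EasyBuild toolchain label
        "modules": modules,            # modules loaded when the env was created
    }
    (env_dir / "modulair.json").write_text(json.dumps(metadata, indent=2))
    return env_dir

def check_env(env_dir: Path, loaded_modules: list[str]) -> list[str]:
    """Return modules recorded at creation time that are not currently loaded."""
    meta = json.loads((env_dir / "modulair.json").read_text())
    return [m for m in meta["modules"] if m not in loaded_modules]

if __name__ == "__main__":
    env = create_env(Path("./envs"), "analysis", toolchain="foss-2024a",
                     modules=["GCC/13.2.0", "OpenMPI/4.1.6"])
    print("missing:", check_env(env, loaded_modules=["GCC/13.2.0"]))
```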
Workshop
Livestreamed
Recorded
TP
W
DescriptionHigh-performance applications necessitate rapid and dependable transfer of massive datasets across geographically dispersed locations. Traditional file transfer tools often suffer from resource underutilization and instability due to fixed configurations or monolithic optimization methods. We propose AutoMDT, a novel Modular Data Transfer Architecture, to address these issues by employing a deep reinforcement learning-based agent to simultaneously optimize concurrency levels for read, network, and write operations. This solution incorporates a lightweight network–system simulator, enabling offline training of a Proximal Policy Optimization (PPO) agent in approximately 45 minutes on average, thereby overcoming the impracticality of lengthy online training in production networks. AutoMDT’s modular design decouples I/O and network tasks. This allows the agent to capture complex buffer dynamics precisely and to adapt quickly to changing system and network conditions. Evaluations on production-grade testbeds show that AutoMDT achieves up to 8X faster convergence and 68% reduction in transfer completion times compared to state-of-the-art solutions.
Workshop
Livestreamed
Recorded
TP
W
DescriptionMixture of Experts (MoE) models have enabled the scaling of Large Language Models (LLMs) and Vision Language Models (VLMs) by achieving massive parameter counts while maintaining computational efficiency. However, MoEs introduce several inference-time challenges, including load imbalance across experts and the additional routing computational overhead. To address these challenges and fully harness the benefits of MoE, a systematic evaluation of hardware acceleration techniques is essential. We present MoE-Inference-Bench, a comprehensive study to evaluate MoE performance across diverse scenarios. We analyze the impact of batch size, sequence length, and critical MoE hyperparameters such as FFN dimensions and number of experts on throughput. We evaluate several optimization techniques on Nvidia H100 GPUs, including pruning, Fused MoE operations, speculative decoding, quantization, and various parallelization strategies. Our evaluation includes MoEs from the Mixtral, DeepSeek, OLMoE and Qwen families. The results reveal performance differences across configurations and provide insights for the efficient deployment of MoEs.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe explore the performance and portability of the novel Mojo language for scientific computing workloads on GPUs. As the first language based on LLVM's Multi-Level Intermediate Representation (MLIR) compiler infrastructure, Mojo aims to close performance and productivity gaps by combining Python's interoperability and syntax with CUDA-like compile-time programming. We target four scientific workloads: (i) a seven-point stencil (memory-bound), (ii) BabelStream (memory-bound), (iii) miniBUDE (compute-bound), and (iv) Hartree-Fock (compute-bound with atomic operations), and compare their performance against vendor baselines on NVIDIA H100 and AMD MI300A GPUs. We show that Mojo's performance is competitive with CUDA and HIP for memory-bound kernels, whereas gaps exist on AMD GPUs for atomic operations and for fast-math compute-bound kernels on both AMD and NVIDIA GPUs. Although the learning curve and programming requirements are still fairly low-level, Mojo can close significant gaps in the fragmented Python ecosystem in the convergence of scientific computing and AI.
Best Poster Presentations (Research, ACM SRC Grad/Undergrad)
Research & ACM SRC Posters
TP
DescriptionThis work investigates Mojo, a new MLIR-based language that combines Python-like syntax with portable, low-level GPU programming capabilities. We compare the performance of the Mojo portable GPU kernels against vendor-specific C++ NVIDIA CUDA and AMD HIP implementations on four representative scientific workloads: (1) BabelStream (memory-bound); (2) seven-point stencil (memory-bound); (3) miniBUDE (compute-bound); and (4) Hartree-Fock (compute-bound with atomic operations), evaluated on NVIDIA H100 and AMD MI300A GPUs. Results show that Mojo can match CUDA and HIP performance for memory-bound kernels, though gaps remain for atomic operations and certain compute-bound cases. This poster will present a general overview of the language, our benchmarking methodology, comparative results, the use of vendor profiling tools, and observations on Mojo’s potential to close the gap between high performance and developer productivity in scientific GPU programming. Our contribution is the first systematic evaluation of Mojo for HPC workloads, highlighting both its promise and current limitations.
Workshop
Livestreamed
Recorded
TP
W
DescriptionSmall modular reactors (SMRs) require a smaller physical footprint than conventional large nuclear reactors while still providing high reliability in power generation, and they are frequently discussed in the context of providing power for data centers. Molten chloride reactors represent a new type of SMR in which a mixture of molten chloride and uranium serves as both reactor fuel and coolant. This approach is key to the molten chloride fast reactor technologies being marketed commercially in SMRs for end users like data centers. This work leverages the open-source Multiphysics Object Oriented Simulation Environment (MOOSE) framework to simulate a molten salt SMR with its control rods at 60% open, exploring the steady-state behavior of an SMR operating below maximum capacity to power a data center with a mismatched power rating.
Paper
BSP
HPC for Machine Learning
System Software and Cloud Computing
Partially Livestreamed
Partially Recorded
TP
DescriptionGraph neural networks (GNNs) are widely employed in applications like recommendation systems, social network analysis, and fraud detection, but training large-scale GNNs is challenging due to memory limitations. Existing systems face a trade-off between throughput and monetary cost: distributed systems require expensive memory scaling, while single-machine out-of-core systems are limited by GPU/PCIe throughput. To this end, we propose Moment, a physical communication topology and data placement co-optimizer to enable high-throughput and low-cost GNN training in a single multi-GPU machine. Moment addresses communication contention and GPU load imbalance by modeling the physical topology as capacity-constrained directed graphs and formulating communication scheduling as a max-flow problem. It also introduces a data distribution-aware knapsack algorithm for optimized data placement. Experimental results show that Moment outperforms out-of-core systems by up to 6.51× and distributed systems by up to 3.02×, with only 50% monetary cost.
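The max-flow formulation can be illustrated on a toy topology: model links as capacity-constrained directed edges and ask how much data per unit time can reach the GPUs. The networkx sketch below uses an invented two-switch PCIe topology, not Moment's actual graphs or scheduling algorithm.

```python
# Toy max-flow formulation of feeding training data from host memory to GPUs
# over a capacity-constrained topology (topology and capacities are made up;
# Moment's real graphs model the links of the actual machine).

import networkx as nx

def build_topology():
    g = nx.DiGraph()
    # Host memory -> two PCIe switches -> four GPUs; capacities in GB/s.
    g.add_edge("host", "pcie0", capacity=32)
    g.add_edge("host", "pcie1", capacity=32)
    for sw, gpus in (("pcie0", ["gpu0", "gpu1"]), ("pcie1", ["gpu2", "gpu3"])):
        for gpu in gpus:
            g.add_edge(sw, gpu, capacity=24)
    # Collect all GPUs into one sink to ask for the aggregate feed rate.
    for gpu in ("gpu0", "gpu1", "gpu2", "gpu3"):
        g.add_edge(gpu, "sink")   # no capacity attribute: treated as unbounded
    return g

if __name__ == "__main__":
    g = build_topology()
    rate, flows = nx.maximum_flow(g, "host", "sink")
    print("aggregate feed rate (GB/s):", rate)    # limited by the two 32 GB/s links
    print("per-link schedule from host:", flows["host"])
```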
Early Career
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Workshop
Livestreamed
Recorded
TP
W
Workshop
Partially Livestreamed
Partially Recorded
TP
W
Workshop
Livestreamed
Recorded
TP
W
Workshop
Partially Livestreamed
Partially Recorded
TP
W
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
Workshop
Partially Livestreamed
Partially Recorded
TP
W
Workshop
Performance Evaluation, Scalability, & Portability
Livestreamed
Recorded
TP
W
Birds of a Feather
Standards
Livestreamed
Recorded
TP
XO/EX
DescriptionThe Message Passing Interface (MPI) API is the most dominant programming approach for HPC environments. Its specification is driven by the MPI Forum, an open forum consisting of MPI developers, vendors, and users. This year, the MPI Forum published the latest version of the standard, MPI 5.0. We will take a look at the new features and will discuss what they mean for the users of MPI. We will also discuss ongoing work toward the next version of the MPI standard, with lightning talks from the working groups, and get feedback from the community.
Workshop
Livestreamed
Recorded
TP
W
DescriptionProgrammable smart network devices are heavily used by cloud providers, but typically not for HPC. However, they provide opportunities for off-loading computations, in particular for collective operations, which are important for data intensive workloads in classic HPC and ML training. In this paper, we present a prototype called mpitofino to enable offloading MPI collectives (in particular reductions) onto smart switches over an Ethernet fabric. We target Intel’s programmable Ethernet switches equipped with a Tofino ASIC, and we use the P4 programming language to process collective packets on the chip’s low-latency data path. We demonstrate how the flexibility of P4 enables us to use RoCEv2 as protocol, utilizing RDMA hardware support on the nodes’ NICs. Furthermore, we implement mpitofino as a collective provider in Open MPI and discuss its desirable scaling characteristics. Finally, we demonstrate that mpitofino can achieve data throughput close to the 100 Gbit/s line rate.
Workshop
Livestreamed
Recorded
TP
W
DescriptionAMD’s MI300A integrates CPU and GPU chiplets around a shared HBM3 pool, removing the traditional host-device boundary and changing assumptions in GPU-aware MPI. Despite early deployments, there is little guidance on how mainstream MPI libraries behave on this architecture. This evaluation paper presents a comparative study of MVAPICH-Plus, Open MPI, MPICH, and Cray MPICH on MI300A APU nodes. We measure point-to-point performance on CPU and GPU buffers, reporting intra-node and inter-node latency, unidirectional bandwidth, and bidirectional bandwidth across various message sizes. We then examine collectives, covering reduction-based and data-movement-based operations, and analyze scaling behavior. Finally, we connect microbenchmark trends to application results using OpenFOAM and distributed training of a large language model (LLM) with PyTorch. The study distills practical guidance and highlights opportunities for MI300A-aware optimizations in MPI.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionBig data and deep learning workloads often require handling sensitive data, but security mechanisms in current supercomputers mainly protect against external threats, leaving risks of insider leakage. As a result, supercomputers remain unsuitable for confidential applications. To address this challenge, we propose the first SGX-based parallel computing system with a secure MPI library, MPI-SGX. MPI-SGX enables MPI processes across multiple SGX enclaves to communicate safely through encryption, without requiring code modifications. By combining MPI-SGX with SGX enclaves, our system supports confidential execution of MPI-based parallel applications. Experimental results show that our approach incurs a 6.6x increase in communication latency and a 49% reduction in bandwidth compared to the baseline, but successfully achieves confidentiality. In the poster session, we will present the design of the SGX-based system and MPI-SGX, report detailed experimental findings, and discuss directions for improving performance and expanding the scope of secure HPC.
Birds of a Feather
System Architecture
Livestreamed
Recorded
TP
XO/EX
DescriptionMPICH is a widely used, open-source implementation of the MPI message passing standard. It has been ported to many platforms and used by several vendors and research groups as the basis for their own MPI implementations. This BoF session will provide a forum for core MPICH developers to share new features and release plans with the community. Developers of MPI implementations derived from MPICH will share their own status updates and discuss experiences and issues in using and porting MPICH. Key users will be given an opportunity to present MPICH usage stories. Questions from the audience are welcome.
Workshop
Livestreamed
Recorded
TP
W
DescriptionMPI provides a flexible C-API to communicate data of various types between a set of distributed processes over high-speed interconnects in HPC systems. Data buffers are described using MPI-Datatypes, which specify the type and layout of the data to be transmitted. To construct these datatypes, users must manually describe the memory layout of buffer elements via the MPI-API. However, modern applications are typically written in object-oriented C++, which offers significant advantages over C, including type safety and metaprogramming capabilities. In this work, we introduce a new C++-API and datatype engine that leverage C++ language features such as concepts, ranges, and the upcoming reflection to extract the necessary datatype information for the user at compile-time. This approach simplifies the user’s work, enhances code safety by eliminating manual datatype construction and offers previously unavailable possibilities. Our measurements demonstrate that this interface introduces no performance overhead and, in some cases, even improves performance.
Workshop
Livestreamed
Recorded
TP
W
DescriptionUnderstanding GPU topology is essential for performance-related tasks in HPC or AI. Yet, unlike for CPUs with tools like hwloc, GPU information is hard to come by, incomplete, and vendor-specific.
In this work, we address this gap and present MT4G, an open-source and vendor-agnostic tool that automatically discovers GPU compute and memory topologies and configurations, including cache sizes, bandwidths, and physical layouts.
MT4G combines existing APIs with a suite of over 50 microbenchmarks, applying statistical methods, such as the Kolmogorov-Smirnov test, to automatically and reliably identify otherwise programmatically unavailable topological attributes.
We showcase MT4G's universality on ten different GPUs and demonstrate its impact through integration into three workflows: GPU performance modeling, GPUscout bottleneck analysis, and dynamic resource partitioning.
These scenarios highlight MT4G's role in understanding system performance and characteristics across NVIDIA and AMD GPUs, providing an automated, portable solution for modern HPC and AI systems.
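As an illustration of the statistical approach, the sketch below applies a two-sample Kolmogorov-Smirnov test to decide whether two latency samples differ, the kind of decision used to detect a working set spilling out of a cache level. The samples are synthetic; MT4G derives its inputs from GPU microbenchmarks.

```python
# Sketch of using a two-sample Kolmogorov-Smirnov test to decide whether two
# sets of latency measurements come from different distributions, e.g. a
# working set that still fits in a cache level versus one that spills out.
# The samples here are synthetic, not MT4G microbenchmark output.

import numpy as np
from scipy.stats import ks_2samp

def same_level(latencies_a, latencies_b, alpha=0.01):
    """True if the two latency samples are statistically indistinguishable."""
    stat, p_value = ks_2samp(latencies_a, latencies_b)
    return p_value > alpha

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    in_cache_a = rng.normal(loc=80,  scale=5,  size=500)   # ns, synthetic
    in_cache_b = rng.normal(loc=80,  scale=5,  size=500)
    spilled    = rng.normal(loc=350, scale=30, size=500)
    print(same_level(in_cache_a, in_cache_b))  # expected: True  (same level)
    print(same_level(in_cache_a, spilled))     # expected: False (boundary crossed)
```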
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionNumerical ocean models are essential tools for climate prediction and marine resource studies, requiring high resolution and realistic physical processes. We developed the global ocean model COCO and implemented it on GPUs using an OpenACC directive-based approach, while maintaining compatibility with CPUs. Performance was evaluated on the Miyabi supercomputer, which includes GPU-based (NVIDIA GH200) and CPU-based (Intel Xeon MAX 9480) systems. Realistic ocean experiments with a 0.17° global grid showed that most components achieved faster execution on GPUs, with the tracer calculation accelerated by a factor of 2.9. Roofline analysis revealed that most loops were memory-bound, and GPU speedup was constrained by memory bandwidth rather than compute capability. Future improvements will require increasing arithmetic intensity and applying kernel-level optimizations, while ensuring compatibility between CPU- and GPU-based codes.
Workshop
Livestreamed
Recorded
TP
W
DescriptionHeterogeneous compute nodes containing multiple accelerators and Ethernet network injections have become common in recent years. Despite this, additional network injections beyond the first are often only utilized by application middleware such as MPI or NCCL supporting an RDMA API. We explain why traditional EtherChannel cannot support this use case. We further propose an alternative network configuration that allows these hardware resources to be utilized both by RDMA application middleware such as MPI and by other applications that use the OS-provided sockets API rather than a kernel-bypass API. This allows user applications using less HPC-focused (but potentially more portable) APIs, as well as parallel filesystems and other tools, to also benefit from the additional networking hardware available in this type of compute node.
SCinet Network Research Exhibition
Not Livestreamed
Not Recorded
Descriptionnri106
nri### (from Joe Mambretti/StarLight) nri### (from Harvey Newman/Caltech)
ACM Gordon Bell Finalist
Awards and Award Talks
Applications
GB
Livestreamed
Recorded
TP
DescriptionLight-matter dynamics in topological quantum materials enables ultralow-power, ultrafast devices. A challenge is simulating multiple field and particle equations for light, electrons, and atoms over vast spatiotemporal scales on Exaflop/s computers with increased heterogeneity and low-precision focus. We present a paradigm shift that solves the multiscale/multiphysics/heterogeneity challenge harnessing hardware heterogeneity and low-precision arithmetic. Divide-conquer-recombine algorithms divide the problem into not only spatial but also physical subproblems of small dynamic ranges and minimal mutual information, which are mapped onto best-characteristics-matching hardware units, while metamodel-space algebra minimizes communication and precision requirements. Using 60,000 GPUs of Aurora, DC-MESH (divide-and-conquer Maxwell-Ehrenfest surface hopping) and XS-NNQMD (excited-state neural-network quantum molecular dynamics) modules of MLMD (multiscale light-matter dynamics) software were 152- and 3,780-times faster than the state-of-the-art for 15.4 million-electron and 1.23 trillion-atom PbTiO3 material, achieving 1.87 EFLOP/s for the former. This enabled the first study of light-induced switching of topological superlattices for future ferroelectric "topotronics."
Paper
Algorithms
HPC for Machine Learning
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionMicro-scaling general matrix multiplication (MX-GEMM) uses 8-bit MX-format inputs to accelerate deep learning workloads. While the MX-format supports diverse scaling patterns and granularities, current MX-GEMM implementations are often model-specific. This leads to three main issues: tight coupling between models and kernels, inefficient promotion operations, and neglected quantization overhead.
This paper introduces MXBLAS, a high-performance MX-GEMM library that supports the full range of MX-format variations. MXBLAS overcomes prior limitations with three key innovations: (1) a template-based design enabling flexible promotion patterns within a unified framework; (2) adaptive runtime kernel generation using template matching, guided search pruning, and auto-tuning to find optimal configurations; and (3) a compute-store co-optimization that fuses quantization into the kernel’s epilogue, reducing overhead. Experiments show MXBLAS outperforms existing MX-GEMM libraries by 33% on average, and is the first to fully harness the performance potential of generalized 8-bit computing across all MX-formats.
Workshop
Livestreamed
Recorded
TP
W
DescriptionAs demand for AI literacy and data science education grows, there is a critical need for infrastructure that bridges the gap between research data, computational resources, and educational experiences. To address this gap, we developed a first-of-its-kind Education Hub within the National Data Platform. This hub enables seamless connections between collaborative research workspaces, classroom environments, and data challenge settings. Early use cases demonstrate the effectiveness of the platform in supporting complex and resource-intensive educational activities. Ongoing efforts aim to enhance the user experience and expand adoption by educators and learners alike.
Invited Talk
National Strategies
Livestreamed
Recorded
TP
DescriptionThe National Supercomputing Mission (NSM) is a Government of India initiative with an aim to promote cutting-edge research in science and technology. The main objective of the Mission is to create HPC infrastructure at various academic and research institutions in the country, develop applications for national needs, develop indigenous HPC technologies for self-reliance, and develop human resources to spearhead HPC activities in the nation. The Centre for Development of Advanced Computing (C-DAC) and the Indian Institute of Science (IISc), Bangalore, are the implementation agencies of the Mission.
To date, 22 large and midsize supercomputing systems with a total compute power of 37+ PF have been created under this program. Ten more systems with a total compute power of 60+ PF are being built with indigenous technologies in the next six months, bringing the cumulative compute power capacity to 100 PF under the Mission. To continue the momentum, an NSM 2.0 proposal with exascale compute power is being worked out with a major focus on self-reliance in supercomputing.
Birds of a Feather
State of the Practice
Livestreamed
Recorded
TP
XO/EX
DescriptionWith the increasing demand for AI in HPC, there has been a rapid rise in accelerated architectures, portable programming models, and frameworks. The already-daunting task of programming for accelerated systems has become even more complex. This BoF, organized by IXPUG, will focus on portable programming across a wide range of heterogeneous architectures—including Intel, NVIDIA, AMD, and Arm—supporting diverse simulation, data analytics, and AI workloads. The session will explore key challenges, state-of-the-art solutions, and emerging best practices for programming across these systems, identifying common principles and methodologies that support development and long-term maintenance across sites, architectures, and scientific applications.
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionA short meditation in two parts about how ideas and knowledge shape and order what we find in the sky.
Manu-o-Kū
The first part is based on a narration by Polynesian navigator Nainoa Thompson who describes how stars, clouds, waves, and living beings form an interconnected system of orientation that can be read, felt, heard, and smelled. This celestial knowledge is not a product of the human mind alone but shared with animals such as the seabird Manu-o-Kū, which indicates the proximity of land. Thompson’s Hawaiian voyaging canoe played a central role in the revival of traditional Polynesian non-instrumental navigation techniques in the 1970s. The close entanglement of celestial knowledge and cultural ideas is also reflected in the visuals generated by an artificial neural network that has been trained on millions of images representing contemporary visual culture.
SIMBAD
The second part traces how scientific knowledge is shaped by instruments and human culture. SIMBAD, alluding to another mythical seafarer, is the name of an astronomical database maintained by the Université de Strasbourg. It maps every celestial object described in scientific literature to its corresponding place in the sky. Looking at the composite image of all astronomical references, one is struck by distinct geometrical patterns – rectangles, circles, and other complex shapes appear in the map of all known stars and galaxies, revealing the imprints of instruments, publication formats, and changing cultural interests. Sounds and visuals are generated from 28 million bibliographic references extracted from the database.
Panel
AI, Machine Learning, & Deep Learning
HPC Software & Runtime Systems
Parallel Programming Methods, Models, Languages, & Environments
Livestreamed
Recorded
TP
DescriptionThe future of high performance computing faces a software reckoning. As architectural diversity explodes—from AI accelerators and chiplet-based designs to quantum processors—our programming models, compilers, and system software are under strain. New programming systems seemingly materialize from thin air while stalwarts like Fortran struggle to keep current. Will AI help us manage the increasing complexity? Can open standards and toolchains like LLVM provide stability? How should the HPC community adapt in an era increasingly dominated by “productivity-focused” languages like Python, Rust, and Julia? This panel assembles thought leaders across architectures, languages, AI, and programming models to debate whether today's approaches are sustainable—or whether radical new software infrastructures are needed to keep science moving forward. Technology is only one factor in the solution; business models, training, and stewardship are equally, if not more, important than a specific technology.
Workshop
Livestreamed
Recorded
TP
W
DescriptionShared network testbeds are critical for systems and networking research. However, their shared hardware can introduce variability—like increased jitter or loss—that may impact experiment fidelity or reproducibility.
We present Choir, the first 100 Gbps replay tool designed to run on commodity hardware and shared infrastructures. Choir enables precise replay and measurement to observe how closely a testbed reproduces expected behavior. We also introduce a metric for quantifying consistency, designed to support comparison across time, configurations, and environments.
We evaluate our approach on FABRIC and a local, bare-metal testbed. We show that FABRIC, even with dedicated resources and low background utilization, has greater variability in inter-packet arrival times and latency compared to the local testbed. With high utilization on shared hardware, this variability increases by an order of magnitude. Our findings demonstrate how tools like Choir can help researchers better understand and mitigate the effects of shared infrastructure.
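Choir's consistency metric is not defined in this abstract. As a stand-in, the sketch below summarizes how far observed inter-packet gaps deviate from the intended replay schedule, the kind of quantity such a metric would be built from. All timing values are synthetic.

```python
# Toy consistency measure for replay runs (not Choir's actual metric): compare
# observed inter-packet gaps against the intended schedule and summarize the
# deviation, so runs on different testbeds or at different times can be compared.

import numpy as np

def gap_consistency(send_times_s, intended_gap_s):
    """Return (mean absolute gap error, jitter = std of gap error), in seconds."""
    gaps = np.diff(np.asarray(send_times_s))
    err = gaps - intended_gap_s
    return float(np.mean(np.abs(err))), float(np.std(err))

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    intended = 1e-6                                       # 1 us between packets
    ideal = np.cumsum(np.full(10_000, intended))
    noisy = ideal + rng.normal(0, 200e-9, ideal.shape)    # 200 ns timestamp noise
    print(gap_consistency(ideal, intended))   # ~ (0.0, 0.0)
    print(gap_consistency(noisy, intended))   # grows with testbed variability
```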
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
Workshop
Partially Livestreamed
Partially Recorded
TP
W
DescriptionThe Singular Value Decomposition (SVD) is a foundational building block in many applications, including low rank adaptation (LoRA) for large language models (LLMs). Historically, separate SVD implementations have been designed for each data precision, for each hardware vendor, and for each hardware type (personal computer and HPC). This divergence leads to increased development time, the need to redevelop entire libraries when new architectures or data types emerge, and significant complexity for the end user. In this abstract, we discuss a work in progress to develop an alternative: a unified SVD, enabled by abstraction layers. We demonstrate that state-of-the-art performance across the board can be reached using abstraction frameworks, and investigate the performance engineering process and the characteristics that enable adaptable performance.
Exhibitor Forum
Hardware and Architecture
Livestreamed
Recorded
TP
XO/EX
DescriptionThe Next Vector Project by NEC addresses the growing challenges in high performance computing (HPC), such as energy efficiency, scalability, and accessibility. Building on the proven SX-Aurora TSUBASA vector architecture, the project integrates the open-standard RISC-V instruction set to foster innovation and collaboration. NEC (a Japanese IT company) partners with Openchip & Software Technologies in Spain to co-develop this next-generation processor system. The initiative emphasizes not only advanced hardware but also a robust software ecosystem, including compiler development and user-friendly programming tools. The goal is to make powerful vector computing accessible to a broader range of users, from HPC experts to domain scientists. The presentation outlines current HPC challenges, showcases the benefits of vector computing, and details the technical and collaborative aspects of the Next Vector Project, including its roadmap and specifications.
Workshop
Livestreamed
Recorded
TP
W
DescriptionJoin us for an in-depth look at the IBM Quantum roadmap, where we will share our vision for the future of quantum computing and outline the key developments and milestones that will drive progress towards achieving fault tolerant quantum computing by 2029. This session will provide a comprehensive overview of our technical roadmap, highlighting upcoming advancements in quantum hardware, software, and ecosystem development. By attending this session, attendees will gain a deeper understanding of IBM's quantum strategy and the exciting developments that are on the horizon, and learn how to prepare for the emerging era of quantum advantage.
SCinet Network Research Exhibition
Not Livestreamed
Not Recorded
Invited Talk
Life Sciences
Societal Impact
Livestreamed
Recorded
TP
DescriptionNightingale AI is Europe’s flagship effort to build sovereign, open medical foundation models using secure national health data. Unlike language-only models, medical AI must learn across multimodal data—imaging, biosignals, genomics, and clinical text—demanding innovations in architecture, scaling, and interpretability. Leveraging exascale compute and federated secure data environments, Nightingale AI pioneers an “AI factory” approach that fuses national-scale datasets with immediate healthcare impact. This talk will share an overview of our work to date since launch in March 2025, and how we have partnered from day one with Isambard-AI at an unprecedented scale of compute for academic and medical research teams.
Workshop
Livestreamed
Recorded
TP
W
DescriptionAs HPC systems increasingly support sensitive and federally regulated research, frameworks like the NIST Special Publication (SP) 800 series are becoming essential for compliance and data protection. This lightning talk offers a fast-paced overview of the NIST SP 800-* family, highlighting key standards like 800-53, 800-66, 800-171, 800-223, and 800-234, and what they mean for the HPC community. Attendees will gain a high-level understanding of how these standards influence research requirements, cybersecurity expectations, and institutional responsibilities. Whether you're supporting CUI or HIPAA data, preparing for CMMC, or looking to stay ahead of compliance trends, this session will help you understand the standards that matter most.
Paper
Applications
Livestreamed
Recorded
TP
DescriptionNeural network quantum states (NNQS) offer a powerful variational Monte Carlo (VMC) approach for quantum many-body problems, balancing polynomial scaling with high expressive power. However, scaling NNQS to large chemical systems faces challenges in preserving accuracy with exact energy and managing vast configurations efficiently. In this work, we introduce NNQS-SCI, a high-performance selected configuration interaction (SCI) based NNQS method designed to overcome these limitations. NNQS-SCI employs highly parallelized Slater-Condon rules for fast local energy evaluations, avoiding accuracy loss, while its adaptive SCI engine dynamically manages billions of configurations without space explosion or arbitrary cutoffs that plague other NNQS-CI approaches. Optimized for extreme scalability via multi-level parallelism and memory compression, NNQS-SCI successfully simulates systems up to 152 spin orbitals, tackling Hilbert space dimensions exceeding $10^{14}$ and demonstrating significant advances in scale and efficiency. NNQS-SCI thus provides a robust and scalable path towards high-accuracy quantum chemistry on high performance computing platforms.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionThe increasing complexity of HPC simulations poses several challenges to their reproducibility and reliability. One critical issue is the non-determinism (ND) induced by asynchronous MPI communication. Locating the sources of ND in large codes is difficult. This problem can be addressed by comparing event graphs (graphs mapping MPI communication) across multiple runs of the application, using tools like ANACIN-X [2] to trace the event graphs and network alignment to locate areas of ND. We expand ANACIN-X's point-to-point tracing capabilities by adding collective communication tracing, and propose a novel network alignment algorithm to effectively compare event graphs.
Invited Talk
Energy Efficiency
Power Use Monitoring & Optimization
Livestreamed
Recorded
TP
DescriptionA transformation is underway in the world's energy sector. Tremendous growth in AI, data centers, and electrification is accelerating the need for new and expanded energy sources. Nuclear energy, with unparalleled bipartisan support, is seeing record-breaking private investment in both fission and fusion technologies and global commitments to expanding nuclear energy. Nuclear energy can be a key provider of energy for large-scale computing, and this talk will provide an overview of current activities and challenges in expansion to meet this rising need.
Nuclear energy is also a user of large-scale computing; accelerated deployment timelines are placing a growing importance on high-fidelity simulation to enable parameter exploration, design optimization, and safety analysis for regulatory compliance. The second part of this talk will describe the adoption of high performance computing for nuclear engineering applications, in particular focusing on the NEAMS (Nuclear Energy Advanced Modeling and Simulation) program, codes developed during the ExaSMR project under the Exascale Computing Project (ECP), and research activities at the University of Illinois.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionThis poster describes the open-source massively parallel and portable radiation hydrodynamics code FleCSI-HARD (Hydrodynamic And Radiative Diffusion) used to study radiation hydrodynamics instabilities. FleCSI-HARD is based on FleCSI (Flexible Computational Science Infrastructure) runtime which enables task-based distributed and portable code to be written in single source modern C++. We show good strong and weak scaling on CPUs and GPUs.
Paper
Algorithms
Applications
Architectures & Networks
Livestreamed
Recorded
TP
DescriptionThe computation of select eigenvalues and eigenvectors of large, sparse matrices is fundamental to a wide range of applications. Accordingly, evaluating the numerical performance of emerging alternatives to the IEEE 754 floating-point standard, such as OFP8 (E4M3 and E5M2), bfloat16, and the tapered-precision posit and takum formats, is of significant interest. Among the most widely used methods for this task is the implicitly restarted Arnoldi method, as implemented in ARPACK.
This paper presents a comprehensive and untailored evaluation based on two real-world datasets: the SuiteSparse Matrix Collection, which includes matrices of varying sizes and condition numbers, and the Network Repository, a large collection of graphs from practical applications. The results demonstrate that the tapered-precision posit and takum formats provide improved numerical performance, with takum arithmetic avoiding several weaknesses observed in posits. While bfloat16 performs consistently better than float16, the OFP8 types are generally unsuitable for general-purpose computations.
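For context, a minimal sketch of the baseline computation evaluated here, using SciPy's ARPACK wrapper for the implicitly restarted Arnoldi method; the alternative number formats studied in the paper (posit, takum, OFP8, bfloat16) are not available in standard Python, so this sketch runs only in IEEE single precision, and the test matrix is a stand-in for a SuiteSparse or Network Repository input.

```python
# Minimal sketch: a few extreme eigenpairs of a large sparse matrix via ARPACK.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigs

# sparse 1D Laplacian as a stand-in for a real-world matrix
n = 10_000
A = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")

# six eigenvalues of largest magnitude, implicitly restarted Arnoldi (ARPACK)
vals, vecs = eigs(A.astype(np.float32), k=6, which="LM", tol=1e-6)
print(np.sort(vals.real))
```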
Workshop
Livestreamed
Recorded
TP
W
Descriptions-step Preconditioned Conjugate Gradient (PCG) variants for iteratively solving large sparse linear systems reduce the number of global synchronization points of standard PCG by a factor of O(s). Despite improving scalability on large-scale parallel computers, they have worse numerical properties than standard PCG. Choosing a suitable basis type for the s-step basis matrices is known to substantially improve numerical stability. The first s-step method proposed in the literature was designed to use only the monomial basis. We generalize this method to support arbitrary basis types, denoting our new method as sPCG.
Moreover, we theoretically and experimentally compare all s-step PCG methods. To the best of our knowledge, this is the first comprehensive comparison in the literature. Our theoretical analysis, strong scaling experiments with a synthetic test problem, and runtime experiments with real-world problems confirm that our novel sPCG algorithm achieves higher speedup over standard PCG than existing s-step algorithms.
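As a generic illustration of the basis choice at issue (not the paper's notation): starting from the current search direction p and the preconditioned operator, the s-step basis matrix B = [b_0, ..., b_s] can be built with the monomial recurrence or with a shifted (Newton-type) recurrence, the latter typically being better conditioned for larger s.

```latex
\[
  b_0 = p, \qquad
  \underbrace{b_{j+1} = \tilde{A}\, b_j}_{\text{monomial basis}}
  \quad\text{or}\quad
  \underbrace{b_{j+1} = \bigl(\tilde{A} - \theta_j I\bigr)\, b_j}_{\text{Newton basis with shifts } \theta_j},
  \qquad j = 0, \dots, s-1.
\]
```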
Workshop
Livestreamed
Recorded
TP
W
DescriptionClimate change is altering the state of our atmosphere, leading to more extreme weather events. But studying our changing climate is enormously complicated due to the many interrelated systems and processes and the sheer size of a planetary scale problem. Digital twin technology provides us with a mechanism to tackle both the scale and complexity of climate change. In this talk, we introduce Earth-2, a revolutionary digital twin framework developed by NVIDIA that allows us to safely look at and experiment with future climate events at both global and regional scales.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis presentation introduces NVSHMEM4Py, which provides Python-first host and device APIs that integrate naturally with the Python ecosystem. The library supports array-oriented memory management, collectives, and one-sided communication as on-stream, host-initiated operations, enabling overlap with compute. Additionally, device-side APIs allow fused communication and computation within user-defined kernels. Benchmarks show NVSHMEM4Py achieves native C-level performance while dramatically improving usability, empowering Python developers to build scalable multi-GPU applications without deep C/C++ expertise.
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
DescriptionOsteoarthritis (OA) is a chronic condition that affects over 300 million people globally and is a leading cause of disability, yet predictive models often remain monomodal, static, and opaque to clinicians. This dissertation develops OAAgent, a multimodal large language model (LLM) clinical assistant that integrates medical images (X-ray, MRI), longitudinal clinical variables, and physician notes for personalized, interpretable OA care and prediction of progression. OAAgent employs a fusion transformer for multimodal integration; C-TRAG, a temporal retrieval system with explicit cross-visit semantics that retrieves clinically similar cases (similar past trajectories); reinforcement learning for dynamic decision-making; a Chain-of-Thought reasoning layer; and Extract-and-Abstract clinical note summarization to ensure transparent, patient-specific recommendations. The dissertation addresses five critical gaps at the intersection of OA and AI research:
1. Joint integration of heterogeneous modalities
2. Longitudinal temporal reasoning
3. Clinically interpretable decision support
4. Personalized treatment recommendations
5. Inclusion of underutilized narrative notes
OAAgent’s architecture is designed for extensibility through the Model Context Protocol (MCP), enabling it to interoperate with other domain-specific models, multimodal pipelines, and external reasoning agents. This creates a bridge between the LLM core and diverse analytical components, enhancing adaptability to new modalities and clinical contexts.
Developed in collaboration with the Cleveland Clinic and validated on the OAI dataset, the FNIH cohort, and MIMIC datasets, OAAgent demonstrates improved accuracy, temporal calibration, and interpretability. Anchored in a Trustworthy AI framework, this work advances agentic multimodal AI for healthcare, offering a scalable, ethical, and interoperable pathway toward equitable, explainable clinical decision support across chronic diseases.
Paper
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionThe increasing complexity and scale of high performance computing (HPC) workloads demand innovative approaches to optimize both computation and communication. While OpenMP has been widely adopted for intra-node parallelism and MPI for inter-node communication, emerging SmartNICs introduce new opportunities for offloading communication-intensive tasks.
In this work, we extend OpenMP to support MPI kernel offloading to SmartNICs. Our implementation integrates Open MPI communication offloading into the LLVM compiler while utilizing DOCA SDK for efficient interaction with NVIDIA BlueField DPUs. Leveraging OpenMP eliminates the need for direct low-level programming, lowering the entry barrier for domain scientists.
We demonstrate our framework’s versatility by implementing a SmartNIC-enabled version of the MPI OSU micro-benchmarks and improving the execution time of an atmospheric weather simulation by over 18%, thanks to concurrent computation and communication.
Workshop
Partially Livestreamed
Partially Recorded
TP
W
DescriptionFederated Learning (FL) is critical for edge and High Performance Computing (HPC) where data is not centralized and privacy is crucial. We present OmniFed, a modular framework designed around decoupling and clear separation of concerns for configuration, orchestration, communication, and training logic. Its architecture supports configuration-driven prototyping and code-level override-what-you-need customization. We also support different topologies, mixed communication protocols within a single deployment, and popular training algorithms. It also offers optional privacy mechanisms including Differential Privacy (DP), Homomorphic Encryption (HE), and Secure Aggregation (SA), as well as compression strategies. These capabilities are exposed through well-defined extension points, allowing users to customize topology and orchestration, learning logic, and privacy/compression plugins, all while preserving the integrity of the core system. We evaluate multiple models and algorithms to measure various performance metrics. By unifying topology configuration, mixed-protocol communication, and pluggable modules in one stack, OmniFed streamlines FL experimentation and deployment across heterogeneous environments.
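To make the plugin concepts concrete, here is a minimal NumPy sketch of federated averaging combined with a simple clip-and-noise differential-privacy step; the function names are hypothetical and this is an illustration of the ideas, not OmniFed's API.

```python
# Minimal sketch: DP-protected client updates aggregated by federated averaging.
import numpy as np

def clip_and_noise(update, clip_norm=1.0, noise_mult=0.5, rng=None):
    """Per-client DP step: clip the update's L2 norm, then add Gaussian noise."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    update = update * min(1.0, clip_norm / (norm + 1e-12))
    return update + rng.normal(scale=noise_mult * clip_norm, size=update.shape)

def fed_avg(client_updates, client_sizes):
    """Server step: dataset-size-weighted average of the (privatized) updates."""
    weights = np.asarray(client_sizes, dtype=float)
    weights /= weights.sum()
    return sum(w * u for w, u in zip(weights, client_updates))

# toy round: three clients, a 4-parameter "model"
rng = np.random.default_rng(0)
updates = [rng.normal(size=4) for _ in range(3)]
private = [clip_and_noise(u, rng=rng) for u in updates]
print("aggregated update:", fed_avg(private, client_sizes=[100, 250, 50]))
```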
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe IEEE 754 floating-point standard is the most used representation for real numbers in modern computer systems, despite issues in accuracy for certain applications. The posit format, which has several advantages, has been proposed as a direct drop-in replacement for IEEE floats. Many works compare the use of posits to floats in a wide range of scientific computing domains. However, there has not been any work looking into the compressibility of posit data. In this paper, we compare the compression ratios of different algorithms when the input is encoded in IEEE format and in posit format. We evaluate 5 lossless general-purpose compressors, as well as several new compression algorithms synthesized by our LC framework, on 14 single-precision inputs from the SDRBench suite encoded in float and posit format. Our results show that 4 of the 6 compressors yield an average of 2.59% reduction in compression ratio on posit data.
Workshop
Livestreamed
Recorded
TP
W
DescriptionIntegrating asynchronous MPI messaging with tasking runtimes requires careful handling of request polling and dispatching of associated completions to participating threads. The new C++26 Senders (std::execution) library offers a flexible collection of interfaces and templates for schedulers, algorithms and adaptors to work with asynchronous functions—and it makes explicit the mechanism to transfer execution from one context to another—essential for high performance. We have implemented the major features of the Senders API in the pika tasking runtime and used them to wrap asynchronous MPI calls such that messaging operations become nodes in the execution graph with the same calling semantics as other operations. The API allows us to easily experiment with different methods of message scheduling and dispatching completions. We present insights from our implementation on how application performance is affected by design choices surrounding the placement, scheduling and execution of polling and completion tasks using Senders.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe launch of Eagle, Azure’s hyper-scale supercomputer and Number 3 on the TOP500 list in November 2023, marked a new era in which cloud providers are at the forefront of supercomputing. Despite its rapid expansion, public knowledge of the performance and scalability of cloud-based supercomputing is limited, with numerous misconceptions regarding the performance implications of the virtualization layer of cloud-based systems. To address these gaps, we present a comparative analysis of two cloud-based supercomputers: Azure Eagle, a hyper-scale system ranked Number 3 on the TOP500 in November 2023, and Azure Reindeer, a small-scale system ranked Number 32 on the TOP500 in November 2024.
Using a comprehensive performance analysis, we highlight differences in the computational efficiency and scaling characteristics of these systems in comparison to their bare-metal on-premises counterparts. We furthermore quantify the overhead from Azure's virtualization layer, demonstrating that its performance impact on real-world HPC workloads is less than 4%, with typical values of 2–3%.
Birds of a Feather
State of the Practice
Livestreamed
Recorded
TP
XO/EX
DescriptionEver-increasing compute system heat density and scale is driving liquid cooling solutions in cutting-edge supercomputers and large-scale AI training/inference systems. Ensuring peak performance and reliability hinges on robust commissioning and ongoing commissioning of data centers. This session, tailored for operational managers, facility engineers, liquid cooling vendors, architects, and engineers, looks into this critical process. Gain firsthand insights from real-world case studies, featuring cooling system commissioning at Oak Ridge National Laboratory's Frontier and Lawrence Livermore National Laboratory's El Capitan, alongside Sandia National Laboratories' advanced OCx methodologies. Learn about a community initiative to document a guideline for liquid cooling commissioning.
Workshop
Livestreamed
Recorded
TP
W
DescriptionSubmitting batch jobs on HPC clusters usually requires familiarity with Linux commands and job schedulers, posing a significant learning barrier for beginners. To address this issue, we have developed Open Composer, a web-based application that helps users generate and manage batch jobs on HPC clusters. Open Composer automatically generates shell scripts using web forms defined for each application while also providing a real-time preview of the generated shell scripts and allowing direct editing. This feature helps reduce both the learning curve and the risk of syntax errors while maintaining flexibility in script writing. Open Composer provides a unified interface for job submission and status monitoring, and supports reusable job parameters, dynamic form widgets, and preprocessing steps. By enhancing usability and accessibility, Open Composer aims to make HPC resources more approachable for both novice and experienced users.
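As a rough illustration of form-to-script generation (not Open Composer's code), the sketch below renders hypothetical web-form fields into a Slurm batch script that could be previewed and edited before submission; the form keys and job parameters are made up for the example.

```python
# Minimal sketch: render web-form values into a previewable Slurm batch script.
from string import Template

SBATCH_TEMPLATE = Template("""\
#!/bin/bash
#SBATCH --job-name=$job_name
#SBATCH --nodes=$nodes
#SBATCH --time=$walltime
#SBATCH --partition=$partition

srun $command
""")

def render_job_script(form: dict) -> str:
    """Turn the collected web-form fields into a batch script string."""
    return SBATCH_TEMPLATE.substitute(form)

print(render_job_script({
    "job_name": "lammps_demo",
    "nodes": 2,
    "walltime": "01:00:00",
    "partition": "general",
    "command": "lmp -in in.lj",
}))
```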
Exhibitor Forum
Exascale
Livestreamed
Recorded
TP
XO/EX
DescriptionIT infrastructures face AI and analytics demands, driving the need for storage that leverages existing networks, cuts server counts, and frees CAPEX for AI.
Modeled on the Open Compute Project, the Open Flash Platform (OFP) initiative liberates high-capacity flash through an open architecture built on standard pNFS in every Linux distribution. Each OFP unit contains a DPU-based Linux instance and network port, so it connects directly as a peer—no additional servers.
By removing surplus hardware and proprietary software, OFP lets enterprises use dense flash efficiently, halving TCO and increasing storage density 10×. Early configurations deliver up to 48 PB in 2U and scale to 1 EB per rack, yielding a 10× reduction in rack space, power, and OPEX and a 33% longer service life.
This session explains the vision and engineering that make OFP possible, showing how open, standards-based architecture can simplify and reduce the costs of high-capacity storage.
Birds of a Feather
System Software
Livestreamed
Recorded
TP
XO/EX
DescriptionOpen MPI continues to drive the state of the art in HPC. This year, we've added new features, fixed bugs, improved performance, and collaborated with many across the HPC community. We'll discuss what Open MPI has accomplished over the past year and present a roadmap for the next year.
One of Open MPI's strengths lies in its diversity: we represent many different viewpoints across the HPC ecosystem. To that end, many developers from the community will be present to discuss and answer your questions both during and after the BoF.
Workshop
Livestreamed
Recorded
TP
W
DescriptionOpen OnDemand (openondemand.org) is an innovative, open-source, web-based portal that removes the complexities of research computing system environments from the end-client, and in so doing, reduces “time to science” for researchers by facilitating their access to research computing resources. Through Open OnDemand, research computing clients can upload and download files, create, edit, submit and monitor jobs, create and share apps, run graphical user interface-based applications and connect to a terminal, all via a web browser, with no client software to install and configure. Open OnDemand greatly simplifies access to research computing resources, freeing domain scientists from having to worry about the operating environment and instead focus on their research. It enables computer center staff to support a wide range of clients by simplifying the user interface and experience. The overall impact is that clients can use remote computing resources faster and more efficiently. This presentation will provide an overview of Open OnDemand and detail some of the success stories that have been generated from the global community of over 2,100 research computing centers that utilize it.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe primary topics for this panel include the role of AI in Digital Twins, discussed among the panelists and the audience. An additional goal of the panel is to give workshop attendees a way to raise questions arising from the workshop presentations given earlier, and to bring up their own topics and areas of interest, in a thought-provoking panel-style discussion.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe Open QHPC Software Ecosystem (openQSE) initiative is a community-driven effort aimed at defining a common specification for the emerging Quantum–High Performance Computing (QHPC) software stack. Its goal is to enable interoperability across diverse hardware and software platforms, allowing vendors, national laboratories, and academic institutions to develop components that seamlessly integrate within a unified ecosystem. This talk will provide an overview of the initiative’s objectives and progress, highlighting the inaugural workshop held at Oak Ridge National Laboratory on July 25, 2025. The workshop convened key stakeholders from industry, academia, and national laboratories to discuss shared challenges in quantum–HPC integration and to establish the foundational direction for the openQSE effort.
Birds of a Feather
Architectures & Networks
Livestreamed
Recorded
TP
XO/EX
DescriptionAs demand for architectural innovation accelerates in the post-Moore era of HPC and scientific edge computing, the importance of accessible and scalable design methodologies has never been greater. This BoF will focus on the critical role of open-source hardware tools in advancing research and accelerating chip prototyping. Modern chip development increasingly relies on large-scale computing for simulation, formal verification, power–performance–area analysis, and physical layout. These workflows not only require substantial computational resources but also create opportunities to integrate HPC and AI into design processes in transformative ways. Open-source tools provide a unique opportunity to lower barriers to entry, enable reproducibility, and foster innovation across diverse communities.
The session invites contributions from both hardware and software communities, spanning circuit-level abstractions, design-space exploration, and scalable toolchains. Special invited speakers include Mark Ren (NVIDIA, U.S.), Lilia Zaourar (CEA, Europe), and Toru Niina (RIKEN, Japan), representing global perspectives from industry, national laboratories, and academia. Their talks will highlight international efforts that demonstrate the power of open-source methodologies to accelerate chip design and strengthen collaboration across regions.
To maximize interaction, this BoF will include impromptu pitches from attendees and an open Q&A discussion. This is not just a listening session—it is an opportunity for participants to co-create future directions, exchange ideas, and form new collaborations. We encourage all attendees to be actively involved in shaping the outcomes.
Expected results include actionable insights, new collaborations across traditionally separate communities, and concrete follow-up efforts to advance open, reproducible, and scalable hardware design.
Birds of a Feather
Programming Frameworks
Livestreamed
Recorded
TP
XO/EX
DescriptionThe Open Accelerated Computing Consortium (OpenACC) supports researchers and developers to advance science by nurturing parallel computing skills and offering a directive-based, high-level parallel programming model for CPUs, GPUs, and more. Additionally, OpenACC organizes 25 global hackathons annually and facilitates the acceleration of 200+ applications on platforms such as Frontier, Perlmutter, JUWELS, LUMI, Alps, and Miyabi. This community BoF serves as a forum to openly discuss the status and future of OpenACC. The opening presentation will be led by OpenACC officers, compiler implementers, and invited users, such as the NASA OVERFLOW CFD project, followed by an open audience fishbowl discussion.
Birds of a Feather
Practitioners in HPC
Livestreamed
Recorded
TP
XO/EX
DescriptionThe landscape of HPC cluster provisioning is rapidly evolving, with innovative open-source solutions emerging to meet modern computational demands. This BoF showcases leading open-source provisioning platforms through lightning talks from the Warewulf, Confluent, and OpenCHAMI communities, preceded by an update on the OpenHPC project. The session will highlight recent advances in container-based provisioning, cloud-native HPC management, and security-focused deployment strategies. Community members will engage in interactive discussions about best practices, interoperability challenges, and future collaboration opportunities. This forum aims to strengthen the open-source provisioning ecosystem and foster cross-project innovation for next-generation HPC infrastructure.
Workshop
Livestreamed
Recorded
TP
W
DescriptionAs AI and HPC workloads push the limits of performance, power, and scalability, the industry faces a fundamental architectural shift. The traditional boundaries between compute, memory, and networking are dissolving, giving rise to disaggregated systems—composed from modular chiplets and accelerators that must communicate with near-monolithic efficiency. In this landscape, photonics is emerging not just as an I/O technology, but as the foundation for next-generation system architecture.
This keynote will explore how advances in integrated photonics—including silicon photonic interposers, co-packaged optics, and optical circuit fabrics—are transforming the design of large-scale AI and supercomputing platforms. By delivering orders-of-magnitude improvements in bandwidth density, latency uniformity, and energy per bit, photonic interconnects enable the flexible composition of heterogeneous compute elements across dies, packages, and system racks.
The talk will outline the evolution from electrical domain limits to photonic-domain scalability, highlighting how photonic fabrics can unify on-package, board-level, and cluster-scale communication. It will examine the interplay between photonics, packaging, and network topology, and discuss emerging opportunities in optically reconfigurable architectures for AI model training and HPC workflows. Ultimately, it will argue that photonics is not just a bandwidth solution—but the enabler of a new class of composable, memory-centric, and energy-efficient supercomputing systems.
Workshop
Partially Livestreamed
Partially Recorded
TP
W
Workshop
Livestreamed
Recorded
TP
W
DescriptionCommunication increasingly limits performance in high-performance computing (HPC), yet mainstream compilers focus on computation because communication intent is lost early in compilation. OpenSHMEM offers a one-sided Partitioned Global Address Space (PGAS) model with symmetric memory and explicit synchronization, but lowering to opaque runtime calls hides these semantics from analysis.
We present an OpenSHMEM dialect for Multi-Level Intermediate Representation (MLIR) that preserves one-sided communication, symmetric memory, and team/context structure as first-class intermediate representation (IR) constructs. Retaining these semantics prior to lowering enables precise, correctness-preserving optimizations that are difficult to recover from LLVM IR. The dialect integrates with existing MLIR/LLVM passes while directly representing communication and synchronization intent.
We demonstrate four transformations: recording the number of processing elements, fusing compatible atomics, converting blocking operations to non-blocking forms when safe, and aggregating small messages. These examples show how explicit OpenSHMEM semantics enable communication-aware optimization and lay the groundwork for richer cross-layer analyses.
Birds of a Feather
Programming Frameworks
Livestreamed
Recorded
TP
XO/EX
DescriptionOpenSHMEM is a PGAS API for single-sided asynchronous scalable communications in HPC applications. OpenSHMEM is a community-driven standard for this API across multiple architectures/implementations. This BoF brings together the OpenSHMEM community to present the latest accomplishments since the release of the 1.6 specification, and discuss future directions for the OpenSHMEM community as we develop version 1.7 and beyond. The BoF will consist of talks from end-users, implementers, and middleware and tool developers to discuss their experiences and plans for using OpenSHMEM. We will then open the floor for discussion of the specification and our mid-to-long-term goals.
Birds of a Feather
State of the Practice
Livestreamed
Recorded
TP
XO/EX
DescriptionOperational data analytics (ODA) provides unique opportunities to analyze, understand, and optimize the operations of HPC systems. However, those opportunities are often missed because access to data is restricted, siloed, or misaligned with the people best positioned to act on it. System administrators, researchers, and HPC users could combine their skills to optimize the efficiency of HPC systems if data and expertise were shared among all parties.
How can we bridge these gaps and make more operational data available to more stakeholders in an easily digestible way while still maintaining operational safety, privacy, and legal requirements? What data is relevant to whom?
Invited Talk
Applications
Livestreamed
Recorded
TP
DescriptionForecasting the rise of wearable devices equipped with audio-visual feeds, this talk will present opportunities for research in egocentric video understanding. The talk argues for new ways to view egocentric videos as partial observations of a dynamic 3D world, where objects are out of sight but not out of mind. I’ll review our new data collection and annotation effort HD-EPIC (https://hd-epic.github.io/), which merges video understanding with 3D modeling, showcasing current failures of VLMs in understanding the perspective outside the camera’s field of view—a task trivial for humans.
All project details are at https://hd-epic.github.io/.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionThis work extends nifty-ls, a high-performance Lomb–Scargle periodogram implementation, with multi-term harmonic fitting (multiple Fourier terms) using OpenMP- and GPU-parallelized methods.
We leverage a fast and accurate spreading kernel (the "exponential of semicircle") from the Flatiron Institute Nonuniform Fast Fourier Transform (FINUFFT) to replace the extrapolation method of Press and Rybicki when computing trigonometric sums.
To efficiently process large batches of short or variably-sized time series, we introduce a heterobatch model that uses "kernel fusion." This approach wraps the entire workflow—preprocessing, FINUFFT execution, and postprocessing—into a single, per-series C++ pipeline via nanobind. By operating entirely in C++, the model avoids Python’s Global Interpreter Lock (GIL) and integrates seamlessly with OpenMP. This tight coupling allows OpenMP's dynamic scheduling to treat each time series as an independent unit of work, effectively balancing the computational load.
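As a rough sketch of the underlying idea (not the nifty-ls implementation), the following uses the FINUFFT Python bindings to evaluate the trigonometric sums over a whole frequency grid with a single type-1 NUFFT; a full Lomb–Scargle also needs the unweighted sums at twice each frequency, obtained the same way with unit coefficients. The frequency grid and toy signal are illustrative.

```python
# Minimal sketch: Lomb-Scargle-style trigonometric sums via a type-1 NUFFT.
import numpy as np
import finufft

def trig_sums(t, y, df, K, eps=1e-9):
    """Return C_k = sum_j y_j cos(2*pi*f_k*t_j), S_k = sum_j y_j sin(2*pi*f_k*t_j), f_k = k*df."""
    # Type-1 NUFFT computes F_k = sum_j c_j * exp(i*k*x_j) for modes k in [-N/2, N/2).
    # Mapping x_j = 2*pi*df*t_j makes mode k correspond to frequency k*df.
    x = (2 * np.pi * df * t) % (2 * np.pi)
    N = 2 * K                                      # keep only the non-negative modes
    F = finufft.nufft1d1(x, y.astype(np.complex128), N, eps=eps, isign=1)
    positive = F[N // 2 : N // 2 + K]              # modes k = 0..K-1
    return positive.real, positive.imag

# toy usage: irregularly sampled sinusoid at f = 0.17
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 100, 500))
y = np.sin(2 * np.pi * 0.17 * t) + 0.1 * rng.normal(size=t.size)
C, S = trig_sums(t, y - y.mean(), df=0.002, K=200)
power = C**2 + S**2                                # crude, unnormalized periodogram
print("peak near frequency", 0.002 * np.argmax(power))
```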
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionWe evaluate the current state of collective communication on GPU-based supercomputers for large language model (LLM) training at scale. Existing libraries such as RCCL and Cray-MPICH exhibit critical limitations on systems such as Frontier—Cray-MPICH underutilizes network and compute resources, while RCCL suffers from severe scalability issues. To address these challenges, we introduce PCCL, a communication library with highly optimized implementations of all-gather and reduce-scatter operations tailored for distributed deep learning workloads. PCCL is designed to maximally utilize all available network and compute resources and to scale efficiently to thousands of GPUs. It achieves substantial performance improvements, delivering 6-33x speedups over RCCL and 28-70x over Cray-MPICH for all-gather on 2,048 GCDs of Frontier. These gains translate directly to end-to-end performance: in large-scale GPT-3-style training, PCCL provides up to 60% and 40% speedups over RCCL for 7B and 13B parameter models, respectively.
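For context, a minimal mpi4py sketch of the classic ring all-gather that such libraries build on is shown below; it is a CPU-buffer illustration of the collective pattern, not PCCL's GPU implementation, and the shard size and launch command are illustrative.

```python
# Minimal sketch: ring all-gather over CPU buffers with mpi4py.
# Run with e.g.: mpirun -n 4 python ring_allgather.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

chunk = np.full(4, rank, dtype=np.float32)        # this rank's shard
out = np.empty((size, chunk.size), dtype=np.float32)
out[rank] = chunk

left, right = (rank - 1) % size, (rank + 1) % size
for step in range(size - 1):
    send_idx = (rank - step) % size               # shard forwarded this step
    recv_idx = (rank - step - 1) % size           # shard arriving from the left
    comm.Sendrecv(out[send_idx], dest=right, recvbuf=out[recv_idx], source=left)

assert all((out[r] == r).all() for r in range(size))
if rank == 0:
    print("ring all-gather complete on", size, "ranks")
```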
Paper
Applications
Architectures & Networks
Livestreamed
Recorded
TP
DescriptionWe present ROSfs, a novel user-level file system designed to address critical data query inefficiencies in multi-robot systems (MRS). ROSfs introduces an innovative file organization model where robot data is structured as labeled sub-files, coupled with a time-indexed architecture that enables efficient querying of actively modified data. This design enables real-time cross-robot data acquisition and collaboration capabilities previously unattainable in MRS deployments. Our implementation integrates seamlessly with the Robot Operating System (ROS) and has been extensively evaluated using both physical UAV/UGV platforms and data servers. Experimental results demonstrate that ROSfs achieves a 7x reduction in online data query latency under wireless network conditions compared to conventional ROS storage methods, while simultaneously improving data freshness (Age of Information) by up to 271x. These advancements position ROSfs as a transformative solution for high-performance robotic data management in distributed systems.
Workshop
Livestreamed
Recorded
TP
W
DescriptionAs computing energy demand continues to grow and electrical grid infrastructure struggles to keep pace, an increasing number of data centers are being planned with colocated microgrids that integrate on-site renewable generation and energy storage. However, while existing research has examined the tradeoffs between operational and embodied carbon emissions in the context of renewable energy certificates, there is a lack of tools to assess how the sizing and composition of microgrid components affects long-term sustainability and power reliability.
In this paper, we present a novel optimization framework that extends the computing and energy system co-simulator Vessim with detailed renewable energy generation models. Our framework simulates the interaction between computing workloads, on-site renewable production, and energy storage, capturing both operational and embodied emissions. We use a multi-horizon black-box optimization to explore efficient microgrid compositions and enable operators to make more informed decisions when planning energy systems for data centers.
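A toy sketch of the kind of sizing search the framework automates, written in plain Python (not Vessim's API): it grid-searches solar and battery capacities against a constant data-center load, trading grid draw (operational emissions) against the embodied footprint of the hardware. All constants below are made-up, illustrative assumptions.

```python
# Minimal sketch: grid search over microgrid sizing for a toy data-center load.
import numpy as np

HOURS = 24
load_kw = np.full(HOURS, 800.0)                             # constant toy IT load
solar_profile = np.clip(np.sin(np.linspace(0, np.pi, HOURS)), 0, None)

GRID_KG_PER_KWH = 0.4          # assumed operational carbon intensity
EMBODIED_SOLAR = 50_000.0      # assumed kgCO2e per MW of panels
EMBODIED_BATTERY = 60.0        # assumed kgCO2e per kWh of storage

def daily_emissions(solar_mw, battery_kwh):
    soc, grid_kwh = 0.0, 0.0
    for h in range(HOURS):
        net = load_kw[h] - solar_mw * 1000 * solar_profile[h]
        if net < 0:                                         # surplus -> charge battery
            soc = min(battery_kwh, soc - net)
        else:                                               # deficit -> discharge, then grid
            draw = min(soc, net)
            soc -= draw
            grid_kwh += net - draw
    operational = grid_kwh * GRID_KG_PER_KWH
    embodied = (EMBODIED_SOLAR * solar_mw + EMBODIED_BATTERY * battery_kwh) / 365
    return operational + embodied

best = min(((s, b) for s in np.arange(0, 3.1, 0.5) for b in range(0, 8001, 2000)),
           key=lambda sb: daily_emissions(*sb))
print("lowest-emission sizing (MW solar, kWh battery):", best)
```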
Workshop
short paper
Livestreamed
Recorded
TP
W
DescriptionDynamic programming accelerators (DPAs) are devices designed with an instruction set optimized for dynamic programming (DP) operations. DP is fundamental to solving complex networking problems, particularly those involving fault tolerance and routing under dynamic conditions. This paper explores the use of DPA to accelerate network resilience by implementing an optimal routing algorithm that efficiently identifies alternative paths in response to link failures. The system’s performance is evaluated by comparing DPA implementation against conventional GPU-based and CPU-based solutions. Results show that DPA provides significant performance improvements, enabling faster recovery and improved robustness in dynamic network environments.
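As a CPU-side reference sketch of the dynamic-programming routing computation in question (not the DPA implementation), the code below runs Bellman-Ford-style relaxation and recomputes shortest paths after a link failure to recover an alternative route; the toy topology and weights are illustrative.

```python
# Minimal sketch: DP (Bellman-Ford) shortest paths, recomputed after a link failure.
import math

def bellman_ford(n, edges, src):
    """edges: list of directed (u, v, weight); returns distance and predecessor tables."""
    dist = [math.inf] * n
    pred = [None] * n
    dist[src] = 0.0
    for _ in range(n - 1):                       # the DP relaxation sweeps
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v], pred[v] = dist[u] + w, u
    return dist, pred

edges = [(0, 1, 1.0), (1, 3, 1.0), (0, 2, 2.0), (2, 3, 2.0), (1, 2, 0.5)]
dist, _ = bellman_ford(4, edges, src=0)
print("primary path cost 0->3:", dist[3])         # 2.0 via 0-1-3

failed = {(1, 3)}                                 # simulated link failure
alive = [e for e in edges if (e[0], e[1]) not in failed]
dist, _ = bellman_ford(4, alive, src=0)
print("alternative path cost 0->3:", dist[3])     # 3.5 via 0-1-2-3
```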
Paper
Post-Moore Computing
Quantum Computing
Livestreamed
Recorded
TP
DescriptionModular quantum architectures have emerged as a promising solution for scalable quantum computing systems. Executing circuits in such distributed systems necessitates non-local operations between modules, incurring significant communication overhead. In this work, an optimized quantum circuit mapping technique called DQTetris is proposed to reduce inter-module communications. DQTetris employs a hierarchical framework that first seeks a global communication-free qubit mapping assignment under module capacity constraints. If infeasible, it searches for subcircuits with local communication-free qubit assignments via layer-wise gate pruning. Executing adjacent subcircuits with different qubit assignments incurs inter-module data teleportation. DQTetris minimizes these overheads by reducing qubit reassignment events through optimal circuit segmentation, qubit assignment selection, and adaptive gate teleportation. Experiments show that compared with existing methods, DQTetris can achieve average reductions in communication costs ranging from 28% to 75% across various benchmarks.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionWe investigate an inefficiency in the LLVM OpenMP runtime related to accelerator offloading. The current implementation manages asynchronous GPU tasks by polling async handles, which introduces CPU overhead. We propose replacing this polling model with an event-driven approach that detaches target tasks by default. In our design, each asynchronous task is associated with an event that is fulfilled once the GPU kernel completes, allowing the task to yield execution. This eliminates repeated polling and reduces scheduling overhead. We implemented this mechanism using existing features in the LLVM OpenMP runtime, relying on a host callback function provided by CUDA. Experiments on NVIDIA H100 GPUs show runtime improvements of up to 75% for independent tasks once matrix sizes exceed 128×128, with benefits appearing at even smaller sizes when task dependencies are present. For large kernels, the effect diminishes as execution time dominates.
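A minimal, hardware-free Python sketch of the scheduling difference being proposed, with threading.Event standing in for the device completion callback; the actual work targets the LLVM OpenMP runtime and a CUDA host callback, neither of which appears here.

```python
# Minimal sketch: polling a completion handle vs. waiting on an event.
import threading
import time

def gpu_kernel(done_event):
    time.sleep(0.05)              # pretend the offloaded kernel runs here
    done_event.set()              # "host callback": fulfil the event on completion

def polling_wait(handle):
    spins = 0
    while not handle.is_set():    # the pattern being replaced
        spins += 1                # burns a CPU core while the kernel runs
    return spins

def event_wait(handle):
    handle.wait()                 # task yields; no CPU spent until completion
    return 0

for waiter in (polling_wait, event_wait):
    done = threading.Event()
    threading.Thread(target=gpu_kernel, args=(done,)).start()
    print(waiter.__name__, "spins:", waiter(done))
```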
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionLarge inter-GPU all-reduce operations, prevalent throughout deep learning, are bottlenecked by communication costs. Emerging heterogeneous architectures are comprised of complex nodes, often containing four GPUs and dozens to hundreds of CPU cores per node. Parallel applications are typically accelerated on the available GPUs, using only a single CPU core per GPU while the remaining cores sit idle. This poster presents novel optimizations to large GPU-aware all-reduce operations, extending lane-aware reductions to the GPUs, and notably using multiple CPU cores per GPU to accelerate these operations. These multi-CPU-accelerated GPU-aware lane all-reduces using an intermediate host buffer yield speedup of up to 2.45x for large MPI all-reduces across the NVIDIA A100 GPUs of NCSA's Delta supercomputer. Finally, the approach is extended to GPUDirect RDMA communication, yielding speedup of 1.17x for large all-reduces.
ACM Gordon Bell Climate Modeling Finalist
ACM Gordon Bell Finalist
Awards and Award Talks
Applications
BP
GBC
Livestreamed
Recorded
TP
DescriptionSparse observations and coarse-resolution climate models limit regional decision-making, underscoring the need for robust downscaling. However, existing AI methods struggle with generalization across variables and geographies and are constrained by the quadratic complexity of Vision Transformer (ViT) self-attention. We introduce ORBIT-2, a scalable foundation model for global, high-resolution climate downscaling. ORBIT-2 incorporates two key innovations: (1) Residual Slim ViT (Reslim), a lightweight architecture with residual learning and Bayesian regularization for efficient, robust prediction; and (2) TILES, a tile-wise sequence-scaling algorithm that reduces self-attention complexity from quadratic to linear, enabling long-sequence processing and massive parallelism. ORBIT-2 scales to 10 billion parameters across 32,768 GPUs, achieving up to 1.8 ExaFLOPS sustained throughput and 92%–98% strong scaling efficiency. It supports downscaling to 0.9 km global resolution and processes sequences up to 4.2 billion tokens. At 7 km resolution, ORBIT-2 achieves high accuracy with $R^2$ scores in the range of 0.98–0.99 against observational data.
Tutorial
Livestreamed
Recorded
TUT
DescriptionScientific computing workflows are growing increasingly complex, combining diverse computational patterns, heterogeneous resources, and sophisticated dependencies that challenge traditional orchestration tools. Meanwhile, cloud and AI architectures are driving Kubernetes adoption for these workloads. Deploying workflow components that provide the performance and features required for HPC simulations and applications remains challenging in this environment. This tutorial demonstrates a portability layer to solve this problem—integration of the Flux Framework with Kubernetes to efficiently manage complex scientific workflows on Amazon Web Services (AWS). Participants will learn how Flux’s hierarchical resource management and graph-based scheduling capabilities extend Kubernetes to support diverse workflows. The tutorial progresses from foundational infrastructure concepts to advanced Flux capabilities, culminating in deploying MuMMI (Multiscale Machine-learned Modeling Infrastructure)—a scientific workflow exemplifying emerging complexity through combined large-scale simulations and machine learning. Through lectures and hands-on labs using Amazon EKS, attendees will experience how this architecture supports demanding workflows while maintaining portability across on-premises, cloud, and hybrid environments. Using practical examples, participants will gain applicable skills for orchestrating complex workflows in various computing environments. In the end, attendees will learn how to build efficient, scalable, and flexible environments for complex scientific workflows using Kubernetes, Flux, and cloud infrastructure.
Workshop
Livestreamed
Recorded
TP
W
DescriptionMost quantum computers today are constrained by hardware limitations, particularly the number of available qubits, causing significant challenges for executing large-scale quantum algorithms. Circuit cutting has emerged as a key technique to overcome these limitations by decomposing large quantum circuits into smaller subcircuits that can be executed independently and later reconstructed. Qdislib is a distributed and flexible library for quantum circuit cutting, designed to seamlessly integrate with hybrid quantum-classical HPC systems. Qdislib employs a graph-based representation of quantum circuits to enable efficient partitioning, manipulation and execution, supporting both wire and gate cutting techniques. The library is compatible with multiple quantum programming languages, including Qiskit and Qibo, and leverages distributed computing to execute subcircuits across CPUs, GPUs, and quantum processing units in a fully parallelized manner. The paper describes Qdislib and demonstrates how it enables the distributed execution of quantum circuits across heterogeneous resources, showcasing its potential for scalable quantum-classical workflows.
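To illustrate the graph view that circuit cutting relies on (an illustration only, not Qdislib's API or cutting algorithm), the sketch below models two-qubit gates as nodes connected when they act on a common qubit and uses a minimum edge cut to suggest where a circuit could be split across two devices; the toy circuit is made up.

```python
# Minimal sketch: a gate-adjacency graph and a minimum edge cut between two gates.
import networkx as nx

# toy 4-qubit circuit: each node is a two-qubit gate labelled by the qubits it touches
gates = {"g1": (0, 1), "g2": (1, 2), "g3": (2, 3), "g4": (0, 1), "g5": (2, 3)}

G = nx.Graph()
for name, (a, b) in gates.items():
    G.add_node(name, qubits=(a, b))
for u in gates:
    for v in gates:
        if u < v and set(gates[u]) & set(gates[v]):   # gates sharing a qubit
            G.add_edge(u, v)

# smallest set of connections to sever when splitting the circuit into two parts
cut = nx.minimum_edge_cut(G, "g1", "g3")
print("connections to cut:", cut)
```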
Best Poster Presentations (Research, ACM SRC Grad/Undergrad)
Research & ACM SRC Posters
TP
DescriptionThere is a growing need for the efficient solution of many small eigenvalue problems (up to N = 1500) that arise in emerging scientific applications. These small-to-medium sized problems present unique computational challenges, particularly when thousands or millions of such problems must be solved repeatedly. This work presents Orchid, a novel distributed, heterogeneous, batched eigenvalue solver based on the IRIS runtime. Orchid can utilize all compute platforms in both heterogeneous nodes and clusters by harnessing the capabilities of the IRIS architecture. Orchid leverages heterogeneous architectures across multiple nodes by partitioning the application task DAG intelligently and orchestrates multiple instances of the IRIS runtime via MPI. We evaluate our proposal against two heterogeneous hardware configurations and Frontier, demonstrating Orchid’s performance utilizing both intra-node and inter-node heterogeneity.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis work introduces a novel double-sided streaming methodology that combines control-plane and data-plane streaming. Our goal is to implement the long-advocated separation of concerns in workflow orchestration without introducing artificial boundaries in their execution. Our approach is exemplified by the integration of control-plane streaming provided by dispel4py and the transparent data-plane streaming provided by CAPIO. Our integration eliminates file synchronization barriers without requiring modifications to existing workflow logic. To support this, we extend CAPIO with a new commit rule that allows streaming over dynamically generated file sets, enabling hybrid workflows that blend in-memory dataflows with file-based communication. We validate our approach using a real-world seismic cross-correlation workflow, achieving performance improvements between 23% and 40%. Unlike previous solutions, our method supports streaming across the entire workflow, including phase boundaries where file I/O would typically enforce strict execution ordering. Therefore, our approach can be straightforwardly extended to other multi-stage streaming applications.
Workshop
Overhead Quantification of the Lightweight Distributed Metric Service for High-Performance Computers
2:35pm - 2:40pm CST Monday, 17 November 2025, 276
Partially Livestreamed
Partially Recorded
TP
W
DescriptionThe Lightweight Distributed Metric Service (LDMS) is a monitoring framework that collects high-fidelity, high-volume node-level data on large distributed computer systems. LDMS is built to introduce negligible overhead in application workloads, which has been verified in several scale tests since its inception in 2014. However, new communication strategies, sensor samplers, and fundamental data structures within the core LDMS code have been introduced that could increase the overhead. In this study, we quantify the current overhead that LDMS introduces and verify that it is insignificant. This was done through a variety of benchmarks and applications, where we captured timing and performance statistics while LDMS ran with different configurations.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis NIST Special Publication introduces an HPC security overlay built upon the moderate baseline defined in SP 800-53B. The overlay tailors 60 security controls with supplemental guidance and/or discussions to enhance their applicability in HPC contexts. This overlay aims to provide practical, performance-conscious security guidance that can be readily adopted.
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
DescriptionDeep learning workloads are driving the HPC landscape, from foundation models and surrogate models to emerging agentic workflows. The I/O characteristics of these workloads challenge the assumptions underlying traditional HPC storage systems and libraries: whereas classical modeling/simulation workflows favor large, sequential writes and predictable access patterns, AI workflows demand random access I/O for dataset shuffling and burst bandwidth for model checkpoints, frequently simultaneously. The coupling of simulations with AI models further complicates storage requirements. In this panel, we will examine the challenges of managing data movement and storage for AI workloads, the requirements of I/O systems and how existing ones must evolve, and the open challenges for the field.
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
DescriptionThis work focuses on developing unique performance monitoring capabilities in PAPI to extend support for specialized AI chips, helping researchers identify hardware-specific bottlenecks and improve AI architectures. However, the lack of traditional hardware counters on these specialized devices poses a unique challenge and makes it necessary to develop alternative approaches for performance monitoring. PAPI’s Software Defined Events (SDEs) are one such promising approach, since they provide a portable way to capture and expose software-level metrics from within applications. Our proof-of-concept implementation on traditional processors uses the vendor-agnostic HPL-MxP benchmark instrumented with PAPI SDE to register custom events, such as sde_io_read_bytes and sde_float16, to track memory and network I/O as well as floating-point precision usage throughout the workload. Results showed close agreement between SDE and hardware counts, with a mean ± standard deviation difference of 0.310% ± 5.315%, providing confidence in applying the SDE approach to systems without accessible hardware counters.
Tutorial
Livestreamed
Recorded
TUT
DescriptionThis tutorial provides a comprehensive overview of parallel computing, emphasizing those aspects most relevant to the user. It is suitable for new users, students, managers, and anyone seeking an overview of parallel computing. It discusses software and hardware/software interaction, with an emphasis on standards, portability, and systems that are widely available. The tutorial surveys basic parallel computing concepts, using examples from multiple engineering, scientific, and machine learning problems. These examples illustrate using MPI on distributed memory systems; OpenMP on shared memory systems; MPI+OpenMP on hybrid systems; and CUDA and compiler directives on GPUs and accelerators. The tutorial discusses numerous parallelization and load balancing approaches; performance improvement tools; and an overview of recent developments such as machine learning based on accelerators and parallel versions of Python. The tutorial helps attendees make intelligent decisions by covering the primary options that are available, explaining how the different components work together and what they are most suitable for. Extensive pointers to web-based resources are provided to facilitate follow-up studies.
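As a flavor of the distributed-memory model the tutorial surveys, here is a minimal sketch (an editorial illustration, not part of the tutorial materials) of an MPI reduction expressed with mpi4py; the integration loop and rank layout are assumptions chosen only for brevity.

```python
# Minimal mpi4py sketch: each rank integrates a slice of f(x) = 4 / (1 + x^2)
# over [0, 1] and the partial sums are combined with a reduction on rank 0.
# Run with, e.g.: mpirun -n 4 python pi_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 1_000_000
h = 1.0 / n
local = sum(4.0 / (1.0 + ((i + 0.5) * h) ** 2) for i in range(rank, n, size)) * h

pi = comm.reduce(local, op=MPI.SUM, root=0)  # global sum of partial results
if rank == 0:
    print("pi estimate:", pi)
```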
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
DescriptionHigh-level I/O libraries, such as PnetCDF and HDF5, are commonly used by large-scale scientific applications to perform I/O tasks in parallel. These I/O libraries store the metadata of data objects in files along with their raw data. To ensure metadata consistency during parallel data object creation, they require applications to call the metadata APIs collectively using consistent metadata. Such a requirement can result in an expensive consistency check, as its cost increases with the metadata volume and the number of processes. To address this limitation, we propose a new file header format that uses partitioned metadata blocks to enable independent data object creation and reduce the number of objects involved in the consistency check. Our performance evaluation shows that this new design achieves scalable performance, cutting data object creation times by up to 196x when running on 4096 MPI processes to create 5,684,800 data objects in parallel.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionGraph motifs—small subgraphs such as triangles and cliques—are key tools for comparing and aligning networks in domains ranging from biology to social sciences. While recent advances enable motif counting in billion-edge networks, existing methods focus mainly on global frequencies. Building on ParaDyMS, we introduce a method to compute local edge-level motif frequencies, capturing the motifs incident to each edge. Experiments on real-world networks show that our approach achieves competitive performance against state-of-the-art static algorithms and demonstrate its scalability on shared memory systems and GPUs.
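To make the notion of an edge-level motif frequency concrete, the sketch below counts the simplest local motif, triangles per edge, with sparse matrix algebra; this is an editorial illustration only, not the ParaDyMS-based method described in the poster.

```python
# Per-edge triangle counts for a small undirected graph via sparse products.
import numpy as np
import scipy.sparse as sp

edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
n = 4
rows, cols = zip(*edges)
A = sp.coo_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))
A = ((A + A.T) > 0).astype(np.int64).tocsr()      # symmetric adjacency, no self-loops

# (A @ A)[i, j] counts common neighbors of i and j; masking with A keeps only
# actual edges, so each stored entry is the triangle count incident to that edge.
tri_per_edge = (A @ A).multiply(A)
print(tri_per_edge.toarray())
```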
Paper
Algorithms
Livestreamed
Recorded
TP
DescriptionHigher-order orthogonal iteration (HOOI) is an iterative algorithm that computes a Tucker decomposition of fixed ranks of an input tensor. In this work we modify HOOI to determine ranks adaptively subject to a fixed approximation error, apply optimizations to reduce the cost of each HOOI iteration, and parallelize the method in order to scale to large, dense datasets. We show that HOOI is competitive with the sequentially truncated higher-order singular value decomposition (ST-HOSVD) algorithm, particularly in cases of high compression ratios. Our proposed rank-adaptive HOOI can achieve comparable approximation error to ST-HOSVD in less time, sometimes achieving a better compression ratio. We demonstrate that our parallelization scales well over thousands of cores and show, using three scientific simulation datasets, that HOOI outperforms ST-HOSVD in high-compression regimes. For example, for a 3D fluid-flow simulation dataset, HOOI computes a Tucker decomposition 82x faster and achieves a compression ratio 50% better than ST-HOSVD's.
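The core of the rank-adaptive idea is choosing, per mode, the smallest rank whose discarded singular-value energy fits within an error budget. The numpy sketch below illustrates that selection step under assumed conventions (an even budget split across modes, a plain reshape as the unfolding); it is not the paper's HOOI implementation.

```python
# Rank selection by tail energy: keep the fewest components whose discarded
# squared singular values stay within the per-mode share of the error budget.
import numpy as np

def adaptive_rank(singular_values, budget):
    tail = np.cumsum(singular_values[::-1] ** 2)[::-1]   # energy dropped if truncated here
    ok = np.nonzero(tail <= budget)[0]
    return max(1, int(ok[0])) if ok.size else len(singular_values)

# Example: mode-0 rank of a random 3-way tensor for a 1% relative error,
# with the squared-error budget split evenly across the three modes.
X = np.random.rand(30, 20, 10)
budget = (0.01 * np.linalg.norm(X)) ** 2 / 3
s = np.linalg.svd(X.reshape(30, -1), compute_uv=False)   # mode-0 unfolding
print("adaptive mode-0 rank:", adaptive_rank(s, budget))
```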
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionConventional MPI trace visualizations become unwieldy as the number of processes increases. Identifying global patterns is challenging, and both the underlying machine topology and parallel program structure are often obscured by the rigid two-dimensional rank-time graphs. To address these limitations, we propose a three-dimensional video visualization of MPI programs that adapts its spatial layout to both the machine topology and the parallel decomposition of the numerical problem, while mapping process time evolution to actual video playback time.
Using Blender, we develop a framework that automatically translates MPI trace data into 3D visual scenes. We explore various display choices, including color-time gradients and transparency, to enhance interpretability. Our approach provides a new, intuitive exploration of both large-scale and local behaviors and patterns. We showcase the utility of this method with idle wave and process desynchronization phenomena.
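To suggest how trace records become an animated 3D scene, the sketch below creates one object per rank, placed by its position in the process grid, and keyframes it over playback frames. The trace records and mapping choices are hypothetical; the poster's framework reads real MPI traces and explores richer display choices.

```python
# Hedged Blender (bpy) sketch: ranks laid out by (x, y) in the decomposition,
# trace time mapped to animation frames. Run inside Blender's Python console.
import bpy

# (rank, (x, y) in the process grid, phase start time, phase end time) - made-up data
trace = [(0, (0, 0), 0.0, 1.0), (1, (1, 0), 0.0, 1.3), (2, (0, 1), 0.2, 1.1)]
fps = 24  # one simulated second per 24 playback frames

for rank, (x, y), t0, t1 in trace:
    bpy.ops.mesh.primitive_cube_add(size=0.4, location=(x, y, 0.0))
    cube = bpy.context.active_object
    cube.name = f"rank_{rank}"
    # Raise the cube while the phase is active so busy ranks stand out.
    cube.keyframe_insert(data_path="location", frame=int(t0 * fps))
    cube.location.z = 1.0
    cube.keyframe_insert(data_path="location", frame=int(t1 * fps))
```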
Workshop
Livestreamed
Recorded
TP
W
DescriptionTransforming unstructured information into structured common data models (CDM) is a critical step for enabling cancer surveillance and advancing precision medicine. CDMs standardize the structure and content of oncologic data extracted from electronic health records. Unfortunately, traditional Extract-Transform-Load processes for electronic health data capture are generally rule-based, error-prone, and produce static datasets unsuitable for near real-time information retrieval. The Modeling Outcomes using Surveillance Data and Scalable AI for Cancer (MOSSAIC) project developed and deployed a hierarchical self-attention (HiSAN) model capable of autocoding approximately 30% of National Cancer Institute Surveillance, Epidemiology, and End Results (SEER) registry cancer pathology reports [1], [2]. While a significant step forward, this falls short of the broader goal of automatically coding all pathology reports. Fully automating CDM conversion would facilitate clinical trial matching, decision support dashboards, real-time case ascertainment, and population health surveillance.
The distribution of cancer phenotypes in real-world data is highly imbalanced. While HiSAN performs well on classes well-represented during training, its accuracy and confidence degrade substantially for less common categories. Large language models (LLMs) offer a promising solution for underrepresented oncological entities, owing to their ability to leverage context and pretraining. Rather than relying solely on general-purpose models, domain adaptation or continual pretraining of LLMs may further improve performance by helping models learn the specialized vocabulary, abbreviations, and context typical of clinical text. In this study, we finetune LLMs for SEER pathology report classification, with and without additional domain-adaptive pretraining, and compare the results to the HiSAN baseline [2].
Based on Llama 3 8B, PathLlama was developed by finetuning for cancer pathology report classification, with and without domain adaptation. The domain adaptation task was next-token prediction, and the pretraining dataset was composed of a large corpus of approximately 10M cancer pathology reports and abstracts from SEER and about 500k clinical notes and radiology reports from MIMIC [3]. The PathLlama models were finetuned to classify site (70 categories), subsite (330), laterality (7), histology (677), and behavior (4). The finetuning dataset was 4,052,951 reports from six SEER registries: Kentucky, Louisiana, New Jersey, New Mexico, Seattle/Puget Sound, and Utah. The finetuning dataset was randomly split 80%/10%/10% into training, test, and validation sets, ensuring all reports associated with a single case belong to the same split.
Finetuning results are shown in Table I. We observe that the micro F1 scores, dominated by majority classes due to the imbalance in the dataset, improve only slightly from HiSAN to either of the PathLlama models. The most notable improvements in micro F1 come from the domain-adapted PathLlama for subsite and laterality. In contrast, more significant improvements occur for macro F1, particularly for subsite, laterality, and histology. For these three tasks, the domain-adapted PathLlama model also substantially outperforms the PathLlama base model. From these macro F1 results, we find that the contextual and pretraining advantages of Llama itself are indeed sufficient to markedly improve classification performance on underrepresented classes. However, domain adaptation offers additional benefit, further enhancing performance to a degree that justifies the increased computational cost of extended pretraining.
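The micro-versus-macro F1 distinction drives the interpretation above: micro F1 is dominated by the majority classes, while macro F1 weights each class equally and so rewards gains on rare categories. A toy check with scikit-learn (an editorial illustration, not the PathLlama evaluation code) makes the effect visible.

```python
# Why macro F1 moves more than micro F1 on imbalanced labels.
from sklearn.metrics import f1_score

y_true = ["common"] * 18 + ["rare"] * 2
model_a = ["common"] * 20                          # ignores the rare class entirely
model_b = ["common"] * 18 + ["rare", "common"]     # recovers one rare case

for name, y_pred in [("A", model_a), ("B", model_b)]:
    micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
    macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
    print(f"model {name}: micro F1 = {micro:.2f}, macro F1 = {macro:.2f}")
```

Recovering a single rare case barely moves micro F1 but lifts macro F1 substantially, which mirrors the pattern reported for the domain-adapted PathLlama on underrepresented classes.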
Workshop
PathPCNet: Pathway Principal Component-Based Interpretable Framework for Drug Sensitivity Prediction
11:55am - 12:10pm CST Monday, 17 November 2025, Room 241
Livestreamed
Recorded
TP
W
DescriptionBackground: Precision medicine aims to identify significant biomarkers and effective drugs based on individual genomic profiles, enabling personalized treatment strategies. Drug efficacy is commonly assessed via drug response, typically measured by the concentration required to inhibit a biological activity (e.g., IC50). In contrast, drug sensitivity reflects the strength of a tumor's response to a drug, where a lower effective dose indicates higher sensitivity. With the increased availability of large-scale multi-omics datasets, machine learning (ML) and deep learning approaches have emerged as powerful tools for studying drug response--holding great promise for accelerating biomarker discovery and enabling the development of more effective therapeutics.
Methods: We present `PathPCNet`, a novel interpretable deep learning framework that integrates multi-omics data (copy number variation, mutation, and RNA sequencing) with biological pathways, drug molecular structures, and Principal Component Analysis (PCA) to predict drug response. We project high-dimensional, noisy gene-level features to pathway-level principal components, and evaluate six machine learning models using the first one to five principal components. Our models are trained to predict the IC50 values for 182 drugs across 409 cell lines representing 29 cancer types from the GDSC (Genomics of Drug Sensitivity in Cancer) dataset. Finally, we fine-tune the deep learning model and apply SHAP to interpret feature contributions. SHAP scores are back-projected from the principal components to original genes using PCA loadings, enabling identification of the most significant genes.
Results: Our model achieves a Pearson correlation coefficient of 0.941 and an R-squared value of 0.885, outperforming existing pathway-based approaches for drug response prediction. Using SHAP-based model interpretation, we quantify the contributions of different omics and drug features, and identify critical pathways and gene-drug interactions involved in resistance mechanisms. These results highlight the potential of integrative deep learning models not only for accurate prediction, but also for uncovering biologically meaningful insights that can inform drug discovery and precision oncology. Furthermore, our framework enables the identification of key pathways, genes, and atomic-level drug attributes associated with drug sensitivity across diverse cancer types.
Discussion: Our intuitive feature extraction approach, based on pathway-level principal components, effectively reduces dimensionality while preserving data variance and enhancing biological interpretability. Tumor response is a complex biological phenomenon that extends beyond single gene–drug interactions. Therefore, integrating multi-omics profiles and molecular drug features within the context of biological pathways is essential for understanding drug response. This integrative approach has strong potential to support targeted therapy design, biomarker discovery, and the advancement of precision medicine and drug development.
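To make the back-projection step in the Methods concrete, the sketch below fits a PCA on one pathway's gene block and maps attributions on the principal components back to genes through the PCA loadings. The attribution vector is a random stand-in for SHAP output, and the dimensions are illustrative; this is not the PathPCNet code.

```python
# Pathway-level PCA and back-projection of component attributions to genes.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
expr = rng.normal(size=(409, 12))        # cell lines x genes in one pathway (toy sizes)
pca = PCA(n_components=3).fit(expr)
pc_scores = pca.transform(expr)          # pathway-level features fed to the model

# Stand-in attributions assigned to the 3 pathway PCs for one sample.
shap_pc = np.array([0.8, -0.2, 0.05])

# Back-project: each gene's share is the attribution-weighted sum of loadings.
gene_attr = shap_pc @ pca.components_    # shape (12,): one value per gene
top_genes = np.argsort(-np.abs(gene_attr))[:3]
print("most influential gene indices in this pathway:", top_genes)
```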
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
DescriptionEfficient storage, movement, and management of data are crucial to application performance and scientific productivity in both traditional simulation-oriented HPC environments and cloud AI/ML/big data analysis environments. This issue is further exacerbated by the growing volume of experimental and observational data, the widening gap in performance between computational hardware and storage hardware, and the emergence of new data-driven algorithms in machine learning. The goal of this workshop is to facilitate in-depth discussions of research and development that address the most critical challenges in large-scale data storage and data processing.
PDSW will continue to build on the successful tradition established by its predecessor workshops: the Petascale Data Storage Workshop (PDSW, 2006-2015) and the Data Intensive Scalable Computing Systems workshop (DISCS, 2012-2015). These workshops were successfully combined in 2016, and the resulting joint workshop has attracted up to 45 full paper submissions and 195 attendees per year from 2016 to 2024.
Workshop
Livestreamed
Recorded
TP
W
DescriptionPeachy Parallel Assignments are high-quality assignments that require students to practice concepts in parallel and distributed computing. They are selected competitively and published in the Edu* workshops to provide instructors with inspiration and easy-to-adopt assignments. The assignments must have been successfully tested with real students, be easy for other instructors to adopt in a variety of contexts, and be ``cool and inspirational'' for students completing them.
This article presents three Peachy Parallel Assignments selected for presentation at EduHPC 2025.
The first is a simulation of the growth of ``fairy rings'', a biologically motivated variation of the Game of Life. The second assignment asks students to simulate flooding over uneven terrain and in the presence of active rainfall. The third assignment has them implement the softmax function in parallel, motivated by applications in deep learning.
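The third assignment parallelizes cleanly because softmax reduces to two global reductions, a maximum (for numerical stability) and a sum. The sketch below shows that decomposition with numpy chunks standing in for workers; it is an editorial illustration, not the official handout.

```python
# Chunked, numerically stable softmax: a global max then a global sum,
# both computable per-chunk and combined, which is what makes it parallelizable.
import numpy as np

def softmax_chunked(x, n_chunks=4):
    chunks = np.array_split(x, n_chunks)             # one chunk per "worker"
    m = max(c.max() for c in chunks)                  # reduction 1: global max
    z = sum(np.exp(c - m).sum() for c in chunks)      # reduction 2: global sum
    return np.concatenate([np.exp(c - m) / z for c in chunks])

x = np.random.randn(10_000)
assert np.allclose(softmax_chunked(x).sum(), 1.0)
```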
Workshop
Livestreamed
Recorded
TP
W
DescriptionInstrumentation-based profiling is essential for uncovering fine-grained optimization opportunities in High-Performance Computing (HPC) and cloud applications, yet static instrumentation methods often impose fixed profiling overheads that cannot adapt to applications' dynamic workloads at runtime.
We further develop PEAK, a Dynamic Binary Instrumentation (DBI)-based profiler with two complementary modes for overhead control: a static mode, which enforces an upper limit on the absolute instrumentation overhead, and a dynamic mode based on the heartbeat mechanism, which controls the relative overhead in real time to maintain a user-defined ratio.
Evaluations of workloads ranging from compute-intensive kernels to lightweight functions show that the heartbeat mechanism effectively bounds overhead while improving profile accuracy compared to static methods, delivering predictable and adaptive profiling performance for long-running, dynamic workloads.
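The heartbeat idea can be pictured as a feedback loop: at each heartbeat the profiler compares the overhead it has accrued against elapsed wall time and scales how much it instruments to hold a target ratio. The sketch below is a generic illustration of that mechanism in Python, not PEAK's DBI implementation.

```python
# Heartbeat-style relative-overhead controller (illustration only).
import random
import time

class HeartbeatController:
    def __init__(self, target_ratio=0.05):
        self.target = target_ratio
        self.prob = 1.0                   # start fully instrumented
        self.overhead = 0.0
        self.start = time.perf_counter()

    def heartbeat(self):
        elapsed = time.perf_counter() - self.start
        ratio = self.overhead / max(elapsed, 1e-9)
        # Proportionally scale the instrumentation probability toward the target.
        self.prob = min(1.0, max(0.01, self.prob * self.target / max(ratio, 1e-9)))

    def maybe_instrument(self, record):
        if random.random() < self.prob:
            t0 = time.perf_counter()
            record()                       # the profiling work itself
            self.overhead += time.perf_counter() - t0
```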
Workshop
Livestreamed
Recorded
TP
W
DescriptionUnderstanding the behavior of scientific software is essential for maintaining the integrity and transparency of computational research. Tracking changes in computational parameters (input-output parameters, configuration parameters for hardware and software) across different versions of software executions aids this understanding by providing context to correlate outcomes with meaningful changes and to interpret results reliably. Diverse computing environments and software platforms add complexity to tracking the evolution of computational parameters across multiple runs. We present PerfAnalyzer, an interactive dashboard that simplifies the collection, management, and visual analysis of computational parameters across Git commits. We demonstrate the usefulness of the dashboard in identifying performance issues through a case study on collecting and analyzing computational parameters of the CloverLeaf mini-application. The results of the case study show PerfAnalyzer's ability to highlight performance changes across versions and to identify parameters related to changes that are difficult to locate using isolated measurements.
Paper
HPC for Machine Learning
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionThe increasing complexity of machine learning models and the proliferation of diverse hardware architectures (CPUs, GPUs, accelerators) make achieving optimal performance a significant challenge. Heterogeneity in instruction sets, specialized kernel requirements for different data types and model features (e.g., sparsity, quantization), and architecture-specific optimizations complicate performance tuning. Manual optimization is resource-intensive, while existing automatic approaches often rely on complex hardware-specific heuristics and uninterpretable intermediate representations, hindering performance portability. We introduce PerfLLM, a novel automatic optimization methodology leveraging large language models (LLMs) and reinforcement learning (RL). Central to this is PerfDojo, an environment framing optimization as an RL game using a human-readable, mathematically-inspired code representation that guarantees semantic validity through transformations. This allows effective optimization without prior hardware knowledge, facilitating both human analysis and RL agent training. We demonstrate PerfLLM's ability to achieve significant performance gains across diverse CPU (x86, Arm, RISC-V) and GPU architectures.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe performance gap between processors and memory, commonly referred to as the Memory Wall, has become a significant bottleneck. Compute Express Link (CXL) emerges as a promising solution to address these challenges by expanding memory space and bandwidth. In this work, we focus on the performance measurement and analysis of memory interleaving strategies on CXL memory. Our experiments, conducted on both a simulated and a genuine CXL-enabled system, show that naive interleaving configurations cannot always deliver the best memory bandwidth; in the worst case, bandwidth can be 26.97% lower than with the optimal configuration. Moreover, we observed distinct characteristics between the emulated and genuine CXL systems, exposing the limitations of evaluating memory-interleaving performance through simulation. Our work reveals the importance of interleaving configurations and provides a performance comparison with analyses for identifying the influencing factors and developing guidelines for CXL memory placement policy.
Workshop
Livestreamed
Recorded
TP
W
DescriptionCompute eXpress Link (CXL) is emerging as a promising memory interface technology. However, its performance characteristics remain largely unclear due to the limited availability of production hardware. In this work, we study how HPC applications and large language models (LLMs) can benefit from CXL memory and examine the interplay between memory tiering and page interleaving. We also propose a novel data object-level interleaving policy that matches the interleaving policy with memory access patterns. Our findings reveal the challenges and opportunities of using CXL.
Tutorial
Livestreamed
Recorded
TUT
DescriptionThis tutorial covers code analysis, performance modeling, and optimization for sparse linear solvers on CPUs and GPUs. Performance engineering is often taught using simple loops as examples for performance models and how they can guide optimization; however, full, preconditioned linear solvers comprise multiple loops and an iteration scheme that is executed to convergence. Consequently, the concept of "optimal performance" must account for both hardware efficiency and solver convergence. After introducing basic notions of hardware organization and storage for dense and sparse data structures, we show how to apply the roofline model to such solvers in predictive and diagnostic ways and how it can be used to assess the hardware efficiency of a solver, covering important corner cases such as memory boundedness. Then we advance to preconditioned solvers, using the conjugate gradient method (CG) algorithm as a leading example. Bottlenecks of the solver are identified, followed by the introduction of optimization techniques like the use of preconditioners and cache blocking. The interplay among solver performance, convergence, and time to solution is given special attention. In hands-on exercises, attendees will be able to carry out experiments on a GPU cluster and study the influence of matrix data formats, preconditioners, and cache optimizations.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionWe propose a co-design approach that integrates two powerful tools—MVAPICH and TAU—to demonstrate the new possibilities for performance-guided control and optimization for two large-scale applications—AWP-ODC and heFFTe. AWP-ODC is a highly scalable parallel finite-difference application with point-to-point operations that enables 3D earthquake calculations, while heFFTe is a massively parallel application that provides scalable and efficient implementations of the widely used Fast Fourier Transform using several MPI primitives. Through a deep integration between MVAPICH and TAU, the two applications can identify their performance bottlenecks on various supercomputers with different architectures. AWP-ODC and heFFTe can also act as representative real-world benchmarks to MVAPICH and TAU. We show how the co-design approach enables AWP-ODC and heFFTe to deliver better performance on cutting-edge HPC architectures. This is achieved using 1) more optimized and fine-tuned collective operations, and 2) reduced network traffic through real-time data compression.
Workshop
Performance Evaluation, Scalability, & Portability
Livestreamed
Recorded
TP
W
DescriptionThis paper describes the development of performance-portable batched linear algebra kernels for SN-DG neutron transport sweeps using Kokkos. We establish a new sweep algorithm for GPUs that relies on batched linear algebra kernels. We implement an optimized batched gesv solver for small linear systems that builds upon state-of-the-art algorithms. Our implementation achieves high performance by minimizing global memory traffic and maximizing the amount of computations done at compile time. We assess the performance of the batched gesv kernel on NVIDIA and AMD GPUs. We show that our custom implementation outperforms state-of-the-art linear algebra libraries on these architectures. The performance of the new GPU sweep implementation is assessed on the H100 and MI300A GPUs. We demonstrate that our GPU implementation is able to achieve high performance on both architectures, and is competitive with an optimized multithreaded CPU implementation on a 128-core CPU.
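The shape of work a batched gesv kernel targets, many small independent dense systems solved in one call, can be illustrated in a few lines of numpy. This is an editorial illustration of the batched pattern only, not the paper's Kokkos kernels.

```python
# Batched solve of many tiny n x n systems in a single stacked call.
import numpy as np

rng = np.random.default_rng(0)
batch, n = 4096, 8
A = rng.normal(size=(batch, n, n)) + 4.0 * np.eye(n)   # keep systems well conditioned
b = rng.normal(size=(batch, n, 1))

x = np.linalg.solve(A, b)                # one batched solve over the whole stack
print(np.max(np.abs(A @ x - b)))         # residual check
```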
Tutorial
Livestreamed
Recorded
TUT
DescriptionThe Roofline performance model offers an insightful and intuitive method for extracting the key execution characteristics of HPC and ML/AI applications and comparing them against the performance bounds of modern CPUs and GPUs. Its ability to abstract the complexity of memory hierarchies and identify the most profitable optimization techniques has made Roofline-based analysis increasingly popular in the HPC and ML/AI communities. Although different flavors of the Roofline model have been developed to deal with various definitions of memory data movement, there remains a need for a systematic methodology when applying them to analyze applications running on multicore and accelerated systems. This tutorial aims to bridge this gap on both CPUs and GPUs by exposing the fundamental aspects behind different Roofline modeling principles and providing several practical use case scenarios that highlight their efficacy for application optimization. This tutorial presents a unique combination of instruction in Roofline by its creator; hands-on instruction in using Roofline within Intel’s, NVIDIA’s, and AMD’s production performance tools; and discussions of real-world Roofline use cases at the ALCF, NERSC, and OLCF computing centers. The tutorial presenters have a long history of collaborating on the Roofline model and have presented several Roofline-based tutorials.
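At its core, the Roofline relation says attainable performance is the minimum of peak compute and memory bandwidth times arithmetic intensity. The snippet below states that relation as a function; the machine numbers are hypothetical examples, not any particular system from the tutorial.

```python
# Attainable performance under the basic Roofline model.
def roofline(peak_gflops, bandwidth_gbs, intensity_flop_per_byte):
    return min(peak_gflops, bandwidth_gbs * intensity_flop_per_byte)

# A memory-bound kernel (0.25 flop/byte) vs. a compute-bound one (50 flop/byte)
# on a hypothetical 20 TFLOP/s, 1.6 TB/s accelerator.
print(roofline(20_000, 1_600, 0.25))   # 400 GFLOP/s: limited by bandwidth
print(roofline(20_000, 1_600, 50.0))   # 20000 GFLOP/s: limited by peak compute
```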
Workshop
Livestreamed
Recorded
TP
W
DescriptionIn lower-upper (LU) factorization of the form A=LU, symbolic factorization is a pre-processing stage performed to discover the sparsity structure of the factors. A is usually not equal to L+U due to fill-ins (nonzeros that do not appear in A but are elements of L or U) introduced during factorization. Symbolic factorization can be performed with A's pattern by utilizing the corresponding graph. In this work, we assess the viability of utilizing GraphBLAS for symbolic factorization. GraphBLAS defines a standard way to express operations on graphs in the language of linear algebra. We express edge-based and path-based symbolic factorization using graph operations and investigate the utilization of masks and elimination trees. Our goal is to obtain a performant symbolic factorization that can be used in a portable manner on any hardware on which the GraphBLAS standard is realized. We demonstrate our approach with various sparse matrices on multi-core and many-core architectures.
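The graph computation underneath symbolic factorization is the classic elimination game: eliminating vertex k pairwise-connects its not-yet-eliminated neighbors, and every edge added this way corresponds to a fill-in in the factors. The plain-Python sketch below shows that computation directly, as an editorial illustration; the paper expresses the same idea with GraphBLAS matrix operations, masks, and elimination trees.

```python
# Elimination-game sketch of symbolic fill (natural ordering, symmetric pattern).
def symbolic_fill(n, edges):
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j); adj[j].add(i)
    fill = set()
    for k in range(n):                        # eliminate vertices in order
        later = sorted(v for v in adj[k] if v > k)
        for a in range(len(later)):
            for b in range(a + 1, len(later)):
                u, v = later[a], later[b]
                if v not in adj[u]:            # new edge => fill-in in the factors
                    adj[u].add(v); adj[v].add(u)
                    fill.add((u, v))
    return fill

# Arrow-shaped pattern: eliminating vertex 0 fills in the whole trailing block.
print(symbolic_fill(4, [(0, 1), (0, 2), (0, 3)]))
```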
Paper
HPC for Machine Learning
System Software and Cloud Computing
Partially Livestreamed
Partially Recorded
TP
DescriptionSpatiotemporal graph neural networks (ST-GNNs) are powerful tools for modeling spatial and temporal data dependencies. However, their applications have been limited primarily to small-scale datasets because of memory constraints. While distributed training offers a solution, current frameworks lack support for spatiotemporal models and overlook the properties of spatiotemporal data. Informed by a scaling study on a large-scale workload, we present PyTorch Geometric Temporal Index (PGT-I), an extension to PyTorch Geometric Temporal that integrates distributed data parallel training and two novel strategies: index-batching and distributed-index-batching. Our index techniques exploit spatiotemporal structure to construct snapshots dynamically at runtime, significantly reducing memory overhead, while distributed-index-batching extends this approach by enabling scalable processing across multiple GPUs. Our techniques enable the first-ever training of an ST-GNN on the entire PeMS dataset without graph partitioning, reducing peak memory usage by up to 89% and achieving up to a 11.78x speedup over standard DDP with 128 GPUs.
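The index-batching idea can be pictured as a dataset that keeps a single copy of the node-feature time series and slices (window, horizon) pairs at access time instead of materializing every snapshot up front. The PyTorch sketch below illustrates that pattern under assumed tensor shapes; it is not the PGT-I code.

```python
# Index-batched spatiotemporal windows built dynamically at access time.
import torch
from torch.utils.data import Dataset, DataLoader

class IndexBatchedWindows(Dataset):
    def __init__(self, features, window=12, horizon=3):
        self.x = features            # (timesteps, num_nodes, num_features), stored once
        self.window, self.horizon = window, horizon

    def __len__(self):
        return self.x.shape[0] - self.window - self.horizon + 1

    def __getitem__(self, t):
        # Slices are views into the shared tensor; no per-snapshot copies are held.
        return (self.x[t : t + self.window],
                self.x[t + self.window : t + self.window + self.horizon])

series = torch.randn(2000, 325, 2)        # e.g., a PeMS-like sensor series (toy size)
loader = DataLoader(IndexBatchedWindows(series), batch_size=32, shuffle=True)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)                  # (32, 12, 325, 2) (32, 3, 325, 2)
```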
Invited Talk
Art of HPC
Creativity
Livestreamed
Recorded
TP
DescriptionData are usually understood as descriptions of the world, a reference to something out there. Yet, today it is increasingly clear that this representational perspective falls short in addressing many of the critical issues, practices, and debates surrounding data. This talk explores three non-representational perspectives on data, discussing their surprising and paradoxical consequences: autographic, data as material traces that inscribe environmental changes without symbolic mediation; synthetic, training data for AI models fabricated to resemble observations yet detached from the world they mimic; and toxic, infotrash and AI slop, the proliferating by-products of AI that sustain digital capitalism despite their apparent uselessness. By attending to the agency of data, I argue that focusing on representation alone obscures the relational, material, and economic dimensions through which data now operate.
Paper
Data Analytics, Visualization & Storage
Livestreamed
Recorded
TP
DescriptionGPU Direct Storage (GDS) plays a vital role in GPU storage systems, utilizing P2P-DMA technology to establish a direct data transfer path between the GPU and storage devices. This direct path reduces storage access latency and CPU overhead, thus improving data transfer efficiency. Currently, however, GDS employs a phony buffer in host memory to interact with the Linux kernel, resulting in suboptimal performance, additional resource consumption, and deployment complexity.
In this paper, we propose Phoenix, a refactored GDS software stack without phony buffers. Phoenix employs the memory mapping service of ZONE_DEVICE to map GPU memory into the page table at system startup. The kernel module of Phoenix stores the returned address information, allocates user-space virtual memory, and establishes a mapping with the designated GPU memory. Extensive evaluation shows that, compared to the existing GDS software stack, Phoenix reduces software overhead along the critical I/O path and improves end-to-end performance.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis paper evaluates oversubscribing in High-Performance Computing (HPC) systems as a solution to balance interactive and batch job performance. Using real workload traces and physical hardware experiments, we demonstrate that oversubscribing can reduce queue waiting times while maintaining overall system performance. Our results show this approach (1) decreases waiting times for interactive jobs, (2) has minimal impact on overall system throughput, and (3) effectively manages individual job turnaround times. Unlike traditional multiple queue approaches, oversubscribing provides these benefits with simpler configuration requirements. Additionally, through quantitative memory usage analysis, we provide insights into oversubscribing applicability for production capacity planning. Our research contributes empirical evidence of its effectiveness in real HPC environments, supported by comprehensive experimental data and practical implementation insights.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionModern scientific computing generates massive simulation data across physics domains, yet researchers lack general-purpose tools for efficient analysis. While vision transformers like CLIP and DINO have revolutionized natural image analysis, no equivalent exists for physics simulation data. This project trains a custom vision transformer on “the Well” dataset, a 15 TB collection of diverse physics simulations. Using only 7 million images (compared to >100 million for CLIP/DINOv2), we trained our physics foundation model in 22 hours on a single Cerebras CS-3 server. Despite the reduced training scale, our model demonstrates competitive classification performance while excelling at physics-specific tasks: temporal forecasting (R² = 0.33 vs. DINOv2’s 0.23) and physics clustering (silhouette score = 0.232 vs. DINOv2’s 0.195). This work demonstrates that efficient, domain-focused foundation models can achieve better performance in specialized scientific domains.
Invited Talk
AI, Machine Learning, & Deep Learning
Artificial Intelligence & Machine Learning
Big Data
Livestreamed
Recorded
TP
DescriptionThe proliferation of geospatial artificial intelligence (GeoAI) is generating unprecedented demands on high performance computing infrastructure. Training foundational Earth observation (EO) models on petabytes of satellite imagery, running inference across multi-modal global geospatial models, and generating global vector datasets like agricultural field boundaries and building footprints all require computational strategies at a massive scale.
These complex tasks, which are fundamental to both commercial industries and national security, push the boundaries of modern supercomputing, demanding novel approaches to data management, model parallelism, and distributed inference pipelines that can handle the sheer volume and velocity of geospatial data.
St. Louis has emerged as a critical epicenter for this technological convergence, fostering a unique ecosystem where GeoAI innovations are cross-pollinating between disparate domains. This presentation will highlight the region's synergistic environment, where cutting-edge techniques developed for precision agriculture directly inform vital applications in national security, and vice versa. We will explore specific case studies that demonstrate how this local cross-fertilization is accelerating the development of next-generation geospatial capabilities and cementing St. Louis's role as a global leader in solving planetary-scale challenges through supercomputing.
Paper
HPC for Machine Learning
System Software and Cloud Computing
Partially Livestreamed
Partially Recorded
TP
DescriptionGraph neural networks leverage the connectivity and structure of real-world graphs to learn intricate properties and relationships between nodes. Many real-world graphs exceed the memory capacity of a GPU due to their sheer size, and distributed full-graph training suffers from high communication overheads and load imbalance due to the irregular structure of graphs. We propose a three-dimensional parallel approach for full-graph training that tackles these issues and scales to billion-edge graphs. In addition, we introduce optimizations such as a double permutation scheme for load balancing, and a performance model to predict the optimal 3D configuration of our parallel implementation: Plexus. We evaluate Plexus on six different graph datasets and show scaling results on up to 2,048 GPUs of Perlmutter, and 1,024 GPUs of Frontier. Plexus achieves unprecedented speedups of 2.3-12.5X over prior state of the art, and a reduction in time-to-solution by 5.2-8.7X on Perlmutter and 7.0-54.2X on Frontier.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe PMBS25 workshop is concerned with the comparison of high-performance computing systems through performance modeling, benchmarking, or through the use of tools such as simulators. We are particularly interested in research that reports the ability to measure and make tradeoffs in software/hardware co-design to improve sustained application performance. We are also keen to capture the assessment of future systems.
The aim of this workshop is to bring together researchers, from industry and academia, concerned with the qualitative and quantitative evaluation and modeling of high-performance computing systems. Authors are invited to submit novel research in all areas of performance modeling, benchmarking, and simulation, and we welcome research that brings together current theory and practice. We recognize that the term "performance" has broadened to include power consumption and reliability, and that performance modeling is practiced through analytical methods and approaches based on software tools and simulators.
SCinet
Not Livestreamed
Not Recorded
Tutorial
Livestreamed
Recorded
TUT
DescriptionIn this tutorial, you’ll discover the portable parallelism and concurrency features of the ISO C++23 standard and learn to accelerate HPC applications on modern, heterogeneous GPU-based systems from all three main vendors (AMD, Intel, NVIDIA), without any non-standard extensions. We’ll show you how to parallelize classic HPC patterns like multi-dimensional loops and reductions, and how to solve common problems like overlapping MPI communication with GPU computation. The material is supplemented with numerous hands-on exercises and illustrative HPC mini-applications. All exercises will be done on cloud GPU instances directly in your web browser—no setup required. The tutorial synthesizes practical techniques acquired from our professional experience to show how the C++23 standard programming model applies to real-world HPC workloads, and which thoughts went into implementing and designing the programming model itself. You'll also receive links to additional resources and a preview of upcoming C++ features.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis paper presents the two-step effort undertaken to port GYSELA, a petascale Fortran simulation code for turbulence in tokamak plasmas, to GPUs. The initial porting process using OpenMP offloading delivered good performance in most of the code, with the exception of the collision operator, which became a major bottleneck. This performance-critical operator was then rewritten in C++ using Kokkos; now known as KoLiOp, it is one of the code modules with the largest speedups relative to the CPU baseline. We explain our strategy in both phases of development and provide an in-depth analysis of how we leveraged each framework for overall performance. The techniques detailed are applicable to other codes seeking to use a portability layer. Finally, we present a comparative benchmark run on the CPU (AMD Genoa) and GPU (MI250X) partitions of the Adastra machine as well as on its upcoming MI300A APU nodes.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe implement a post-variational quantum neural network on a real HPC-QC system and show the feasibility of fully training this class of algorithms on current Noisy Intermediate-Scale Quantum (NISQ) devices, which are limited by noise, low qubit counts, and scarcity. Post-variational methods are hybrid classical-quantum machine learning algorithms that remove the need for quantum circuit evaluations during training, thus making them better suited to the availability constraints of physical quantum devices. We investigate the scalability of the algorithm to higher numbers of qubits, larger datasets, and more elaborate models, giving insights for more efficient implementations. Experiments on an image classification task on a cutting-edge HPC-QC system show that post-variational quantum neural networks are fully trainable in reasonable times on a superconducting device. The trained models also show performance at least comparable to a variational approach, with one configuration showing a significant improvement in classification accuracy.
Birds of a Feather
Education & Workforce Development
Livestreamed
Recorded
TP
XO/EX
DescriptionWriting good parallel programs is painful. Very painful.
This is because high performance compute systems have evolved from simple single-core machines, strung together with Ethernet, into multi-core, multi-accelerator, multi-level monsters. As a consequence, programming such systems means dealing with synchronization and communication overhead, load imbalance, and a multitude of programming models and languages.
In this BoF we bring together application people to share their pain in programming parallel systems, with people working on programming frameworks and models. We hope this can lead to insights and solutions to alleviate the pain.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionIn this study, we discuss the practicality and limitations of using current large language models (LLMs) for automatically translating legacy Fortran codes to C++, so that legacy codes written in Fortran can be modernized to exploit the performance and features available only in C++. Moreover, we investigate the effectiveness of in-context learning (ICL: translation using custom prompts) and interactive translation (IT: re-translating the code when a compile error occurs) for automatic Fortran-to-C++ translation. In our evaluation, the rate of producing the same results as the original code, called the output match rate, is used as the primary evaluation metric. The evaluation results demonstrate not only that it is difficult even for the latest LLMs to achieve 100% accurate translation at present, but also that ICL and IT are effective at improving accuracy.
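The interactive-translation loop can be sketched as: translate, try to compile, and re-prompt with the compiler diagnostics until the candidate compiles or attempts run out. In the sketch below, `translate_with_llm` is a hypothetical helper standing in for whatever LLM API is used; the loop structure is an illustration of the IT idea, not the study's exact pipeline.

```python
# Interactive Fortran-to-C++ translation loop (illustration only).
import pathlib
import subprocess
import tempfile

def interactive_translate(fortran_src, translate_with_llm, max_rounds=3):
    prompt = f"Translate this Fortran to C++:\n{fortran_src}"
    for _ in range(max_rounds):
        cpp_src = translate_with_llm(prompt)          # hypothetical LLM call
        with tempfile.TemporaryDirectory() as d:
            src = pathlib.Path(d) / "candidate.cpp"
            obj = pathlib.Path(d) / "candidate.o"
            src.write_text(cpp_src)
            result = subprocess.run(["g++", "-c", str(src), "-o", str(obj)],
                                    capture_output=True, text=True)
        if result.returncode == 0:
            return cpp_src                             # compiles; output match checked later
        # Re-translate with the compile error appended to the prompt.
        prompt = (f"The previous translation failed to compile:\n{result.stderr}\n"
                  f"Fix it. Original Fortran:\n{fortran_src}")
    return None
```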
Workshop
Partially Livestreamed
Partially Recorded
TP
W
DescriptionArtificial Intelligence (AI) is rapidly transforming scientific discovery and industrial applications, but its growth has escalated demands on high-performance computing (HPC) resources. A central challenge is predicting resource requirements for deep neural network (DNN) workloads, where inefficient provisioning leads to underutilized GPUs, wasted CPUs, and higher costs. This work explores AI resource prediction in HPC using complementary approaches: Black-Box models leverage tabular features and regressors such as XGBoost for fast, workload-specific predictions. In contrast, White-Box models extract graph-based features from High-Level Optimized (HLO) graphs to generalize across architectures. Results show hybrid methods significantly improve accuracy, reducing fit-time estimation error from 75.48% to 10.55%. The estimators are being integrated with AI-driven job schedulers to improve workload allocation and utilization, paving the way for creating agents for Machine Learning Workflow (MLOps) systems across the computing continuum.
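The Black-Box path amounts to fitting a tabular regressor on workload descriptors to predict a resource quantity. The sketch below shows that pattern with XGBoost on synthetic features and targets; the feature set, target formula, and numbers are illustrative assumptions, not the paper's dataset or model.

```python
# Black-box resource prediction from tabular job descriptors (illustration only).
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
# Columns: batch size, parameter count (millions), sequence length, precision bytes
X = rng.uniform([8, 10, 128, 2], [512, 7000, 4096, 4], size=(500, 4))
y = 0.002 * X[:, 0] * X[:, 1] + 0.5 * X[:, 3] * X[:, 1] + rng.normal(0, 50, 500)

model = XGBRegressor(n_estimators=200, max_depth=4).fit(X[:400], y[:400])
pred = model.predict(X[400:])
print("mean absolute error:", np.mean(np.abs(pred - y[400:])))
```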
Workshop
Performance Evaluation, Scalability, & Portability
Livestreamed
Recorded
TP
W
DescriptionPreparing large-scale scientific applications for diverse GPU architectures requires strategies that balance performance, portability, and long-term maintainability.
We introduce a unified kernel abstraction and evaluate it using CRK-HACC, a production N-body cosmology code, enabling single-source compilation through both CUDA and SYCL toolchains. Our approach introduces a thin C++ layer that preserves the original CUDA kernel syntax and launch style while providing SYCL compatibility through a mechanical ``functorization'' process. This method avoids the complexity of automated source translation, retains architecture-specific optimizations, and reduces maintenance effort by eliminating code duplication. We evaluate the implementation on two DOE leadership systems—Polaris (NVIDIA GPUs) and Aurora (Intel GPUs)—comparing kernel-level execution times across backends and architectures. Results show competitive performance for SYCL relative to native CUDA while preserving code clarity and portability.
This case study demonstrates a practical path toward sustaining performance in complex, physics-rich codes as HPC hardware continues to evolve.
Workshop
Livestreamed
Recorded
TP
W
DescriptionTraining large language models (LLMs) at scale presents challenges that demand careful co-design across software, hardware, and parallelization strategies. In this work, we introduce a communication-aware tuning methodology for optimizing LLM pretraining, and extend the performance portability metric to evaluate LLM-training efficiency across our systems. Our methodology, validated through LLM pretraining workloads at a leading global technology enterprise, delivered up to 1.6x speedup over default configurations. We further provide six key insights that challenge prevailing assumptions in LLM training performance, including the trade-offs between ZeRO stages, the default DeepSpeed communication collectives, and the critical role of batch size choices. Our findings highlight the need for platform-specific tuning and advocate for a shift toward end-to-end co-design to unlock performance efficiency in LLM training.
Tutorial
Livestreamed
Recorded
TUT
DescriptionRecent advances in machine learning and deep learning (ML/DL) have led to many exciting challenges and opportunities. Modern ML/DL frameworks including PyTorch, TensorFlow, and cuML enable high-performance training, inference, and deployment for various types of ML models and deep neural networks (DNNs). This tutorial provides an overview of recent trends in ML/DL and the role of cutting-edge hardware architectures and interconnects in moving the field forward. We will also present an overview of different DNN architectures, ML/DL frameworks, DL training and inference, and hyperparameter optimization, with special focus on parallelization strategies for large models such as GPT, LLaMA, DeepSeek, and ViT. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU/GPU architectures to efficiently support large-scale distributed training. We also highlight some of our co-design efforts to utilize MPI for large-scale DNN training on cutting-edge CPU/GPU/DPU architectures available on modern HPC clusters. Throughout the tutorial, we include several hands-on exercises to enable attendees to gain firsthand experience of running distributed ML/DL training and hyperparameter optimizations on a modern GPU cluster.
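For orientation, a minimal data-parallel training skeleton in PyTorch (one of the frameworks named above) looks roughly like the sketch below; the tutorial's own hands-on exercises may use different launchers, backends, and models.

```python
# Minimal distributed data-parallel skeleton (illustrative, not the tutorial's code).
# Launch with e.g.:  torchrun --nproc_per_node=4 ddp_min.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])    # gradients all-reduced across ranks
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):                            # toy training loop
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```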
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionThere is growing interest in securing scientific software, which underpins research results and often transitions into commercial systems. While source code metrics provide useful indicators of vulnerabilities, software engineering process (SEP) metrics can uncover patterns that lead to their introduction. Few studies have explored whether SEP metrics can reveal risky development activities over time—insights that are essential for predicting vulnerabilities.
This work highlights the critical role of SEP metrics in understanding and mitigating vulnerability reintroduction. We move beyond file-level prediction and analyze security fixes at the commit level, focusing on sequences of changes where vulnerabilities evolve and re-emerge. Our approach emphasizes that reintroduction is rarely the result of one isolated action, but emerges from cumulative development activities and socio-technical conditions.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionDistributed computing frameworks are vital for managing complex workloads in high-performance computing and scientific research. Julia's Dagger.jl supports task-based parallelism using TCP communication, suitable for cloud and local environments. However, TCP limits performance on modern HPC systems with low-latency, high-bandwidth interconnects. We introduce MPIAcceleration, an MPI-based backend replacing TCP with MPI-aware task placement and data movement. We benchmarked MPIAcceleration against TCP using parallel Cholesky decomposition on the Aurora exascale supercomputer at Argonne National Laboratory. Results show MPI successfully enables Dagger on HPC interconnects, significantly outperforming TCP on Aurora by overcoming the latency limitations inherent in standard TCP.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWith the availability of sophisticated profiling tools for GPUs such as NVIDIA’s Nsight Compute and Nsight Systems, programmers tend to overlook the level of insight that can be gained from simple profiling techniques. For instance, the basic profiling approach of manually adding counters to source code is able to expose important application-specific behavior that general-purpose profilers cannot capture. Analyzing global or thread-local counts of certain events can help developers better reason about program behaviors that are crucial for detecting performance bottlenecks, validating key assumptions, and guiding effective optimizations. In this paper, we use five high-performance GPU graph-analytics codes to demonstrate how this profiling approach uncovered interesting application behaviors and led to performance optimizations based on some of them.
Workshop
Livestreamed
Recorded
TP
W
DescriptionInnovation in cancer care and research comes about in many ways. High performance computing (HPC) has expanded the frontier for innovation in cancer and promises accelerated impact on the massive challenge of cancer. Yet even with visionary inspiration, vast quantities of data, and a growing capacity in HPC, many key technical, scientific, organizational, and cultural challenges remain in realizing impactful patient outcomes.
The volume of real-world patient data at The University of Texas MD Anderson Cancer Center is tremendous, with 1.6 million outpatient visits, 14 million pathology and laboratory procedures, and over 600,000 diagnostic imaging procedures each year. Harnessing and leveraging this information, however, requires more than HPC and computational algorithms. Ultimately, it requires building a data and data science ecosystem that enables interdisciplinary teams to communicate effectively around the data. To formulate actionable questions for cancer discovery and clinical data, teams must appreciate the importance of the context of the data and embrace the complexity of the data, cancer, and the surrounding healthcare system.
MD Anderson Cancer Center has taken a particularly innovative approach in creating the Institute for Data Science in Oncology (IDSO) as part of an overall institutional strategy to tackle these challenges and accelerate translational impact for data science. While not a traditional computational approach for cancer focusing on algorithms, methods, technologies, and implementations, the IDSO programmatic approach is addressing key challenges of creating an organizational ecosystem and culture that readily embraces, innovates, advances, and adopts computational and data science approaches to cancer.
Building on a decade of formative efforts and formally launched in 2023, the IDSO approach is anchored with three pillars of team data science, translational impact, and continuous learning and innovation, all with a direction for improving patient care. The IDSO serves as a hub with defined programs in education and culture, collaboration (both internal and external), and five co-led team data science focus areas emphasizing translational impact in domains of quantitative imaging, single cell spatial analytics, computational modeling for precision medicine, decision analytics for health, and safety, quality, and access.
Already, IDSO is having key impacts, opening avenues for innovation in computational approaches and data flows and growing demand for HPC in meeting cancer challenges. A collaboration with the University of Texas at Austin, the Texas Advanced Computing Center, and MD Anderson, co-led with IDSO, has led to over 20 new collaborative projects involving HPC. The Tumor Measurement Initiative, which heavily uses HPC to train AI models using MD Anderson’s vast image resources, has prepared hundreds of imaging datasets utilizing tens of thousands of images and developed an initial library of model algorithms. The IDSO affiliates program now includes more than 50 individuals from across the institution. And in just two short years, the fellowship training program has trained more than 38 personnel in data science.
The presentation will provide useful insights including lessons learned in forming, launching and establishing the IDSO, perspectives on challenges that require communities to solve, and thoughts on future directions.
Workshop
Livestreamed
Recorded
TP
W
DescriptionA major challenge the HPC community faces is how to deliver the increased performance demanded by scientific programmers whilst addressing an increased emphasis on sustainable operations. Specialised architectures, such as FPGAs and AMD's AI Engines (AIEs), have demonstrated significant energy efficiency advantages; however, substantial expertise and investment of time are required to gain the best performance from this hardware, which remains a major barrier to adoption.
Fortran is the lingua franca of scientific computing, and in this paper we explore the automatic offloading of Fortran intrinsics to the AIEs in AMD's Ryzen AI CPU as a case study, demonstrating how the MLIR compiler ecosystem can provide both performance and programmer productivity. We describe an approach that lowers the MLIR linear algebra dialect to AMD's AIE dialects, and demonstrate that for suitable workloads the AIEs can provide significant performance advantages over the CPU without any code modifications required by the programmer.
Tutorial
Livestreamed
Recorded
TUT
DescriptionScientific applications are increasingly adopting artificial intelligence (AI) techniques to advance science. There are specialized hardware accelerators designed and built to run AI applications efficiently. With a wide diversity in the hardware architectures and software stacks of these systems, it is challenging to understand the differences between these accelerators, their capabilities, programming approaches, and how they perform, particularly for scientific applications. In this tutorial, we will cover an overview of the AI accelerators landscape, focusing on SambaNova, Cerebras, Graphcore, Groq, and Intel Gaudi systems along with architectural features and details of their software stacks. Through hands-on exercises, attendees will gain practical experience in refactoring code and running models on these systems, focusing on use cases of pre-training and fine-tuning open-source large language models (LLMs) and deploying AI inference solutions relevant to scientific contexts. The tutorial will provide attendees with an understanding of the key capabilities of emerging AI accelerators and their performance implications for scientific applications.
Tutorial
Livestreamed
Recorded
TUT
DescriptionIf you are an HPC programmer, you know OpenMP. Alongside MPI, OpenMP is the open, cross-vendor foundation of HPC. As hardware complexity has grown, OpenMP has grown as well, adding GPU support in OpenMP 4.0 (2013). With a decade of evolution since then, OpenMP GPU technology is a mature option for programming any GPU you are likely to find on the market. While there are many ways to program a GPU, the best way is through OpenMP. Why? Because the GPU does not exist in isolation. There are always one or more CPUs on a node. Programmers need portable code that fully exploits all available processors. In other words, programmers need a programming model, such as OpenMP, that fully embraces heterogeneity. In this tutorial, we explore GPU programming with OpenMP. We assume attendees already know the fundamentals of multithreading with OpenMP, so we will focus on the directives that define how to map loops onto GPUs and optimize data movement between the CPU and GPU. Students will use their own laptops (with Windows, Linux, or macOS) to connect to remote servers with GPUs and all the software needed for the tutorial.
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionCurrently there is no recognized test standard for determining the thermal capacity of a single-phase liquid-to-liquid coolant distribution unit (CDU). The absence of a nationally recognized method of test for rating adds to the complexity of making meaningful decisions about which vendors’ CDUs should be deployed and what their actual thermal capacity is. In the absence of an accepted standard, it is difficult for engineers and owners to make fair and accurate comparisons between different CDUs.
In this presentation, Dave Meadows will address the current state of development of a method of test for rating of liquid-to-liquid single-phase coolant distribution units. He will detail progress made within the ASHRAE 127 standard and discuss how this relates to AHRI standard 1360. Dave will discuss the proposed methodology and required test equipment needed to record the thermal capacity and hydraulic capabilities of a single-phase liquid-to-liquid CDU.
Exhibitor Forum
Data Analytics
Livestreamed
Recorded
TP
XO/EX
DescriptionIn high performance computing (HPC) environments, data analytics systems often face inefficiencies related to I/O, leading to increased costs. To address these challenges, we propose OASIS (Object-based Analytics Storage for Intelligent SQL Query Offloading), an interoperable and standards-based computational storage system that leverages Substrait and Arrow from the Apache ecosystem.
A key feature of the OASIS system is its ability to provide a consistent data view and a unified analytics methodology across the entire infrastructure, from compute nodes to data-aware computational storage devices (CSDs). This capability enables the creation of a vertically optimized and scalable analytics pipeline, facilitating the flexible distribution of computational loads and promoting optimal performance throughout the data analytics system.
In this talk, we will share performance results for HPC workload analysis within the OASIS-based data analytics system and discuss the applicability of integrating OASIS with existing data analytics frameworks.
SCinet Network Research Exhibition
Not Livestreamed
Not Recorded
Workshop
Livestreamed
Recorded
TP
W
DescriptionWarewulf v4, the current generation of the popular cluster provisioning system, is significantly simpler than its predecessor, supporting only a single-stage, provision-to-memory pattern. While this simplification has many benefits, users of the platform, particularly those coming from Warewulf 3, continue to request "stateful" provisioning support. The Warewulf development community, however, is protective of the simplicity of the current platform and wants to introduce "diskful" provisioning features without sacrificing the simplicity and benefits of the stateless provisioning paradigm.
Here we present recent additions to Warewulf's ability to provision local storage and to provision a node image to that local storage. We also present a proposed roadmap for the future of disk provisioning in Warewulf as a prompt for further community feedback.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis presentation introduces Elm, Stanford Research Computing’s latest storage system designed to handle large-scale archiving of research data, up to hundreds of petabytes. With a strong focus on affordability and energy efficiency, Elm combines several open-source technologies: MinIO for S3 compatibility and high-level data security through erasure coding, Lustre with built-in parallel hierarchical storage management (HSM), Phobos for modern tape management, and LTFS for easy access to tape data in a standardized format. Together, these elements create a seamless S3 experience for researchers and offer them access to scalable cold storage for their archival needs. Elm opens new opportunities for data storage at Stanford and has the potential to be replicated at other research institutions.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis talk outlines system- and algorithm-level progress toward quantum-advantage-ready QC-HPC, where quantum and classical resources cooperate and are integrated into a single logical system. We will focus on Pasqal's neutral atom QPUs and their native capability for quantum simulation, as well as progress towards fault-tolerant, digital QPUs. We also discuss the design of heterogeneous (multi-modal) QC-HPC systems.
Paper
Post-Moore Computing
Quantum Computing
Livestreamed
Recorded
TP
DescriptionProtein structure prediction is a core challenge in computational biology, particularly for fragments within ligand-binding regions, where accurate modeling is still difficult. Quantum computing offers a novel first-principles modeling paradigm, but its application is currently limited by hardware constraints, high computational cost, and the lack of a standardized benchmarking dataset. In this work, we present QDockBank—the first large-scale protein fragment structure dataset generated entirely using utility-level quantum computers, specifically designed for protein–ligand docking tasks. QDockBank comprises 55 protein fragments extracted from ligand-binding pockets. The dataset was generated through tens of hours of execution on superconducting quantum processors, making it the first quantum-based protein structure dataset with a total computational cost exceeding $1 million. Experimental evaluations demonstrate that structures predicted by QDockBank outperform those predicted by AlphaFold2 and AlphaFold3 in terms of both RMSD and docking affinity scores. QDockBank serves as a new benchmark for evaluating quantum-based protein structure prediction.
Workshop
Livestreamed
Recorded
TP
W
DescriptionQiskit is a popular open-source SDK for quantum computing: it enables users to build quantum circuits and compile them for a specific quantum computer, and it provides interfaces for running circuits. Historically, Qiskit only exposed a Python interface, but this has changed in recent releases and Qiskit now also has a C API. This talk will explain how Qiskit's C API was developed, how to use the new API, and show how it is building a broader ecosystem for quantum software. It will also demonstrate practical examples of using the C API to run circuits on quantum computers.
Workshop
Livestreamed
Recorded
TP
W
DescriptionSample-based Quantum Diagonalization (SQD) is a hybrid quantum-classical algorithm which approximates the ground state of a many-body quantum system. This talk introduces a new Qiskit addon for SQD, implemented as a modern C++ template library. Closely related to the original Python-based SQD addon, this new library provides high-performance implementations of key algorithmic components: post-selection, subsampling, and configuration recovery. It is designed to integrate with the SBD eigensolver developed at RIKEN, enabling large-scale SQD calculations. This talk will demonstrate how these components, together with the Qiskit C API, enable the construction of fully compiled applications for hybrid quantum/classical workflows. As a case study, we present an open-source application that uses SQD to approximate the ground state energy of the Fe₄S₄ cluster—a molecule of interest in quantum chemistry. This example showcases Qiskit's readiness for high-performance computing environments and its growing support for compiled, scalable quantum-classical applications.
Paper
BSP
Post-Moore Computing
Quantum Computing
Livestreamed
Recorded
TP
DescriptionWe describe Qonductor, a cloud orchestrator for hybrid quantum-classical applications that run on heterogeneous hybrid resources. Qonductor abstracts away the complexity of hybrid programming and resource management by exposing the Qonductor API, a high-level and hardware-agnostic API. The resource estimator strategically balances quantum and classical resources to mitigate resource contention and the effects of hardware noise. The hybrid scheduler automates job scheduling on hybrid resources and balances the tradeoff between users’ objectives of QoS and the cloud operator’s objective of resource efficiency.
We implement an open-source prototype and evaluate Qonductor using more than 7,000 real quantum runs on the IBM Quantum Cloud to simulate real cloud workloads. Qonductor achieves up to 54% lower job completion times (JCTs) while sacrificing 3% execution quality; balances the load across QPUs, which increases quantum resource utilization by up to 66%; and scales with growing system sizes and loads.
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
DescriptionHigh-performance computing facilities increasingly form hybrid environments that integrate cloud services.
To avoid cumbersome network transfers when sharing data, a new class of storage gateways maps a subset of facility storage to a cloud counterpart and automatically manages data mirroring.
However, the performance characteristics of accessing AWS's S3 from HPC systems using different methods and patterns remain poorly understood.
This paper presents a roofline-based analysis of three S3 integration approaches: NFS-mounted AWS Storage Gateway, data migration through Storage Gateway, and direct S3 API transfers. We extend I/O roofline modeling to characterize operational intensity and bandwidth ceilings across varying data sizes and access patterns.
Our experimental evaluation demonstrates significant performance differences between access methods, with POSIX I/O on NFS Storage Gateway achieving up to 6.4× higher bandwidth than other approaches for large transfers. The roofline analysis reveals distinct characteristics for each method, enabling informed selection of S3 integration strategies.
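For reference, the roofline bound underlying this analysis, restated for I/O with operational intensity measured per byte moved to or from S3 (the paper's exact definitions may differ), is
\[
P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\; I \cdot B_{\text{peak}}\bigr),
\qquad I = \frac{\text{useful work}}{\text{bytes transferred}},
\]
where \(B_{\text{peak}}\) is the bandwidth ceiling of the chosen access method (NFS-mounted gateway, gateway migration, or direct S3 API).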
Exhibitor Forum
Quantum & Other Post Moore Computing Technologies
Livestreamed
Recorded
TP
XO/EX
DescriptionHigh performance computing is at a critical inflection point. To move beyond Moore's Law, new computational paradigms have become a necessity. Quantum computing is no longer a future-facing technology; it is a powerful accelerator for today’s most complex challenges. This technical presentation introduces a hybrid quantum-classical framework designed to deliver scalable and reliable quantum systems for today's HPC workloads.
We will detail IonQ’s approach, centered on "qubit virtualization," which is enabled by a flexible architecture that supports any error correction code and features all-to-all connectivity. This framework, combined with advanced error mitigation techniques, unlocks the full potential of integrating quantum resources into existing HPC workflows.
Join us to explore how our accelerated roadmap and hybrid quantum architecture unlock and help solve new problem classes in critical areas like materials science and complex optimization.
Workshop
Livestreamed
Recorded
TP
W
DescriptionClassical computing hardware has been governed by Moore’s Law for decades, but in the coming years the ever-increasing computational performance of classical hardware is expected to plateau. Because of this, the computer hardware industry is looking for new routes to increase computational performance. Quantum computing holds great promise to extend computational performance, but current hardware suffers from high levels of noise, making these systems hard to use without error mitigation strategies. Recently, a method called sample-based quantum diagonalization, or SQD, was introduced by IBM to solve electronic structure problems of relevance to chemistry. SQD is an example of quantum-centric supercomputing (QCSC), where quantum hardware is used for specific aspects of a computational problem while the classical hardware solves the remaining aspects of the computational task. SQD uses the quantum hardware to access Slater determinants, which describe how the electrons are arrayed in the molecular orbitals of a molecule; the generated Slater determinants are then corrected on the classical hardware. The corrected Slater determinants form the basis used in several formulations of configuration interaction calculations. In my presentation, I will review the theoretical details of the SQD method and highlight several applications of SQD to chemistry, including the study of intermolecular interactions, calculations on large drug-like molecules and biomolecules, the inclusion of solvation effects, and the integration of the SQD method with statistical mechanical approaches.
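Schematically, and independent of the implementation details covered in the talk, the classical step diagonalizes the molecular Hamiltonian projected onto the subspace spanned by the sampled and recovered Slater determinants \(\{|D_i\rangle\}\):
\[
\sum_j \langle D_i|\hat{H}|D_j\rangle\, c_j = E\, c_i ,
\]
so the quantum hardware only proposes determinants, while the eigenvalue problem itself is solved classically.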
Panel
Applications & Application Frameworks
SC Community Hot Topics
Livestreamed
Recorded
TP
DescriptionQuantum computing is a growing accelerator technology with more computing centers investigating the use of such systems as part of their long-term HPC strategy. However, while the potential computational power of quantum computers for targeted problems is generally accepted in the HPC and QC communities, concrete application use cases are still a rarity. At the same time, there is a plethora of accelerators in this "Era of Heterogeneity." For this panel, we have invited five experts from different areas, backgrounds, and regions. All are investigating the feasibility and opportunities found in different acceleration approaches. We will discuss with them how they see the accelerator landscape developing and what the role of quantum is. With this panel, we aim to provide a realistic picture of where quantum should be in the field of acceleration and what applications are relevant, thereby offering the audience a well-founded picture based on first-hand experiences.
Exhibitor Forum
Quantum & Other Post Moore Computing Technologies
Livestreamed
Recorded
TP
XO/EX
DescriptionD-Wave Quantum Inc. develops quantum computers to tackle hard problems with energy-efficient computation. The promise of quantum computing is to extend computation beyond what is possible with classical computing architectures, the basis of Moore’s Law. This is being realized today, with recent publications on computational supremacy in magnetic material simulation [King].
D-Wave’s annealing quantum computers are installed in the Jülich Supercomputing Centre, the University of Southern California Information Sciences Institute, and Davidson Technologies Inc., as well as D-Wave’s R&D center in Canada. D-Wave’s Leap™ quantum cloud service delivers greater than 99.9% uptime and its systems provide subsecond response times.
D-Wave’s Exhibitor Forum talk will discuss the differences between annealing quantum systems and other quantum computing modalities while highlighting application areas for this technology, like magnetic material simulation, AI/ML, blockchain, and optimization.
Exhibitor Forum
Quantum & Other Post Moore Computing Technologies
Livestreamed
Recorded
TP
XO/EX
DescriptionQuantum computing provides an opportunity to greatly accelerate geospatial processing speeds. This session introduces a quantum machine learning (QML) methodology for converting classical algorithms into quantum-ready workflows. In our session we use the scale-invariant feature transform (SIFT) algorithm to demonstrate how QML can recast SIFT’s classical calculations into "quantum-kernels." We explain how design-time analytics such as quantum game theory and empirical parameter estimation yield better initial designs. We then describe how hybrid simulations within our Quantum Circuit Factory use variational and neuroevolutionary algorithms to optimize the quantum-kernels into implemented circuits. Finally, we demonstrate how our factory orchestrates these circuits to mirror classical SIFT, providing quantitative comparisons of classical and quantum processing pipelines. Attendees will gain insights into the engineering involved with converting classical algorithms to quantum, the tools and techniques that support this process, and initial methodologies for comparing classical and quantum dataflows.
Workshop
Partially Livestreamed
Partially Recorded
TP
W
DescriptionCombinatorial Optimization (CO) is one of the most important areas in the field of optimization, with practical applications found in every industry, including both the private and public sectors. In recent years, it was discovered that a mathematical formulation known as the QUBO (Quadratic Unconstrained Binary Optimization) problem can capture an exceptional variety of important CO problems found in industry. In this work, we explore the Quantum Approximate Optimization Algorithm (QAOA) as a quantum computing approach to solve the Knapsack Problem efficiently by formulating it as a QUBO model. Here we implement QAOA on a quantum emulator, leveraging quantum superposition and entanglement to explore multiple solutions simultaneously. The method offers a promising route for solving large, complex scheduling and resource allocation problems that are tough for classical algorithms, with potential applications in logistics, HPC resource scheduling, and energy optimization.
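For reference, one standard penalty-based QUBO encoding of the knapsack problem, with values \(v_i\), weights \(w_i\), capacity \(W\), item choices \(x_i \in \{0,1\}\), and binary slack bits \(y_k\) encoding the unused capacity, is
\[
\min_{x,\,y}\; -\sum_i v_i x_i \;+\; \lambda \Bigl( W - \sum_i w_i x_i - \sum_k 2^k y_k \Bigr)^{2},
\]
with the penalty weight \(\lambda\) chosen large enough that capacity violations are never optimal; QAOA then samples low-energy states of the corresponding Ising Hamiltonian.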
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionThe artwork explores the phenomena of quantum universality by illustrating the variance of coefficients of the characteristic polynomial for sequences of quantized circulant networks approaching the semiclassical limit, which is the limit of large quantum networks.
Quantum physics describes the behavior of the world at the scale of nanotechnology, where particles behave like waves. Quantum networks of waves in a spider’s web of wires are used to model quantum physics in a complex geometry. A phenomenon observed with quantum physics in a complex environment is universality, where many different quantum systems display the same statistical properties. Universality is possible where quantum waves have large energies or, equivalently, in large networks. In this work, it can be seen as rings of progressively smaller dots approaching a constant hue.
We computed the variance of the coefficients of the characteristic polynomial for sequences of quantized circulant networks of increasing size. The radius scales inversely with the size of the network and the hue corresponds to the value of the coefficients.
The characteristic polynomial encodes the spectrum of allowed energy values of the network, which is analogous to the spectrum of musical tones and overtones of a violin. Circulant networks consist of points arranged on a circle where wires connect each point to a set of its closest neighbors, so the network has a symmetry under rotations.
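Concretely, writing \(U_N\) for the unitary evolution operator of a quantized network of size \(N\) (a sketch of the setup; the precise quantization is as described by the artists), the quantities studied are the variances of the coefficients \(a_k\) in
\[
\det(\lambda I - U_N) = \sum_{k=0}^{N} a_k \lambda^{k},
\]
computed for networks of increasing size \(N\); the ring radius scales as \(1/N\) and the hue encodes the coefficient values, so the rings approach a constant hue in the semiclassical limit.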
Workshop
Livestreamed
Recorded
TP
W
DescriptionA quantum-centric supercomputer (QCSC) is a system that integrates quantum computing with traditional high-performance computing (HPC) resources. This next-generation, hybrid approach targets solving complex, real-world problems that are currently beyond the capabilities of classical supercomputers alone. Under this execution paradigm, the strengths of both computing modalities are integrated to enable a hybrid quantum-HPC program. This presentation will describe the RPI and IBM quantum-centric supercomputing (QCSC) architecture, which integrates the AiMOS supercomputer with RPI's 127-qubit Eagle-class quantum computer. From there, demonstration applications that rely on the QCSC architecture will be presented.
Workshop
Livestreamed
Recorded
TP
W
DescriptionQuantum–HPC integration has advanced from early experiments to operational prototypes that connect quantum accelerators with leadership-class supercomputers. This talk reviews what has been achieved so far, such as hybrid runtimes, middleware layers, and workflow orchestration bridging HPC and quantum systems, and what remains challenging, such as multiple emerging solutions and standards, adoption, use cases, and sustained software ecosystems.
It also examines structural challenges such as aligning hardware roadmaps with software readiness and maintaining realistic expectations of computational utility. Rather than promising speedups, the focus is on what hybrid infrastructures enable today: reproducible experimentation, system co-design, and preparation for future fault-tolerant regimes.
Birds of a Feather
Algorithms
Livestreamed
Recorded
TP
XO/EX
DescriptionIn recent years, randomized numerical linear algebra (RandNLA) proved to be more than a theoretical novelty: projects like RandLAPACK demonstrate its practical value across architectures, and projects like RandBLAS build trust in randomization as a tool for high-performance NLA. This BoF considers two main questions. First, what are the pressing issues in software standards and implementation that need to be resolved for RandNLA to become a core component of HPC? Second, how can we mobilize a community effort to make progress on these issues? The BoF will engage the audience to discuss the idea of growing the role of RandNLA in high performance computing and what it would take to scale from niche prototypes to robust, production-quality software libraries.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionThe growing volume of data in high performance computing (HPC) has made spatial query processing increasingly challenging due to high data transfer costs and limited memory bandwidth. To address these bottlenecks and reduce energy wasted on data movement, this work explores processing-in-memory (PIM) systems by executing range queries directly inside memory chips. Unlike prior PIM studies centered on linear scans or hash-based queries, this work is the first to map R-tree range queries onto PIM hardware. The proposed broadcast-based method constructs the R-tree bottom-up on the CPU, broadcasts top levels to UPMEM DPUs (DRAM processing units) for global filtering, and distributes lower levels for parallel batched queries in a CPU–DPU system. On the Lakes dataset (8M rectangles), it achieves 8× speedup over sequential CPU baselines, with synthetic benchmarks up to 10.9×. These results highlight the promise of PIM-based heterogeneous systems for scalable, energy-efficient spatial query processing in HPC workloads.
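A highly simplified, host-side sketch of the two-phase idea (illustrative only, not the UPMEM implementation): broadcast top-level bounding boxes select candidate partitions globally, and each selected partition then answers the range query against its own lower-level entries, which on the real system would run in parallel on the DPUs.

```python
# Simplified host-side view of the broadcast-based range query (illustrative only).
from dataclasses import dataclass

@dataclass
class Rect:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "Rect") -> bool:
        return not (self.xmax < other.xmin or other.xmax < self.xmin or
                    self.ymax < other.ymin or other.ymax < self.ymin)

def range_query(query, top_level, partitions):
    """top_level: partition id -> bounding Rect; partitions: partition id -> list of Rect."""
    # Phase 1 (broadcast top levels): global filtering against partition bounding boxes.
    candidates = [pid for pid, mbr in top_level.items() if mbr.intersects(query)]
    # Phase 2 (distributed lower levels): each selected partition scans its own entries;
    # on the real system these batched scans would run in parallel on the DPUs.
    hits = []
    for pid in candidates:
        hits.extend(r for r in partitions[pid] if r.intersects(query))
    return hits

# Example: two partitions, query overlapping only the first.
parts = {0: [Rect(0, 0, 1, 1), Rect(2, 2, 3, 3)], 1: [Rect(10, 10, 11, 11)]}
tops = {0: Rect(0, 0, 3, 3), 1: Rect(10, 10, 11, 11)}
print(range_query(Rect(0.5, 0.5, 2.5, 2.5), tops, parts))
```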
Workshop
short paper
Livestreamed
Recorded
TP
W
DescriptionAs quantum networking grows in importance, its study is of interest to an ever wider community. Several simulation frameworks allow for testing such systems on commodity hardware, but they can be difficult to work with and performance-limited due to their predominantly serial nature. The SeQUeNCe simulator addresses the latter issue, though it has not been proven to work well across architectures or at larger scales. For the former concern, we introduce BISQIT, a block-diagramme-based framework that models experiments in terms of distinct components and the data flows between them.
This provides a simple and modular approach to experimental design that allows for rapid iteration with a library of reusable parts. We demonstrate the flexibility of its design for prototyping and show a path for how to migrate designed experiments to SeQUeNCe for production-scale testing. Our results show the simplicity of the BISQIT model and provide new insight into SeQUeNCe's scalability behaviour using ORNL's Frontier.
Paper
Algorithms
Applications
Architectures & Networks
Livestreamed
Recorded
TP
DescriptionThe proliferation of low-precision units in modern high-performance architectures increasingly burdens domain scientists. Historically, the choice in HPC was easy: Can we get away with 32-bit floating-point operations and lower bandwidth requirements, or is FP64 necessary? Driven by artificial intelligence, vendors introduce novel low-precision units for vector and tensor operations, and FP64 capabilities stagnate or are reduced. This forces scientists to re-evaluate their codes, but a trivial search-and-replace approach to go from FP64 to FP16 will not suffice.
We introduce RAPTOR: a numerical profiling tool to guide scientists in their search for code regions where precision lowering is feasible. Using LLVM, we transparently replace high-precision computations using low-precision units, or emulate a user-defined precision. RAPTOR is a novel, feature-rich approach—with a focus on ease of use—to change, profile, and reason about numerical requirements and instabilities, which we demonstrate with four real-world multi-physics Flash-X applications.
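The motivation for tool-guided lowering rather than blanket search-and-replace is easy to reproduce; in the toy accumulation below (unrelated to the Flash-X applications in the paper), FP16 silently stops accumulating once the running total reaches 2048, where an increment of 1 is no longer representable.

```python
# Why naive FP64 -> FP16 conversion fails: a toy accumulation example.
import numpy as np

increment = np.float16(1.0)
total16 = np.float16(0.0)
for _ in range(100_000):
    total16 = np.float16(total16 + increment)   # stalls at 2048: 2048 + 1 rounds back to 2048

total64 = np.float64(0.0)
for _ in range(100_000):
    total64 += 1.0

print(total16, total64)   # ~2048.0 vs 100000.0
```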
ACM Gordon Bell Climate Modeling Finalist
ACM Gordon Bell Finalist
Awards and Award Talks
Applications
GB
Livestreamed
Recorded
TP
DescriptionWe present a Bayesian inversion-based digital twin that employs acoustic pressure data from seafloor sensors, along with 3D coupled acoustic-gravity wave equations, to infer earthquake-induced spatiotemporal seafloor motion in real time and forecast tsunami propagation toward coastlines for early warning with quantified uncertainties. Our target is the Cascadia subduction zone, with one billion parameters. Computing the posterior mean alone would require 50 years on a 512 GPU machine. Instead, exploiting the shift invariance of the parameter-to-observable map and devising novel parallel algorithms, we induce a fast offline-online decomposition. The offline component requires just one adjoint wave propagation per sensor; using MFEM, we scale this part of the computation to the full El Capitan system (43,520 GPUs) with 92% weak parallel efficiency. Moreover, given real-time data, the online component exactly solves the Bayesian inverse and forecasting problems in 0.2 seconds on a modest GPU system, a ten-billion-fold speedup.
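For readers outside the inverse-problems community, the central object here is, in the standard linear-Gaussian setting with zero prior mean, the posterior mean
\[
m_{\text{post}} = \bigl(F^{*}\Gamma_{\text{noise}}^{-1}F + \Gamma_{\text{prior}}^{-1}\bigr)^{-1} F^{*}\Gamma_{\text{noise}}^{-1}\, d
\]
for parameters \(m\) given data \(d\), parameter-to-observable map \(F\), noise covariance \(\Gamma_{\text{noise}}\), and prior covariance \(\Gamma_{\text{prior}}\); the offline-online split described above amounts to precomputing the \(F\)-dependent factors (one adjoint wave propagation per sensor) so that only small, fast operations remain once real-time data arrive.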
SCinet Network Research Exhibition
Not Livestreamed
Not Recorded
DescriptionNRI103,NRI104,NRI106
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionField-programmable gate arrays (FPGAs) in reconfigurable systems face escalating security threats from malicious bitstreams capable of causing denial-of-service, data leakage, or covert operations. Traditional detection methods often require source code or netlists, limiting their applicability for real-time protection.
We present a supervised machine learning approach that directly analyzes FPGA bitstreams at the binary level, enabling rapid detection without design-level access. Using byte frequency analysis, truncated singular value decomposition (TSVD), and SMOTE balancing, we developed and evaluated multiple classifiers on a dataset of 122 benign and malicious configurations for the Xilinx PYNQ-Z1 board. Random Forest achieved a macro F1-score of 0.97, validating the method’s effectiveness for resource-constrained devices.
The final model was deployed on PYNQ for integrated, on-device analysis. During the poster session, we will outline our detection pipeline, dataset preparation process, and performance results, emphasizing the novelty of binary-level analysis and its implications for real-time Trojan detection in embedded systems.
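A compact sketch of the described pipeline under stated assumptions: bitstreams are treated as raw byte arrays, reduced to byte-frequency histograms, compressed with TSVD, rebalanced with SMOTE, and classified with a random forest; the data below is a synthetic stand-in and the hyperparameters are placeholders.

```python
# Sketch of the binary-level detection pipeline (synthetic stand-in data;
# real inputs would be byte histograms of FPGA bitstream files).
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

def byte_frequency(raw: bytes) -> np.ndarray:
    """256-bin byte histogram of a raw bitstream, normalized by its length."""
    counts = np.bincount(np.frombuffer(raw, dtype=np.uint8), minlength=256)
    return counts / max(len(raw), 1)

# Synthetic, imbalanced stand-in for the benign/malicious configurations.
rng = np.random.default_rng(0)
X = np.stack([byte_frequency(rng.bytes(4096)) for _ in range(120)])
y = np.array([0] * 100 + [1] * 20)            # 0 = benign, 1 = malicious

clf = Pipeline([
    ("tsvd", TruncatedSVD(n_components=16)),  # compress 256-d histograms
    ("smote", SMOTE(k_neighbors=3)),          # rebalance the minority class
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
print(cross_val_score(clf, X, y, cv=5, scoring="f1_macro").mean())
```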
Birds of a Feather
State of the Practice
Livestreamed
Recorded
TP
XO/EX
DescriptionEmerging scientific needs are driving a new class of workflows that require near real-time processing, urgent computing, and time-sensitive decision-making. These workflows must bypass traditional buffering and shared file systems, instead relying on direct data streaming over WAN into HPC compute nodes. Latency and variability must be minimized to enable timely responses and experiment steering. This BoF will explore strategies, technologies, and policies to support these streaming workflows at scale, with a focus on building shared infrastructure and community practices to routinely enable this next generation of high-impact, time-critical scientific computing.
Workshop
Livestreamed
Recorded
TP
W
DescriptionImproving time-to-solution in molecular dynamics simulations often requires strong scaling due to fixed-size problems.
GROMACS is highly latency-sensitive, with peak iteration rates in the sub-millisecond range, making scalability on heterogeneous supercomputers challenging.
MPI's CPU-centric nature introduces additional latencies on GPU-resident applications' critical path, hindering GPU utilization and scalability.
To address these limitations, we present an NVSHMEM-based GPU kernel-initiated
redesign of the GROMACS domain decomposition halo-exchange algorithm.
Highly tuned GPU kernels fuse data packing and communication, leveraging hardware latency-hiding for fine-grained overlap.
We employ kernel fusion across overlapped data forwarding communication phases and utilize the asynchronous copy engine over NVLink to optimize latency and bandwidth. Our GPU-resident formulation greatly increases communication-computation overlap, improving GROMACS strong scaling performance across NVLink by up to 1.5x (intra-node) and 2x (multi-node), and up to 1.3x multi-node over NVLink+InfiniBand.
This demonstrates the profound benefits of GPU-initiated communication for strong-scaling a broad range of latency-sensitive applications.
Paper
Performance Measurement, Modeling, & Tools
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionCUDA is the de facto programming model for GPUs, widely used in the domains of HPC and AI. To obtain bare-metal performance, vendors and academics develop various profiling tools to guide optimization. However, most existing tools focus on hotspot analysis with limited capabilities in identifying actionable opportunities. To complement existing tools, we present RedSan, a novel profiling tool that leverages binary instrumentation to identify redundant instructions in fully optimized CUDA programs. Guided by RedSan, we are able to optimize programs such as PolybenchGPU, Rodinia, PASTA, DARKNET, and LULESH, yielding up to a 6.27× speedup and 3.00× reduction in memory instructions.
Workshop
Livestreamed
Recorded
TP
W
DescriptionDirective-based programming is a productive way to target parallel architectures like GPUs and CPUs. Popular solutions such as OpenMP and OpenACC are widely used because they are simple and broadly applicable to general-purpose codebases. However, they often fail to deliver consistently high and portable performance, particularly for reduction-intensive computations.
We introduce a new directive design based on Multi-Dimensional Homomorphisms (MDH). Unlike existing methods, this MDH-based directive focuses on data-parallel computations, such as tensor expressions, achieving superior and portable performance even for reduction-heavy workloads found in deep learning, data mining, and quantum chemistry. It also maintains or improves programmer productivity by using Python as the host language.
Experiments show that our approach outperforms current directive-based solutions across various workloads, including linear algebra, stencil computations, data mining, quantum chemistry, and deep learning.
Workshop
Livestreamed
Recorded
TP
W
DescriptionPerformance instability caused by unpredictable system noise remains a persistent challenge in high-performance and parallel computing. This work presents a reproducible methodology to characterize this variability through noise injection, tested using workloads implemented in OpenMP and SYCL to compare their performance resilience under noisy conditions. We design a noise injector that captures real system traces and replays the deltas as controlled noise. Using this approach, we evaluate multiple mitigation strategies, namely thread pinning, housekeeping core isolation, and simultaneous multithreading (SMT) toggling, under both default and noise-injected executions. Experiments with two benchmarks (N-body, Babelstream) and one mini-application (MiniFE) across two processor platforms show that while OpenMP consistently achieves higher raw performance, SYCL tends to exhibit greater resilience in noisy environments. Mitigation effectiveness varies with workload characteristics, system configuration, and noise intensity, with housekeeping core isolation offering the clearest benefits, particularly in high-noise scenarios.
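As a rough illustration of the capture-and-replay idea (not the authors' injector), one can time a fixed-work probe on the target system, keep the deltas above the fastest observed run as the noise trace, and later replay those deltas as busy-wait interference alongside the benchmark.

```python
# Illustrative capture-and-replay noise injection (not the paper's tool).
import time

def probe(iters: int = 200_000) -> float:
    """Fixed amount of work; run time varies with whatever noise the system adds."""
    t0 = time.perf_counter()
    s = 0
    for i in range(iters):
        s += i * i
    return time.perf_counter() - t0

def capture_trace(samples: int = 100) -> list[float]:
    times = [probe() for _ in range(samples)]
    baseline = min(times)                      # treat the fastest run as "noiseless"
    return [t - baseline for t in times]       # deltas = observed noise

def replay(deltas: list[float]) -> None:
    """Re-inject each recorded delta as a busy-wait burst, one per interval."""
    for d in deltas:
        end = time.perf_counter() + d
        while time.perf_counter() < end:       # controlled noise burst
            pass
        time.sleep(0.01)                       # quiet gap between bursts

if __name__ == "__main__":
    replay(capture_trace())
```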
Panel
AI, Machine Learning, & Deep Learning
HPC Software & Runtime Systems
SC Community Hot Topics
Livestreamed
Recorded
TP
DescriptionAI has drastically altered the area of software engineering. In this session, we will discuss its impact in the context of HPC and research software engineering: development and maintenance of the software used in scientific computing and research. Potential uses of AI include developing new code, developing tests, testing and debugging code, reducing technical debt, porting code, optimizing code, updating software when dependencies are updated, generating code documentation, and building workflows of existing components. Panelists who are both researchers and software developers/maintainers will discuss their experiences in AI and research software, and how they expect things to change in the short term and the long term.
Workshop
Livestreamed
Recorded
TP
W
DescriptionResearch software engineers (RSEs) are critical to the impact of HPC, data science, and the larger scientific community. They have existed for decades, though often not under that name. The past several years, however, have seen the development of the RSE concept, common job titles, and career paths; the creation of professional networks to connect RSEs; and the emergence of RSE groups in universities, national laboratories, and industry.
This workshop will bring together RSEs and allies involved in HPC, from all over the world, to grow the RSE community by establishing and strengthening professional networks of current RSEs and RSE leaders. We'll hear about successes and challenges that RSEs and RSE groups have experienced, and discuss ways to increase awareness of RSE opportunities and improve support for RSEs.
The workshop will be highly interactive, featuring breakout discussions and panels, as well as invited addresses and submitted talks.
Workshop
Livestreamed
Recorded
TP
W
DescriptionTomographic reconstruction (TR) aims to reconstruct a 3D object from 2D projections. It is an important technique across domains such as medical imaging and materials science, where high-resolution volumetric data is essential for decision-making. With advanced facilities such as the upgraded APS enabling unprecedented data acquisition rates, TR pipelines struggle to handle large data volumes while maintaining low latency, fault tolerance, and scalability. Traditional, tightly coupled, batch-oriented workflows are increasingly inadequate in such high-performance contexts. In response, we propose RESILIO, a composable, high-performance TR framework built atop the Mochi ecosystem that uses persistent streaming and fully leverages HPC platforms. Our design enables scalable and elastic execution across heterogeneous environments. We contribute a reimagined TR architecture, its implementation using Mochi, and an empirical evaluation showing up to 3490× reduction in the per-event overhead compared to the original implementation, and up to 3268× improvement in throughput with performance-tuned configurations using Mofka.
Panel
AI, Machine Learning, & Deep Learning
Architectures
SC Community Hot Topics
Livestreamed
Recorded
TP
DescriptionAs AI and high performance computing (HPC) demands surge, traditional electronic architectures are hitting their limits. This panel explores the next wave of computing, spotlighting photonic computing—a revolutionary approach that uses light instead of electrons to achieve unprecedented energy efficiency, bandwidth, and computational density. We will position photonic computing within a broader landscape of emerging technologies, including heterogeneous ecosystems that blend digital chips (CPUs, GPUs, TPUs) and analogue or quantum systems. Panelists will examine the challenges of power consumption, data movement, and integration with current infrastructures, offering insights into how next-generation architectures can meet AI and HPC demands while advancing environmental sustainability. The discussion will highlight key steps for moving photonic computing from lab-scale innovation to industrial-scale deployment, and address the status, challenges, and opportunities for integrating these technologies into today’s computing infrastructure and industry standards.
Paper
Algorithms
Livestreamed
Recorded
TP
DescriptionThe two-stage eigenvalue decomposition (EVD) method outperforms conventional one-stage methods on GPUs and heterogeneous architectures, especially when eigenvectors are not required. However, its performance advantage diminishes when performing back transformation to obtain eigenvectors. To address this, we propose two key solutions: 1) replacing BLAS3 operations with BLAS2 operations during the bulge-chasing back transformation for better performance, and 2) reordering the back transformation workflow from a backward pattern to a new parallelism-driven pattern to hide divide-and-conquer latency, at the cost of one additional GEMM computation. Experimentally, the proposed back transformation algorithm demonstrates significant performance improvements, outperforming the SOTA implementation in MAGMA by an average factor of 3.58x. For complete FP64 precision symmetric EVD with eigenvectors, the proposed algorithm, incorporating both solutions, surpasses the SOTA implementations in MAGMA and cuSOLVER by average factors of 2.62x and 2.21x, respectively.
Workshop
Livestreamed
Recorded
TP
W
DescriptionConventional von Neumann architectures cannot deliver the performance and efficiency required by today's demanding HPC and AI workloads. Dataflow and reconfigurable devices are a promising way forward, if only they were more developer-friendly, attractive, and seamless within the software ecosystem. Our invited speaker, Ilan Tayari, will tackle the hurdles on this front and point out novel approaches in the field, including NextSilicon's recently launched Intelligent Compute Architecture. These novel approaches will revolutionize the way application developers interact with reconfigurable heterogeneous devices.
Exhibitor Forum
Networking
Livestreamed
Recorded
TP
XO/EX
DescriptionOver the past years, the state of HPC has evolved at an unprecedented rate. As workloads grow in complexity and size, the network infrastructure must keep pace. This session explores Ethernet’s ascent in HPC, diving into its technical evolution, cost-effectiveness, and expansive ecosystem. Drawing comparisons to other interconnects, we’ll examine how Ethernet has matured to meet the demands of modern HPC workloads. Whether you're running an existing interconnect or planning your next deployment, this session will provide valuable insight into the advantages of Ethernet and its growing role in a rapidly changing HPC landscape.
Paper
HPC for Machine Learning
Livestreamed
Recorded
TP
DescriptionThe attention mechanism has become foundational to remarkable AI breakthroughs since the introduction of the Transformer, driving the demand for increasingly longer context to power AI frontier models. However, its quadratic computational and memory complexities pose a major challenge. Here, we propose RingX, a set of scalable parallel attention methods, optimized for HPC. Through better workload partitioning, communication scheme, and load balancing, we achieve up to 3.4X speedup compared to the current state-of-the-art on the Frontier supercomputer. RingX is specifically optimized for both bi-directional and causal attention, and its performance and validity are demonstrated by training a Vision Transformer (ViT) and a Generative Pre-trained Transformer (GPT), respectively. An end-to-end speedup of about 1.5X is obtained in both applications. To our knowledge, the achieved 38% model FLOPs utilization (MFU) for training Llama3-8B on a 1M-token sequence length using 4,096 GPUs is among the best training efficiencies on HPC systems.
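RingX itself is not reproduced here, but the blockwise online-softmax accumulation that ring-style attention methods generally rely on can be sketched in NumPy; the single-query setup and block sizes below are illustrative assumptions, and each key/value block would in practice live on a different rank and be passed around the ring.

```python
import numpy as np

def block_attention(q, k_blocks, v_blocks):
    """Single query row attended over key/value blocks, merged with the
    online-softmax trick that blockwise/ring attention schemes accumulate."""
    m = -np.inf                     # running max of scores
    l = 0.0                         # running softmax denominator
    acc = np.zeros_like(q)          # running (unnormalized) output
    for K, V in zip(k_blocks, v_blocks):
        s = K @ q                   # scores for this block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)   # rescale previous partial results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V
        m = m_new
    return acc / l

d, blocks, bs = 8, 4, 16
q = np.random.rand(d)
Ks = [np.random.rand(bs, d) for _ in range(blocks)]
Vs = [np.random.rand(bs, d) for _ in range(blocks)]

# Reference: ordinary softmax attention over the concatenated blocks.
K, V = np.vstack(Ks), np.vstack(Vs)
w = np.exp(K @ q - (K @ q).max())
ref = (w / w.sum()) @ V
assert np.allclose(block_attention(q, Ks, Vs), ref)
```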
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe RISC-V Vector Extension (RVV) introduces scalable, vector-length–agnostic operations with strong potential for high-performance computing (HPC). This paper presents a TSVC-based instruction coverage analysis of RVV to evaluate current compiler auto-vectorization support. We compile TSVC with GCC and Clang under both vector-length agnostic (VLA) and vector-length specific (VLS) modes and analyze the emitted instructions against the RVV specification. Our results quantify instruction usage across key groups, identify missed instructions, and classify the causes of failed vectorization, including compiler backend limitations, absent use cases in TSVC, and non-trivial or unsupported patterns. We also highlight TSVC’s limitations, including ambiguous kernel vectorizability and missing representations of modern HPC-relevant patterns. Finally, we suggest directions for enhancing benchmark suites to better reflect RVV capabilities and guide compiler development for HPC workloads.
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
DescriptionReinforcement learning (RL) has achieved notable success in complex decision-making tasks. Motivated by these advances, systems researchers have explored RL for optimizing system behavior. However, practical deployment remains uncommon, as existing RL frameworks are ill-suited for system-oriented use cases. To address this gap, we present RL4Sys, a lightweight RL framework designed specifically for seamless system-level integration. RL4Sys includes a minimal client that embeds easily within target systems to record trajectories and run inference from locally cached deep policies. RL4Sys's remote RL trainers, executed asynchronously and distributed across servers, leverage zero-copy gRPC and adaptive batching to update policies without blocking the original system. Our evaluation shows that RL4Sys matches the convergence behavior of conventional RL frameworks and achieves up to 220% higher throughput in environment-oriented settings compared to state-of-the-art systems such as RLlib, while incurring less than 6% runtime overhead relative to the original non-RL system.
Workshop
Performance Evaluation, Scalability, & Portability
Livestreamed
Recorded
TP
W
DescriptionThe introduction of tightly coupled heterogeneous architectures, such as AMD's MI300A and NVIDIA's Grace Hopper (GH200), addresses a bottleneck in accelerated computing, namely the CPU-GPU interface. Whereas the GH200 can be seen as a technological leap in CPU-GPU connectivity, greatly exceeding PCIe cadence, the unified memory architecture of the MI300A APU enables seamless communication through coherent caches. When the CPU and GPU execute concurrently, they contend not only for finite bandwidth but also for power in a power-constrained environment. In this paper, we extend the well-established Roofline model to capture the performance implications of contention during concurrent execution on the MI300A and GH200. We refine the analysis further by accounting for the impact of different memory allocators, the randomness of data, and the host and device arithmetic intensity. We conclude with a discussion of the evolution of GPU architectures and the impact on performance, portability, and programmability that emerging tightly coupled GPUs bring to the HPC landscape.
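As a rough back-of-the-envelope illustration (a naive contention model, not the paper's actual formulation, and with placeholder peak numbers rather than measured MI300A or GH200 figures), a Python sketch of the classical Roofline bound and a shared-bandwidth variant might look like this.

```python
def roofline(ai_flop_per_byte, peak_gflops, peak_bw_gbs):
    """Classical Roofline: attainable GFLOP/s = min(peak, AI * bandwidth)."""
    return min(peak_gflops, ai_flop_per_byte * peak_bw_gbs)

def contended_roofline(ai_cpu, ai_gpu, cpu_peak, gpu_peak, shared_bw, bw_split=0.5):
    """Naive contention model: CPU and GPU split the shared memory bandwidth."""
    cpu = roofline(ai_cpu, cpu_peak, shared_bw * bw_split)
    gpu = roofline(ai_gpu, gpu_peak, shared_bw * (1.0 - bw_split))
    return cpu, gpu

# Placeholder peaks (GFLOP/s, GB/s) -- illustrative only, not vendor figures.
print(roofline(ai_flop_per_byte=0.5, peak_gflops=50_000, peak_bw_gbs=3_000))
print(contended_roofline(0.5, 4.0, cpu_peak=5_000, gpu_peak=50_000, shared_bw=3_000))
```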
Workshop
Livestreamed
Recorded
TP
W
DescriptionScientific computing increasingly uses surrogate models to accelerate high-fidelity simulations, enable real-time predictions, and explore large design spaces. Building surrogates at scale is challenging: simulations are costly, data generation must be managed, and surrogate learning involves large, heterogeneous, evolving workflows. In active learning, where models guide data acquisition, these challenges intensify due to tight coupling between simulation, inference, and training. We present ROSE (RADICAL Orchestrator for Surrogate Exploration), a flexible, portable, and scalable framework supporting the full surrogate modeling lifecycle in HPC environments. ROSE integrates active learning with scalable orchestration, managing asynchronous execution across diverse resources while minimizing user effort. It supports in-situ/ex-situ workflows, online/offline training, and adaptive sampling. Applied to three use cases—electrolyte structure extraction, neutron diffraction structure recovery, and colloid phase classification—ROSE sustains high throughput with low overhead on Polaris, Perlmutter, and Delta, achieving 4–8× end-to-end speedups, with asynchronous orchestration delivering 1.5–3× gains over synchronous baselines.
Workshop
Livestreamed
Recorded
TP
W
Description- Yoshii: RSEs and the Future of HPC Architecture through Open, Scalable Chip Design
- Bello: Empowering AI/ML Innovation through Research Software Engineering
Workshop
Livestreamed
Recorded
TP
W
Description- Eiffert: Beginner’s Guide to Starting a Research Community (By a Beginner)
- Watson: To Join or Not to Join - RSE Feedback on the Perceived Value of Joining an Open Source Software Foundation
Workshop
Livestreamed
Recorded
TP
W
DescriptionEffective power management is crucial for balancing high performance and environmental impact in the exascale era, particularly for datacenters dominated by massively parallel GPU systems due to the rise of AI. While many strategies rely on deep application knowledge, there is a growing need for application-agnostic approaches. We introduce a node-level power management runtime designed for regular applications, featuring minimal overhead and seamless deployment across any HPC/AI system. Our approach detects, at runtime, repetitive execution patterns via spectral analysis and then traces per-pattern energy consumption. A simple gradient-descent optimizer gradually adjusts the GPU frequency until the least per-pattern energy (i.e., maximum energy efficiency) is found. With this approach, we demonstrate up to a 15% reduction in energy consumption for equivalent computational tasks, with no overhead and minimal impact on execution time. This solution has been validated across a diverse range of AI applications, and we discuss the resulting energy savings.
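A toy NumPy sketch of the two ingredients, spectral detection of a repetitive execution pattern and a greedy search for the energy-minimizing clock, is shown below; the power trace and the energy model are synthetic, and the vendor frequency-setting calls (for example via NVML) that a real runtime would issue are omitted.

```python
import numpy as np

def dominant_period(trace, sample_hz):
    """Find the strongest repetitive period (seconds) in a sampled power trace."""
    spectrum = np.abs(np.fft.rfft(trace - np.mean(trace)))
    freqs = np.fft.rfftfreq(len(trace), d=1.0 / sample_hz)
    k = np.argmax(spectrum[1:]) + 1          # skip the DC bin
    return 1.0 / freqs[k]

def tune_frequency(energy_of, f0=1500.0, step=50.0, iters=20):
    """Greedy descent on per-pattern energy as a function of clock frequency (MHz)."""
    f = f0
    for _ in range(iters):
        if energy_of(f - step) < energy_of(f):
            f -= step
        elif energy_of(f + step) < energy_of(f):
            f += step
        else:
            break                             # local minimum found: stop adjusting
    return f

# Synthetic trace: a 2 Hz repetitive kernel pattern plus noise.
t = np.arange(0, 10, 0.01)
trace = 100 + 20 * np.sin(2 * np.pi * 2.0 * t) + np.random.randn(t.size)
print(f"detected period ~{dominant_period(trace, sample_hz=100):.2f} s")

# Synthetic energy model with a minimum near 1200 MHz.
print(tune_frequency(lambda f: (f - 1200.0) ** 2))
```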
Workshop
Debugging & Correctness Tools
HPC Software & Runtime Systems
Livestreamed
Recorded
TP
W
SCinet
Not Livestreamed
Not Recorded
SCinet
Not Livestreamed
Not Recorded
SC26
Livestreamed
Recorded
TP
W
TUT
XO/EX
DescriptionKevin Hayden, SC26 General Chair, is a Senior Network Engineer at the U.S. Department of Energy’s Argonne National Laboratory. For over 30 years, he has been at the center of designing and operating the high-performance research networks that drive modern science. His work has supported the data-intensive collaborations behind some of the world’s most ambitious scientific efforts.
A member of the SC community for two decades, Hayden has contributed to nearly every part of the conference—from technical operations to leadership—serving as SCinet Chair in 2020. He believes the heart of SC lies in the connections between people, technologies, and ideas. As General Chair, he aims to create a space where collaboration thrives, and every participant can help shape the future of high-performance computing, networking, and data science.
Workshop
Livestreamed
Recorded
TP
W
DescriptionData races are bugs whose occurrence in shared memory parallel and concurrent systems can cause unexpected outcomes and undermine the integrity of computational results. CPU/GPU systems that use Unified Heterogeneous Memory—for simplified memory sharing between GPUs and CPUs—are highly prone to data races. Scabbard's design uses LLVM's pass-plugin system to instrument the user's CPU/GPU code so that writes, reads, and synchronizations are recorded into trace files during execution, with offline analysis subsequently performed to report races. The main contribution of this paper is in detailing our algorithms, implementation challenges, and the learning process faced by newcomers to the LLVM and ROCm/HIP ecosystems, so as to improve the learning experience for new LLVM tool-builders in the CPU/GPU space.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionWe present ACE (Asynchronous Communication and Execution), a C++17 library for scalable asynchronous task execution on high performance computing (HPC) systems. Integrated into a distributed traffic simulation workflow, ACE accelerates the computation of alternative routes, a key performance bottleneck in large-scale simulations. Unlike the previous Rust-based Evkit approach, ACE eliminates the multi-minute worker-spawning overhead and manages task granularity dynamically. Using scenarios for Prague and the Central Bohemia region, with datasets of up to 25 million routes, ACE achieved up to a 15x speed-up on city-scale workloads with shorter routes and a 1.45x improvement on larger regional workloads. These results highlight ACE’s ability to adapt to workload characteristics and improve both efficiency and scalability in HPC-based route computation.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionRCOMPSs is a scalable execution framework that integrates the R programming language with the COMPSs runtime to enable task-based parallel execution on manycore and distributed systems. RCOMPSs extends conventional R workflows by allowing functions to be annotated as tasks, which the runtime system analyzes to construct a task dependency graph (DAG). This graph guides dynamic scheduling, dependency resolution, and data transfers, thereby abstracting parallel execution from the user while preserving correctness. A straightforward example of dataset standardization illustrates the minimal programming effort needed to leverage parallelism. In contrast, more complex applications like K-means clustering demonstrate the framework's capability to represent iterative statistical algorithms in a task-oriented manner. Performance evaluation on Shaheen-III and MareNostrum 5 shows strong scalability up to 32 nodes with near-linear speedup, efficient weak scalability with increasing problem sizes, and effective utilization of up to 128 and 80 threads per node, respectively.
Workshop
Livestreamed
Recorded
TP
W
DescriptionHydroDynamics (HD) and MagnetoHydroDynamics (MHD) simulations play a central role in modeling physical processes in fields as diverse as astrophysics, nuclear fusion, and plasma physics. These simulations often involve the resolution of hyperbolic systems of partial differential equations using finite-volume methods, yielding stencil-like computation patterns over structured grids.
While CPUs and GPUs have long dominated HPC workloads, their efficiency is increasingly challenged by the need for better control over memory access patterns and energy consumption. In contrast, Field-Programmable Gate Arrays (FPGAs) offer reconfigurable hardware with customizable memory hierarchies and dataflow pipelines, making them especially appealing for streaming and memory-bound applications. Their potential to reduce the energy footprint of scientific simulations has attracted growing attention, especially in the context of exascale computing, where power constraints are becoming a limiting factor.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionExascale computing, powered by GPUs, is reshaping high-performance computing. Declarative languages such as Datalog naturally benefit from this shift, as recursive rules can be compiled into GPU-optimized relational operations. Unlike SQL, Datalog executes queries iteratively until a fixed point is reached, making it ideal for graph mining, deductive database, and symbolic AI. Existing engines (SLOG, LogicBlox, and Soufflé) target multi-core architectures and lack support for distributed multi-GPU systems. We address this gap with MNMGDatalog, the first multi-node, multi-GPU Datalog engine, which combines CUDA for intra-node parallelism with MPI for inter-node communication. Our design introduces GPU-parallel joins, scalable recursive aggregation, and iterative all-to-all communication strategies. To assess performance and efficiency, we developed Powerlog, the first GPU-based Datalog engine energy profiler. Experiments on Argonne’s Polaris supercomputer show up to 32× speedups over state-of-the-art distributed engines and reveal tradeoffs between scaling and energy use, establishing a foundation for energy-aware declarative analytics at scale.
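The GPU and MPI machinery of MNMGDatalog is not shown here, but the kind of fixed-point evaluation such engines parallelize can be sketched with pandas merges as a CPU stand-in for RAPIDS-style dataframe joins; the example below evaluates transitive closure semi-naively.

```python
import pandas as pd

def transitive_closure(edges: pd.DataFrame) -> pd.DataFrame:
    """Semi-naive evaluation of  path(x, y) :- edge(x, y).
                                 path(x, z) :- path(x, y), edge(y, z)."""
    path = edges.copy()
    delta = edges.copy()
    while not delta.empty:
        # Join the newly derived facts with edge on the shared variable y.
        joined = delta.merge(edges, left_on="dst", right_on="src", suffixes=("", "_e"))
        new = joined[["src", "dst_e"]].rename(columns={"dst_e": "dst"}).drop_duplicates()
        # Keep only facts not already known (the semi-naive delta).
        delta = new.merge(path, how="left", indicator=True,
                          on=["src", "dst"]).query("_merge == 'left_only'")[["src", "dst"]]
        path = pd.concat([path, delta], ignore_index=True)
    return path

edges = pd.DataFrame({"src": [0, 1, 2], "dst": [1, 2, 3]})
print(transitive_closure(edges).sort_values(["src", "dst"]).to_numpy().tolist())
```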
Workshop
Livestreamed
Recorded
TP
W
DescriptionTraining large neural networks is computationally demanding and often limited by synchronization overhead in distributed environments. Traditional data-parallel frameworks, such as Horovod or PyTorch DDP, average gradients at every batch, which can limit scalability due to communication bottlenecks.
In this work, we propose two novel data-parallel strategies that reduce synchronization by averaging weights and biases only at the end of each epoch. These methods are implemented using the PyCOMPSs task-based programming model and integrated into dislib, enabled by a new distributed tensor abstraction (ds-tensor) that supports multidimensional data structures suitable for deep learning workloads.
We evaluate our approach on classification and regression tasks using real-world datasets and federated learning scenarios. Results show up to 95% training time reduction and strong scalability up to 64 workers, while maintaining or improving model accuracy. Our strategies enable asynchronous, communication-efficient training and are well-suited for heterogeneous and large-scale HPC systems.
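The PyCOMPSs/dislib integration is not reproduced here, but a minimal NumPy sketch of the epoch-level averaging idea is shown below; plain SGD on a linear model and sequential loops stand in for the actual tasks and distributed tensors.

```python
import numpy as np

def local_epoch(w, X, y, lr=0.1, batch=32):
    """One epoch of plain SGD for linear regression on a worker's data shard."""
    for i in range(0, len(X), batch):
        xb, yb = X[i:i + batch], y[i:i + batch]
        grad = 2.0 * xb.T @ (xb @ w - yb) / len(xb)
        w = w - lr * grad
    return w

def epoch_averaged_training(shards, dim, epochs=5):
    """Average worker weights once per epoch instead of after every batch."""
    w = np.zeros(dim)
    for _ in range(epochs):
        locals_ = [local_epoch(w.copy(), X, y) for X, y in shards]  # would run in parallel
        w = np.mean(locals_, axis=0)                                # single sync per epoch
    return w

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
shards = []
for _ in range(4):                      # 4 simulated workers
    X = rng.normal(size=(256, 3))
    shards.append((X, X @ true_w + 0.01 * rng.normal(size=256)))
print(epoch_averaged_training(shards, dim=3))   # should approach true_w
```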
Workshop
Livestreamed
Recorded
TP
W
DescriptionAs large-scale deep learning models become integral to scientific discovery and engineering applications, it is increasingly important to teach students how to implement them efficiently and at scale. This paper presents a coding assignment that focuses on optimizing the Softmax function, a central component of many deep learning models, including attention mechanisms in transformer models. The assignment is designed for an undergraduate level Distributed Computing course (CPE 469, 10-week quarter system), and tailored to students with little or no prior experience in machine learning.
This assignment is one of seven designed to reinforce the foundational concepts of parallel programming. It was developed as part of an inquiry-based learning approach \cite{ibl1}, \cite{ibl2}, encouraging students to actively investigate, experiment, and discover solutions to real-world challenges. The assignment introduces essential deep learning concepts, then guides students through identifying independent tasks within the Softmax computation so they can implement parallel solutions using OpenMP and CUDA.
By integrating modern AI workloads into an HPC curriculum, this work equips students with both the conceptual understanding and practical experience needed to build scalable solutions in scientific computing.
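For reference, a NumPy version of the numerically stable Softmax makes the independent work explicit: two global reductions (max and sum) and two elementwise maps. The assignment's OpenMP and CUDA solutions are not shown here.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    m = np.max(x)               # reduction 1: global max (prevents overflow)
    e = np.exp(x - m)           # elementwise map, fully independent per element
    s = np.sum(e)               # reduction 2: global sum
    return e / s                # elementwise map

x = np.array([1000.0, 1001.0, 1002.0])   # would overflow without the max shift
print(softmax(x))                         # approx [0.090, 0.245, 0.665]
```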
Workshop
Livestreamed
Recorded
TP
W
DescriptionUnderstanding the irregular, dynamic communication patterns in HPC applications at scale is critical when evaluating potential software optimizations and hardware architectures. Current systems monitor communication behavior for entire applications as exhaustive traces or general-purpose aggregated statistics. These approaches often do not scale well, and the data they gather is frequently too generic or inflexible to drive specific hardware/software optimizations. This paper describes a new, configurable, histogram-based approach to gathering scalable, high-fidelity monitoring information about HPC communication that we implemented in the Vernier communication monitoring system. This approach enables targeted collection of statistical data about annotated communication patterns for online or offline analysis, benchmarking, or network simulations. We assess these capabilities by collecting communication patterns from several production HPC applications at scale, showing that the resulting statistical representations accurately characterize the communication patterns in these applications, and can be used to provide new insights into communication patterns of complex HPC applications.
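Vernier's actual interface is not shown here, but the histogram idea can be sketched in a few lines of Python: each annotated communication pattern keeps per-bin counts (here, of message sizes on log-spaced bins) rather than an exhaustive trace; the bin layout below is an illustrative choice.

```python
import bisect
from collections import defaultdict

# Log-spaced message-size bin edges, in bytes (illustrative choice).
BIN_EDGES = [2 ** k for k in range(0, 31, 2)]

class CommHistogram:
    """Per-pattern histogram of message sizes instead of an exhaustive trace."""
    def __init__(self):
        self.counts = defaultdict(lambda: [0] * (len(BIN_EDGES) + 1))

    def record(self, pattern: str, nbytes: int):
        self.counts[pattern][bisect.bisect_right(BIN_EDGES, nbytes)] += 1

    def report(self):
        for pattern, bins in self.counts.items():
            print(f"{pattern}: {sum(bins)} messages, bin counts {bins}")

h = CommHistogram()
for size in (64, 64, 4096, 1 << 20):
    h.record("halo_exchange", size)
h.record("allreduce", 8)
h.report()
```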
Workshop
Livestreamed
Recorded
TP
W
DescriptionPerformant all-to-all collective operations in MPI are critical to fast Fourier transforms, transposition, and machine learning applications. There are many existing implementations for all-to-all exchanges on emerging systems, with the achieved performance dependent on many factors, including message size, process count, architecture, and parallel system partition. This paper presents novel all-to-all algorithms for emerging many-core systems. Further, the paper presents a performance analysis against existing algorithms and system MPI, with novel algorithms achieving up to 3x speedup over system MPI at 32 nodes of state-of-the-art Sapphire Rapids systems.
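The paper's novel algorithms are not reproduced here, but the classic XOR pairwise-exchange schedule that many all-to-all implementations build on can be sketched in a few lines of Python (assuming a power-of-two process count), showing which peer each rank exchanges with in each round.

```python
def pairwise_exchange_schedule(nprocs):
    """XOR pairwise-exchange all-to-all: in round r, rank p exchanges with p ^ r.
    Assumes nprocs is a power of two; every rank meets every peer exactly once."""
    assert nprocs & (nprocs - 1) == 0
    return [[rank ^ r for rank in range(nprocs)] for r in range(1, nprocs)]

schedule = pairwise_exchange_schedule(8)
for r, partners in enumerate(schedule, start=1):
    print(f"round {r}: rank->peer {partners}")

# Sanity check: each rank meets every other rank exactly once.
for rank in range(8):
    assert sorted(row[rank] for row in schedule) == [p for p in range(8) if p != rank]
```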
Workshop
Livestreamed
Recorded
TP
W
DescriptionHybrid quantum–high performance computing (Q-HPC) workflows are emerging as a key strategy for running quantum applications at scale on noisy intermediate-scale quantum (NISQ) devices. These workflows must operate seamlessly across diverse simulators and hardware backends since no single simulator offers the best performance for every circuit type. Efficiency depends strongly on circuit structure, entanglement, and depth, making backend-agnostic execution essential for fair benchmarking, platform selection, and the identification of quantum advantage opportunities. We extend the Quantum Framework (QFw), a modular HPC-aware orchestration layer, to integrate local simulators (Qiskit Aer, NWQ-Sim, QTensor, TN-QVM) and a cloud backend (IonQ) under a unified interface. Benchmarking variational and non-variational workloads reveals workload-specific strengths: Qiskit Aer's matrix product state excels for large Ising models, NWQ-Sim leads on entanglement and Hamiltonian simulation, and distributed NWQ-Sim accelerates optimization tasks. These findings demonstrate that simulator-agnostic, HPC-aware orchestration enables scalable, reproducible Q-HPC ecosystems, advancing progress toward quantum advantage.
Exhibitor Forum
Data Analytics
Livestreamed
Recorded
TP
XO/EX
DescriptionAs AI adoption accelerates, data centers face unprecedented power and thermal challenges, with AI-driven workloads often consuming 10x the power of traditional IT. Meeting these demands requires a complete rethinking of infrastructure, where liquid cooling emerges as a critical enabler on the AI development roadmap. This session will examine the AI Heat Density Roadmap to 2030 and the products and architectures required for heat dissipation, focusing on direct liquid cooling, responsiveness across the entire thermal chain, and the connectivity and telemetry needed to protect this powerful but sensitive IT.
We will explore practical, modular approaches for both retrofits and greenfield builds, including liquid-to-refrigerant coolant distribution units (L2R CDUs) that enable phased deployment and hybrid integration with air-cooling systems. Attendees will gain actionable insights into supporting high-density racks, managing escalating heat loads, and implementing real-time management platforms to create resilient, efficient AI factories that can scale alongside rapidly evolving compute demands.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe present a comprehensive benchmarking study that evaluates the scaling performance of RDMA over Converged Ethernet (RoCE) and compares it with Infiniband in the context of large-scale LLM training workloads. While Infiniband is traditionally favored for its low-latency, high-bandwidth characteristics, it imposes significant infrastructure and operational costs. RoCE, leveraging commodity Ethernet and RDMA, offers a cost-effective alternative. Through extensive experiments on production clusters, we demonstrate that RoCE can achieve near-linear scaling performance comparable to Infiniband when properly configured. Our analysis spans data sharding strategies, quantization and activation recomputation techniques, batch size tuning, and system-level optimizations, providing practical guidance for designing scalable and efficient AI infrastructure.
Paper
Architectures & Networks
Livestreamed
Recorded
TP
DescriptionAs AI models exceed single-processor limits, cross-chip interconnects are essential for scalable computing. These links transfer cache-line–sized data at high rates, driving adoption of protocols like CXL, NVLink, and UALink for high bandwidth and small payloads. Faster transfers, however, increase error risk. Standard methods such as CRC and FEC ensure link reliability, but scaling to multi-node systems introduces new challenges, including detecting silently dropped flits in switches.
We present Implicit Sequence Number (ISN), which enables precise flit drop detection and in-order delivery without header overhead. We also propose Reliability Extended Link (RXL), a CXL extension integrating ISN to support scalable, reliable multi-node interconnects while preserving flit format. RXL elevates CRC to a transport-layer role for end-to-end data and sequence integrity, while FEC handles link-layer correction. This approach delivers robust reliability and scalability without reducing bandwidth efficiency.
Best Poster Presentations (Research, ACM SRC Grad/Undergrad)
Research & ACM SRC Posters
TP
DescriptionWe present a unified, out-of-core, GPU-accelerated singular value solver that achieves performance portability across diverse hardware platforms and data precisions for datasets exceeding GPU memory. The singular value decomposition (SVD) is fundamental for processing large-scale datasets, yet the diversity of computing architectures and the proliferation of precision formats pose significant challenges in heterogeneous environments. Traditional HPC libraries require separate implementations for each architecture and precision, limiting scalability and usability. Building on our previous work, where we developed an open-source unified solver achieving performance comparable to vendor-optimized libraries across multiple precisions and GPU platforms, we extend this capability to handle larger-than-memory datasets. We adapt a QR-based communication-hiding strategy to improve the compute-to-communication ratio and leverage Julia's multiple-dispatch for seamless backend integration. Our implementation significantly outperforms CPU-based LAPACK and remains only 3–5× slower than GPU-resident solvers across different hardware and data precision configurations.
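The Julia solver itself is not shown here, and whether this matches its exact strategy is not assumed, but the tall-skinny QR (TSQR) pattern behind QR-based communication hiding can be sketched in NumPy: row blocks are factored independently (and could stream from host memory), and only the small stacked R factors are combined.

```python
import numpy as np

def tsqr_r(A, nblocks=4):
    """Communication-avoiding TSQR: per-block QR, then QR of the stacked R factors.
    Returns an R factor equal (up to signs) to the R of a direct QR of A."""
    blocks = np.array_split(A, nblocks, axis=0)
    Rs = [np.linalg.qr(b, mode="r") for b in blocks]   # independent per block/tile
    return np.linalg.qr(np.vstack(Rs), mode="r")       # small combine step

A = np.random.rand(10_000, 32)
R_tsqr = tsqr_r(A)
R_ref = np.linalg.qr(A, mode="r")
# Compare via the Gram matrix, which is sign-independent: R^T R = A^T A.
assert np.allclose(R_tsqr.T @ R_tsqr, R_ref.T @ R_ref)
```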
Paper
Performance Measurement, Modeling, & Tools
Livestreamed
Recorded
TP
DescriptionMixed-precision algorithms have been proposed as a way for scientific computing to benefit from some of the gains seen for AI on recent high performance computing (HPC) platforms. A few applications dominated by dense matrix operations have seen substantial speedups by utilizing low-precision formats such as FP16. However, a majority of scientific simulation applications are memory bandwidth limited. Beyond preliminary studies, the practical gain from using mixed-precision algorithms on a given HPC system is largely unclear.
The High Performance GMRES Mixed Precision (HPG-MxP) benchmark has been proposed to measure the useful performance of an HPC system on sparse matrix-based mixed-precision applications. In this work, we present a highly optimized implementation of the HPG-MxP benchmark for an exascale system and describe our algorithm enhancements. We show for the first time a speedup of 1.6x using a combination of double and single precision on modern GPU-based supercomputers.
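HPG-MxP itself centers on GMRES, which is not reproduced here, but the general mixed-precision pattern it measures can be illustrated with a small iterative-refinement sketch in NumPy, where the expensive inner solve runs in FP32 while residuals and updates stay in FP64; the matrix below is a synthetic, well-conditioned stand-in.

```python
import numpy as np

def mixed_precision_solve(A, b, iters=3):
    """Iterative refinement: FP32 inner solves, FP64 residuals and updates."""
    A32 = A.astype(np.float32)
    x = np.zeros_like(b)
    for _ in range(iters):
        r = b - A @ x                                    # residual in FP64
        dx = np.linalg.solve(A32, r.astype(np.float32))  # cheap FP32 correction
        x = x + dx.astype(np.float64)
    return x

rng = np.random.default_rng(1)
n = 200
A = rng.normal(size=(n, n)) + n * np.eye(n)              # well-conditioned test matrix
b = rng.normal(size=n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))     # residual near FP64 accuracy
```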
Birds of a Feather
Community Meetings
Livestreamed
Recorded
TP
XO/EX
DescriptionSoftware has become central to all aspects of modern science and technology. Especially in high performance computing (HPC) and computational science and engineering (CSE), it is becoming ever-larger and more complex while computer platforms evolve and become more diverse. Simultaneously, the teams behind the software are becoming larger, more technically diverse, and more geographically distributed.
This BoF provides an opportunity for people concerned about these topics to share existing experiences and activities, discuss how we can improve on them, and share the results. Presentations and discussion notes will be made available at the BoF series website (http://bit.ly/swe-cse-bof).
SCinet
Not Livestreamed
Not Recorded
SCinet
Not Livestreamed
Not Recorded
DescriptionAn overview of the SC25 SCinet WAN Team's operation.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionAs high performance computing (HPC) systems scale toward the exascale era, operational data analytics (ODA) plays an increasingly central role in managing system security, health, scheduling, and scientific productivity. Supercomputing facilities continuously generate massive volumes of logs and system metrics. To derive actionable insights from these data, distributed database management systems (DBMSs) are often employed, but their behavior under realistic production HPC workloads remains underexplored. This poster presents ScODA (Supercomputing Operational Data Analytics), an emerging benchmarking pipeline designed to evaluate distributed DBMS solutions—including relational, document, time-series databases and lakehouse solutions—using real and synthetic HPC environment logs. By working alongside our business intelligence colleagues to systematically model and implement common ODA workflows, ScODA enables data-driven comparisons of competing DBMS platforms and identifies trade-offs in ingestion, querying, and concurrent access. We present our methodology, preliminary benchmarks, and lessons learned from applying ScODA to multiple DBMS platforms at Argonne National Laboratory.
Paper
Architectures & Networks
Livestreamed
Recorded
TP
DescriptionRDMA is vital for efficient distributed training across datacenters, but millisecond-scale latencies complicate the design of its reliability layer. We show that depending on long-haul link characteristics, such as drop rate, distance, and bandwidth, the widely used Selective Repeat algorithm can be inefficient, warranting alternatives like erasure coding. To enable such alternatives on existing hardware, we propose SDR-RDMA, a software-defined reliability stack for RDMA. Its core is a lightweight SDR SDK that extends standard point-to-point RDMA semantics — fundamental to AI networking stacks — with a receive buffer bitmap. SDR bitmap enables partial message completion to let applications implement custom reliability schemes tailored to specific deployments, while preserving zero-copy RDMA benefits. By offloading the SDR backend to NVIDIA’s Data Path Accelerator (DPA), we achieve line-rate performance, enabling efficient inter-datacenter communication and advancing reliability innovation for intra-datacenter training.
Workshop
Livestreamed
Recorded
TP
W
DescriptionHigh-performance computing environments face increasing challenges from diverse scientific workflows, imposing conflicting demands for stability, customization, and reproducibility that traditional monolithic software stacks cannot accommodate. We present a comprehensive approach to seamless, end-to-end containerized HPC environments that decomposes the technical challenge into five manageable areas: specification and construction of environments, session provisioning, scheduler integration, system integration, and security. We develop and evaluate prototypes across these five areas, demonstrating practical feasibility through Spack-based environment construction with CI/CD pipelines, transparent session access via PAM and Kubernetes, and flexible job execution using Slurm's native container support. Our system integration framework supports multiple MPI strategies, from host library injection to fully containerized stacks, accommodating diverse performance and portability requirements. Through this work, we demonstrate that comprehensive containerization of HPC environments can be achieved using open standards, providing enhanced reproducibility and flexibility without sacrificing user experience.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionWe present a comparative study of the productivity and performance of four programming languages: Python, Julia, C++, and DaphneDSL, for the Connected Components graph algorithm from the GAP benchmark suite. Using various code productivity metrics, we evaluated the effort of scaling applications from a local parallel version to a distributed implementation. Experiments carried out on the Vega EuroHPC system reveal that, with moderate coding effort, Julia offers the best performance, while DaphneDSL enables seamless distributed execution with no code changes, albeit at a small performance cost.
Tutorial
Livestreamed
Recorded
TUT
DescriptionOur goal is to increase the number of people in the workforce who can act as defenders of our high performance computing and data infrastructure. In this tutorial we cover weaknesses from the most recent "Stubborn Weaknesses in the CWE Top 25" list from MITRE. These weaknesses (coding flaws) are the ones most present in real-world security exploits and also the ones that have consistently stayed in the top 25 for at least five years. Attendees will learn how to recognize these weaknesses and code in a way that avoids them. Another issue affecting the security of our cyberinfrastructure is the fact that its software depends upon a myriad of packages and libraries, and those come from different sources. Dependency analysis tools—tools that find weaknesses in the software supply chain and develop a software bill of materials (SBOM)—can catch flaws in those packages and libraries, and that affects the safety of the application. The more programmers are exposed to training in addressing security issues, and the more they learn how to use dependency analysis tools, the bigger the impact that we can make on the security of our cyberinfrastructure.
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
DescriptionAs data volumes grow, the cost of moving large datasets increasingly limits scientific visualization performance. One promising solution is to analyze data where it is stored. This paper presents a pushdown architecture for pNFS-based storage systems that offloads early stages of a VTK pipeline—such as reading and filtering—to the pNFS data servers that hold the data. Using a FUSE-based interface, a pNFS client triggers remote processing and retrieves results by writing and reading special command and result files. Our design leverages pNFS clients' ability to locate file-resident servers, along with a recent Linux enhancement that enables efficient local access to pNFS data without exposing filesystem internals. Offloaded code runs with the user's credentials, preserving standard permission checks. Experiments with two real-world scientific datasets show up to 6.1× speedup in end-to-end visualization runtime and up to 7.1× in data loading, thanks to early data filtering that significantly reduces data movement.
Workshop
Livestreamed
Recorded
TP
W
DescriptionHDF5 is a popular data management and I/O library used by numerous scientific and industry applications. In recent years, HDF5 has added pluggable extensions that enhance its functionality, improve performance, and better utilize underlying hardware and file systems. Plugins play a crucial role in adding custom features, such as compression filters, virtual file drivers (VFDs), and virtual object layer (VOL) connectors, without requiring extensive changes to the source code or modifications to the main library. While the plugin capability makes HDF5 powerful to extend, plugins could also be misused to maliciously reroute HDF5 calls. To systematically improve the security of HDF5, in this study we explore the option of digitally signing plugins. This would help ensure the authenticity and integrity of any plugins that users may use. We discuss a few implementation scenarios in HDF5 and assess the accuracy and overhead associated with the plugin verification process.
Workshop
Livestreamed
Recorded
TP
W
DescriptionDistributed Filesystems (DFS) are a crucial component of modern computing environments, and their performance is critical to the success of all the facilities that rely on them. However, predicting the DFS I/O performance solely based on the storage system hardware is not trivial. In this paper, we address this challenge by presenting an empirical method that tries to quantitatively assess how hardware configuration choices influence the performance of a DFS, using Ceph as a case study. We investigate the influence of three hardware parameters—number of CPU cores, amount of RAM, and disk bandwidth. To control these variables, we relied on the Linux hotplug interface and Cgroups, avoiding additional software overhead. Our results reveal that for the analyzed workloads, decreasing hardware resources does not always yield proportional performance losses. This method offers practical insights for designing cost-effective distributed storage systems, remaining general enough to be applied to other filesystems.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe escalating demands of scientific collaborations necessitate advanced networking for deterministic, secure, and orchestrated services across multiple administrative domains. The Software-Defined Network for End-to-end Networked Science at the Exascale (SENSE) paradigm addresses these needs through intent-based networking and multi-domain orchestration. This paper evaluates SENSE's performance on a comprehensive multi-domain testbed, including GNA-G AutoGOLE, the National Research Platform (NRP), FABRIC, and production LHC CMS infrastructure. Our results demonstrate that intent-based service requests are successfully translated into network configurations, with average provisioning times of 183 seconds for simple services and 290 seconds for complex multi-domain workflows. Performance monitoring confirms that SENSE maintains guaranteed bandwidth allocations, enabling higher-priority data flows to complete significantly faster than in best-effort scenarios. This capability transforms the network into a first-class schedulable resource, optimizing scientific workflows by prioritizing data criticality and moving beyond best-effort limitations to achieve predictable and efficient data movement for data-intensive scientific endeavors.
Invited Talk
National Strategies
Livestreamed
Recorded
TP
DescriptionCreated in 2007, GENCI is the French HPC agency. As a public HPC infrastructure, it had an initial focus on serving the needs of academic open research. Throughout the years, the missions of this agency have evolved to include AI and quantum computing, and to spread to an ever-growing industrial open research community. This required moving towards a public-private infrastructure continuum, with sovereignty at heart, supported by projects like AI Factory France and CLUSSTER.
Value creation is only possible when users receive the right level of support and adequate orientation. That is why GENCI and its partners have set up high-level support teams in HPC and AI, and are currently exploring ways to federate and lead communities towards the use of HPC-QC environments through the HQI program and the Maisons du Quantique network. Convinced by the efficiency of these initiatives, an increasing number of industrial end-users are joining the movement and engaging in collaborations to leverage these computing capabilities to solve practical use cases. This presentation will showcase concrete examples of the results of GENCI's industrial engagement initiatives in HPC, AI, and quantum computing, demonstrating the essential role of public computing infrastructure in supporting technology transfer, sovereignty, and competitiveness.
Workshop
Livestreamed
Recorded
TP
W
DescriptionIPDRM invites researchers to present and discuss innovative solutions for runtime systems and middleware, addressing challenges like efficient resource utilization, data movement, memory consistency, task scheduling, energy consumption, and performance portability. This year, we focus on the system software required to bridge post-Moore's Law architectures with classical computing. This workshop will contain both technical papers and invited talks from industry and other practitioners in the field. IPDRM's goal is to promote research, discussion, and collaboration around middleware's role in integrating classical and non-traditional computing systems.
Workshop
Education & Workforce Development
Livestreamed
Recorded
TP
W
DescriptionWorkforce training at national laboratories and computing centers is essential and typically falls into two categories: foundational training for newcomers and advanced training for experienced users. Foundational topics such as version control, build systems, and basic HPC usage are widely transferable, while center-specific training varies with compute platforms and workflows. Emerging technologies span both categories, ranging from broadly applicable programming paradigms to hardware-specific skills.
To reduce redundancy and increase impact, national labs, computing centers, and vendors can collaborate by sharing and co-developing materials, co-presenting, and offering joint training events. Such coordination enhances accessibility, scalability, and consistency in HPC training across the community.
One example is the HPC Training Working Group, a community of teaching enthusiasts that meets monthly to exchange best practices, share challenges and solutions, and develop training materials. Their collaborative initiatives highlight the value of cross-institutional efforts in strengthening HPC workforce development.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionHigh performance computing systems require intricate platform-specific stacks and configurations, which poses a challenge for reproducing the same HPC ecosystem on a different platform, a key requirement for geo-redundancy, business continuity, and urgent computing. We present a method for declaratively defining portable HPC ecosystems that can be deployed rapidly and reliably with a high degree of automation. Our model enables infrastructure-layer portability, going beyond existing cloud-native solutions.
We introduce a two-tiered modular abstraction framework: provider-specific lower-level modules that handle the implementation details; and provider-agnostic high-level modules that define core infrastructure logic, designed around the versatile software-defined clusters (vClusters) developed at CSCS.
To evaluate our approach, we showcase a portable implementation of the Weather and Climate HPC vCluster that runs on the Alps ecosystem, and deploy it on Google Cloud Platform. Our work demonstrates the effectiveness of our declarative approach in migrating HPC systems across heterogeneous platforms.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionNeural networks trained on large datasets can be effective policies for the control of robotic manipulators. Using self-supervised learning, these networks can achieve near-perfect success rates on complex pick-and-place-style tasks. However, the speed of task completion is often a barrier to making learned policies practical for deployment. For instance, a task that requires 500 distinct token predictions demands many forward passes through the network in real time. Moreover, learning optimal task behavior, as in reinforcement learning, would require assigning state values across a long time horizon, which often impedes learning. To address these challenges, we present Shortcut Mixup Policy, a method to artificially reduce the task horizon length. Our method consists of training a model on next-token prediction tasks optionally conditioned on a target state-shortcut size. We present initial results using Shortcut Mixup Policy and propose future directions for improvement.
Birds of a Feather
Translation of HPC into Societal Context
Livestreamed
Recorded
TP
XO/EX
DescriptionThe annual business meeting of SIGHPC is your opportunity to hear about and discuss the status of SIGHPC and its chapters. All of the elected officers and many of the other volunteers will be present to answer your questions about SIGHPC. Representatives from our chapters will also be available. We will also be discussing upcoming plans for the year.
Paper
Algorithms
Applications
Data Analytics
Livestreamed
Recorded
TP
DescriptionSubgraph isomorphism is a fundamental graph problem with applications in diverse domains. Of particular interest is molecular matching, which uses a subgraph isomorphism formulation for the drug discovery process. While subgraph isomorphism is known to be NP-complete, in molecular matching a number of domain constraints allow for efficient implementations.
This paper presents SIGMo, a high-throughput, portable subgraph isomorphism framework for GPUs, specifically designed for batch molecular matching. SIGMo takes advantage of the specific domain formulation to provide a more efficient filter-and-join strategy: the framework introduces a novel multi-level iterative filtering technique based on neighborhood signature encoding to efficiently prune candidates before the join phase.
SIGMo is written in SYCL, allowing portable execution on AMD, Intel, and NVIDIA GPUs. Our experimental evaluation on a large dataset from ZINC demonstrates up to 1,470x speedup over state-of-the-art frameworks, achieving a throughput of 7.7 billion matches per second on a cluster with 256 GPUs.
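For readers unfamiliar with signature-based filtering, the hedged sketch below shows one common realization of the idea: summarize each vertex's neighborhood labels as a bitmask and prune data vertices whose signature does not cover that of the pattern vertex. It illustrates the general filter step only, not SIGMo's multi-level encoding or its SYCL kernels.

```python
# Hedged sketch of signature-based candidate filtering for subgraph matching.
# Each vertex's neighborhood is summarized as a bitmask over labels; a data
# vertex can only match a pattern vertex if its signature covers the pattern's.
# Illustrates the filter idea only, not SIGMo's actual encoding.
def build_label_bits(*label_maps):
    labs = sorted({lab for m in label_maps for lab in m.values()})
    return {lab: 1 << i for i, lab in enumerate(labs)}

def neighborhood_signature(adj, labels, label_bit):
    sig = {}
    for v, nbrs in adj.items():
        mask = 0
        for u in nbrs:
            mask |= label_bit[labels[u]]
        sig[v] = mask
    return sig

def filter_candidates(pat_adj, pat_labels, data_adj, data_labels):
    bits = build_label_bits(pat_labels, data_labels)
    psig = neighborhood_signature(pat_adj, pat_labels, bits)
    dsig = neighborhood_signature(data_adj, data_labels, bits)
    cands = {}
    for p in pat_adj:
        cands[p] = [d for d in data_adj
                    if pat_labels[p] == data_labels[d]          # same label
                    and len(data_adj[d]) >= len(pat_adj[p])     # enough neighbors
                    and (psig[p] & ~dsig[d]) == 0]              # signature covered
    return cands

if __name__ == "__main__":
    # Pattern: a C-O edge; data: a tiny "molecule" with C-O and C-C edges.
    pat_adj  = {0: {1}, 1: {0}};             pat_labels  = {0: "C", 1: "O"}
    data_adj = {0: {1, 2}, 1: {0}, 2: {0}};  data_labels = {0: "C", 1: "O", 2: "C"}
    print(filter_candidates(pat_adj, pat_labels, data_adj, data_labels))
```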
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionThis image combines three datasets into a single composition, revealing the beauty of structures explored with high performance computing.
On the left, a large-scale cosmology simulation run on the Aurora supercomputer shows the distribution of dark matter in the universe. The visualized subset contains 40 million particles. These particles were mapped to a continuous density field using a Smooth Particle Hydrodynamics interpolation in ParaView and rendered as a luminous volumetric cloud.
In the center, a procedurally generated 3D wavelet forms a flowing orange structure. Its oscillating patterns evoke both the dynamics of physical waves and the abstract elegance of signal analysis.
On the right, a Mandelbulb fractal emerges from mathematics. Computed with ParaView’s programmable data source, it was isosurfaced, smoothed, and rendered as a detailed polygonal form.
By placing physics simulation and abstract mathematical forms side by side, the image highlights HPC’s ability to bridge natural phenomena and computation, revealing patterns that unite science and art.
Acknowledgments: Simulation data courtesy of the HACC collaboration. This research used resources of the Argonne Leadership Computing Facility, a U.S. Department of Energy (DOE) Office of Science user facility at Argonne National Laboratory, and is based on research supported by the U.S. DOE Office of Science-Advanced Scientific Computing Research Program, under Contract No. DE-AC02-06CH11357.
Workshop
Livestreamed
Recorded
TP
W
DescriptionAs digital scaling trends have slowed over the past decade, there has been renewed interest in new computing paradigms such as analog. Analog computing has the potential to provide performance and efficiency beyond what is achievable by digital systems; however, many challenges remain. One such challenge is supporting complex applications using analog components that implement only a few computational kernels. We consider a class of hybrid analog + digital systems where analog accelerators are used as tightly integrated coprocessors within each core. The RISC-V ISA simplifies the design of hybrid systems, providing a mature software stack for the digital components and allowing system designers to focus on the analog-specific aspects of the architecture and software. To investigate the viability of these architectures for high performance computing, we evaluate two iterative linear solvers on hybrid analog + RISC-V processors using the Structural Simulation Toolkit.
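To make the evaluation question concrete, the sketch below runs a Jacobi iteration in which the matrix-vector product is delegated to a software stand-in for an analog coprocessor that injects small relative errors, probing how an iterative solver tolerates analog imprecision. This is a hedged illustration only, not the paper's Structural Simulation Toolkit setup, and the noise model is hypothetical.

```python
# Hedged sketch: probe how an iterative solver tolerates an imprecise analog
# matrix-vector product. The "analog" kernel below is a software stand-in that
# perturbs results; it does not model any specific analog accelerator.
import numpy as np

rng = np.random.default_rng(0)

def analog_matvec(A, x, rel_noise=1e-3):
    """Stand-in for an analog MVM coprocessor: exact product plus small noise."""
    y = A @ x
    return y * (1.0 + rel_noise * rng.standard_normal(y.shape))

def jacobi(A, b, tol=1e-6, max_iter=10_000, rel_noise=1e-3):
    D = np.diag(A)
    x = np.zeros_like(b)
    for k in range(max_iter):
        # The matrix-vector product comes from the (noisy) analog stand-in.
        r = b - analog_matvec(A, x, rel_noise)
        x = x + r / D
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, k
    return x, max_iter

if __name__ == "__main__":
    n = 64
    A = np.diag(4.0 * np.ones(n)) + np.diag(-1.0 * np.ones(n - 1), 1) \
        + np.diag(-1.0 * np.ones(n - 1), -1)        # diagonally dominant test matrix
    b = np.ones(n)
    for noise in (0.0, 1e-3, 1e-2):
        x, iters = jacobi(A, b, rel_noise=noise)
        print(f"noise={noise:g}: residual={np.linalg.norm(b - A @ x):.2e}, iters={iters}")
```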
ACM Gordon Bell Finalist
Awards and Award Talks
Applications
GB
Livestreamed
Recorded
TP
DescriptionWe present an optimized implementation of the recently proposed information geometric regularization (IGR) for simulation of compressible fluid flows at unprecedented scale, applied to multi-engine spacecraft boosters. We improve upon state-of-the-art computational fluid dynamics (CFD) techniques in computational cost, memory footprint, and energy-to-solution metrics.
Unified memory on coupled CPU-GPU or APU platforms increases problem size with negligible overhead. Mixed half/single-precision storage and computation on well-conditioned numerics is used. We simulate flow at 200 trillion grid points and 1 quadrillion degrees of freedom, exceeding the current record by a factor of 20. A factor of 4 wall-time speedup is achieved over optimized baselines. Ideal weak scaling is seen on OLCF Frontier, LLNL El Capitan, and CSCS Alps using the full systems. Strong scaling is near ideal at extreme conditions, including 80% efficiency on CSCS Alps with an 8-node baseline and stretching to the full system.
Paper
HPC for Machine Learning
State of the Practice
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionHPC systems use monitoring and operational data analytics to ensure efficiency, performance, and orderly operations. Application-specific insights are crucial for analyzing the increasing complexity and diversity of HPC workloads, particularly through the identification of unknown software and recognition of repeated executions, which facilitate system optimization and security improvements. However, traditional identification methods using job or file names are unreliable for arbitrary user-provided names. Fuzzy hashing the content of executables detects similarities despite different code versions or compilation approaches while preserving privacy and file integrity, overcoming these limitations. We introduce SIREN, a process-level data collection framework for software identification and recognition. SIREN improves observability in HPC job execution by enabling analysis of process metadata, environment information, and executable fuzzy hashes. Findings from an opt-in deployment campaign on LUMI show SIREN’s ability to provide insights into software usage, recognition of repeated executions of known applications, and similarity-based identification of unknown applications.
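As a hedged stand-in for content-based executable similarity (real deployments, including SIREN per the abstract, use fuzzy hashes), the snippet below uses byte n-gram Jaccard similarity to show why similar binaries remain recognizable after small edits while unrelated content does not. It is not SIREN's hashing scheme.

```python
# Hedged stand-in for content-based similarity of executables. Fuzzy hashes are
# used in practice; here a simple byte n-gram Jaccard similarity illustrates why
# similar binaries stay detectable after small changes. Not SIREN's actual scheme.
def ngrams(data: bytes, n: int = 8):
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def similarity(a: bytes, b: bytes, n: int = 8) -> float:
    """Jaccard similarity of byte n-grams, in [0, 1]."""
    sa, sb = ngrams(a, n), ngrams(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

if __name__ == "__main__":
    v1 = b"\x7fELF" + bytes(range(256)) * 16          # pretend binary, version 1
    v2 = bytearray(v1); v2[100:110] = b"PATCHEDXX!"   # small local modification
    other = bytes(reversed(v1))                        # unrelated content
    print(f"v1 vs v2   : {similarity(v1, bytes(v2)):.2f}")   # high similarity
    print(f"v1 vs other: {similarity(v1, other):.2f}")       # low similarity
```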
Workshop
Livestreamed
Recorded
TP
W
DescriptionInteractivity enables the exploitation of HPC in new and revolutionary ways, delivering many new and exciting opportunities for our community. Interactive HPC keeps users in the loop during job execution: a human monitors a job, steers the experiment, or visualizes results in order to make immediate decisions that influence the current or subsequent interactive jobs. Likewise, urgent computing combines interactive computational modeling with the near-real-time detection of unfolding disasters to take real-time actions. Supporting interactive and urgent workloads on HPC requires expertise in a wide range of areas and solving numerous technical and organizational challenges.
This workshop brings together stakeholders, researchers, and practitioners from across interactive and urgent computing with the wider HPC community. We will share success stories, case studies, and technologies to continue community building around leveraging interactive HPC as an important tool responding to disasters and societal issues.
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
DescriptionSketching is a widely used class of techniques aimed at generating compact representations of longer biological sequences. Instead of comparing sequences, sketches allow us to sample from a subspace of k-mers and use those samples for comparison, saving both time and memory in the end application. One of the key metrics to consider here is density, which refers to the fraction of the sampled k-mers retained by the sketch. While a lower density is preferable for space considerations, it could also impact the sensitivity of the mapping process.
In this work, we study sketch-based data sparsification with high performance computing to improve scalability in mapping. Our contributions are twofold: 1) we present a scalable parallel algorithmic framework for alignment-free mapping, called JEM-mapper, and 2) we present a sketch library called MHSketch by extending JEM-mapper to adopt different sequence sketching schemes. Experimental evaluation demonstrates the ability of our approach to significantly reduce density and reap performance benefits from it. In particular, results show that MHSketch achieves accurate mapping while reducing time-to-solution (speedups between 2.2x to 9.3x), and drastically reducing memory usage (>92% savings) compared to other tools.
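For context, the sketch below implements one classic sequence-sketching scheme, window minimizers, and reports its density (the fraction of k-mers retained). It only illustrates the generic idea; the schemes implemented in MHSketch may differ.

```python
# Hedged sketch of one classic sequence-sketching scheme (window minimizers) and
# its density, the fraction of k-mers retained. Illustrative only.
def minimizer_sketch(seq: str, k: int = 5, w: int = 4):
    """Return positions of window minimizers: in every window of w consecutive
    k-mers, keep the lexicographically smallest one."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    kept = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        kept.add(min(window, key=lambda i: kmers[i]))   # smallest k-mer in window
    return sorted(kept), len(kmers)

if __name__ == "__main__":
    seq = "ACGTACGTTGACCGTAGGCTAACGT"
    positions, total = minimizer_sketch(seq, k=5, w=4)
    print(f"kept {len(positions)} of {total} k-mers "
          f"(density {len(positions) / total:.2f}) at positions {positions}")
```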
Workshop
Livestreamed
Recorded
TP
W
DescriptionScience, data analytics, and AI workloads depend on distributed matrix multiplication. Prior work has developed an array of algorithms suitable for different problem sizes, partitionings, and replication factors. However, each existing algorithm supports only a subset of partitionings, so multiple algorithm implementations are required to cover the full space of possible partitionings. If no implementation is available for a given set of partitions, one or more operands must be redistributed, increasing communication overhead. We present a one-sided algorithm for distributed matrix multiplication supporting all combinations of partitionings and replication factors. Our algorithm uses index arithmetic to compute sets of overlapping tiles that must be multiplied together. This list of local matrix multiplies can then either be executed directly, or reordered and lowered to an optimized IR to maximize overlap. We implement our algorithm using a high-level C++-based PGAS programming framework, finding it competitive with state-of-the-art systems.
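A minimal illustration of the index-arithmetic step: when the partitioning of A's columns and B's rows disagree along the contraction dimension, intersecting the index ranges of each tile pair yields the list of required local multiplies. The sketch below uses illustrative boundaries and is not the paper's implementation.

```python
# Hedged sketch of the index-arithmetic step: when A's column partition and B's
# row partition disagree along the shared (contraction) dimension, the overlap
# of each tile pair determines a local multiply. Boundaries are illustrative.
def intervals(boundaries):
    """[0, 4, 10] -> [(0, 4), (4, 10)] as half-open ranges."""
    return list(zip(boundaries[:-1], boundaries[1:]))

def overlapping_multiplies(a_col_bounds, b_row_bounds):
    """Yield (a_tile, b_tile, (lo, hi)) for every pair of tiles whose index
    ranges along the contraction dimension intersect."""
    work = []
    for ai, (a_lo, a_hi) in enumerate(intervals(a_col_bounds)):
        for bi, (b_lo, b_hi) in enumerate(intervals(b_row_bounds)):
            lo, hi = max(a_lo, b_lo), min(a_hi, b_hi)
            if lo < hi:                      # non-empty overlap => local multiply
                work.append((ai, bi, (lo, hi)))
    return work

if __name__ == "__main__":
    # A's columns split as [0,6) [6,12); B's rows split as [0,4) [4,8) [8,12)
    for a_tile, b_tile, rng in overlapping_multiplies([0, 6, 12], [0, 4, 8, 12]):
        print(f"A tile {a_tile} x B tile {b_tile} over k-range {rng}")
```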
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
DescriptionIn-Memory Databases (IMDBs) are widely used with HPC applications to manage transient data, often using snapshot-based persistence for backups. Redis, a representative IMDB, employs both snapshot and Write-Ahead Log (WAL) mechanisms, storing data on persistent devices via the traditional kernel I/O path. This method incurs syscall overhead, I/O contention between processes, and SSD garbage collection (GC) delays. To address these issues, we propose SlimIO, which adopts I/O passthru to minimize syscall overhead and inter-process I/O interference.
Additionally, it leverages Flexible Data Placement (FDP) SSDs as backup storage to avoid performance degradation from SSD GC. Experimental results show that SlimIO reduces snapshot time by up to 25%, increases query throughput by up to 30% during non-snapshot periods, and lowers 99.9%-ile latency by up to 50%. Furthermore, it achieves a write amplification factor (WAF) of 1.00, indicating no redundant internal writes, thus extending SSD lifespan.
Paper
HPC for Machine Learning
Livestreamed
Recorded
TP
DescriptionPipeline parallelism serves as a crucial technique for training large language models, owing to its capability to alleviate memory pressure from model states with low communication overhead. However, in long-context scenarios, existing pipeline parallelism methods fail to address the substantial activation memory pressure, due to the peak memory consumption resulting from the accumulation of activations across multiple microbatches. Moreover, these approaches inevitably introduce considerable pipeline bubbles, further hindering efficiency.
To tackle these challenges, we propose SlimPipe, a novel approach to fine-grained pipeline parallelism that employs uniform sequence slicing coupled with one-forward-one-backward scheduling. It reduces the accumulated activations from several microbatches to just one, which is split into several slices. Although the slices are evenly partitioned, the computation cost is not equal across slices due to causal self-attention. We develop a sophisticated workload redistribution technique to address this load imbalance. SlimPipe achieves near-zero memory overhead and minimal pipeline bubbles simultaneously.
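A back-of-the-envelope model helps illustrate the imbalance: with S equal slices of L tokens, queries in slice i attend to all i·L earlier tokens plus, causally, part of their own slice, so attention work grows roughly linearly with the slice index. The simplified cost model below is illustrative only and is not SlimPipe's redistribution algorithm.

```python
# Hedged back-of-the-envelope model of causal-attention cost per sequence slice:
# with S equal slices of L tokens, queries in slice i attend to all i*L earlier
# tokens plus (causally) their own slice, so cost grows roughly linearly with i.
def slice_attention_pairs(num_slices: int, slice_len: int):
    """Approximate (query, key) pair counts handled by each slice."""
    costs = []
    for i in range(num_slices):
        cross = slice_len * (i * slice_len)              # attend to earlier slices
        local = slice_len * (slice_len + 1) // 2         # causal mask within slice
        costs.append(cross + local)
    return costs

if __name__ == "__main__":
    costs = slice_attention_pairs(num_slices=8, slice_len=1024)
    total = sum(costs)
    for i, c in enumerate(costs):
        print(f"slice {i}: {c / total:6.1%} of attention work")
    print(f"last/first imbalance: {costs[-1] / costs[0]:.1f}x")
```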
Birds of a Feather
System Software
Livestreamed
Recorded
TP
XO/EX
DescriptionSlurm is an open-source workload manager used on many of the TOP500 systems and provides a rich set of features.
An updated Slurm community survey will be distributed ahead of the BoF, and introduced at the start of the session.
Changes made in the Slurm 25.05 and 25.11 releases will be presented, alongside the future roadmap for 26.05 and beyond.
Initial results from the community survey will be discussed. Discussion will focus on how Slurm development should react to changes in Linux distribution lifecycles, Linux cgroup versions, container runtimes, and external tools such as MPI/PMIx.
Remaining time will be used as an open community forum.
Everyone interested in Slurm use and development is encouraged to attend.
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionWith the ever-increasing capacity expansion of AI and GenAI in data centers, fiber-optic cable density has experienced exponential growth while pathways and spaces within cabinets remain static. Efficiencies are needed to accommodate and properly manage the significant cable assembly volume. Properly routing cables on-site between GPUs/CPUs to leaf and spine switches among other applications can be challenging and time-consuming. Co-locating the network electronics with power and cooling apparatus in cabinets also adds to the cable routing challenge. This presentation will review an innovative approach that reduces connection time and risk on-site, provides spare connections for operations, and is optimized for efficiencies in both rack scale GPU/CPU and switch cabinets.
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
DescriptionThe I/O subsystem is often a critical bottleneck in HPC systems, making its optimization crucial for performance. Existing efforts to optimize the I/O in HPC applications rely on approaches that are time-consuming and resource-intensive. This work introduces SmartIO, an end-to-end workflow to optimize the I/O performance of HPC systems at runtime without requiring prior model training, profiling, or parameter searches. SmartIO leverages context-free grammars to predict future I/O calls during the application runtime, learns key I/O characteristics from predicted calls, and proactively optimizes performance before those calls occur using a rules-based mapping engine. Our evaluation shows that SmartIO achieves up to ~13× and ~12× improvements in IOR read and write bandwidth, respectively, and delivers a ~4× speedup in overall I/O bandwidth for Flash-X—all with negligible overhead. Compared to state-of-the-art I/O optimization tools, SmartIO delivers comparable or better performance while drastically reducing the tuning cost.
Workshop
Livestreamed
Recorded
TP
W
DescriptionAs the field of HPC grows ever larger, now more than ever, it is important to adapt to the rapidly evolving hardware landscape, leveraging the cutting edge and advancing beyond the limits of what is considered conventional computing. Smart Network Interface Cards (SmartNICs) are one such emerging technology that have the potential to overhaul classical computing paradigms. This paper will provide an overview of a novel data exchange framework which leverages SmartNICs to gather arbitrary host data from HPC systems and exchange it via three different methods with minimal system overhead. We discuss the latency with which the framework operates, along with the ways in which its varying configurations affect performance. Finally, we provide some context as to how the field of HPC will benefit from the introduction of SmartNICs, especially as they leverage the presented framework for a variety of future applications.
Workshop
Livestreamed
Recorded
TP
W
DescriptionNetwork-accessible databases are a common use case in modern data centers, often paired with pre-processing before storing results for later use. However, general-purpose CPUs struggle to keep up with current Ethernet line speeds. Furthermore, in such a compute pipeline, the CPU is mostly used to manage storage accesses, wasting compute resources and communication bandwidth.
Due to their wide data path, FPGAs are very suitable for network applications.
Hence, we propose an open-source framework for the seamless high-performance integration of custom FPGA-based network-to-storage accelerators. Our solution leverages the flexible communication interfaces of FPGAs, namely Ethernet and PCIe for direct access to NVMe storage, without host CPU interaction. We are able to saturate the bandwidth of both 100G Ethernet and state-of-the-art SSDs, and demonstrate our implementation in a case study performing DNN-based classification on an image stream.
Birds of a Feather
Community Meetings
Livestreamed
Recorded
TP
XO/EX
DescriptionSpack is an open-source package manager for scientific computing with a rapidly growing community of over 1,500 contributors. This year, Spack has undergone some of the most significant changes in its 12-year history, with the release of v1.0. This BoF will feature an update from developers, covering 1.0’s enhanced compiler dependency model, improved parallelism, and stable package API. We’ll announce version 1.1 with performance and usability improvements, and we will conduct a poll to understand how users have received v1.0. Finally, we’ll open the floor for questions. Help us make installing HPC software simple!
Paper
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionPreconditioned iterative sparse linear solvers are memory-efficient for large scientific simulations, but the dependences between iterations introduced by preconditioners limit parallelization. This issue is exacerbated on GPUs, which feature many parallel cores. We propose a sparsified preconditioned conjugate gradient (SPCG) solver that increases parallelism by reducing dependences through sparsification, while preserving convergence behavior. We evaluate the proposed SPCG using both ILU(0) and ILU(K) preconditioners on a wide range of symmetric positive definite (SPD) matrices. The proposed SPCG improves the performance of the iterative phase by geometric mean speedups of 1.23× and 1.65× over the non-sparsified PCG using ILU(0) and ILU(K), respectively, on an NVIDIA A100 GPU. SPCG also yields geometric mean end-to-end speedups of 1.68× and 3.73× over the non-sparsified versions with ILU(0) and ILU(K), respectively, on the same platform.
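For orientation, the sketch below is a textbook preconditioned CG loop (dense NumPy for clarity) that marks where the preconditioner is applied each iteration; the ILU triangular solves behind that step are the serial dependences SPCG sparsifies. The snippet is standard PCG with a simple stand-in preconditioner, not the paper's sparsified solver.

```python
# Textbook preconditioned conjugate gradient (dense NumPy for clarity). The
# apply_preconditioner step is where ILU triangular solves create the serial
# dependences discussed above; this is standard PCG, not the sparsified variant.
import numpy as np

def pcg(A, b, apply_preconditioner, tol=1e-8, max_iter=500):
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_preconditioner(r)          # <- the step SPCG targets
    p = z.copy()
    rz = r @ z
    for k in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, k + 1
        z = apply_preconditioner(r)      # <- and again every iteration
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

if __name__ == "__main__":
    n = 200
    A = np.diag(4.0 * np.ones(n)) + np.diag(-1.0 * np.ones(n - 1), 1) \
        + np.diag(-1.0 * np.ones(n - 1), -1)             # SPD tridiagonal test matrix
    b = np.ones(n)
    jacobi = lambda r: r / np.diag(A)                    # simple stand-in preconditioner
    x, iters = pcg(A, b, jacobi)
    print(f"converged in {iters} iterations, residual {np.linalg.norm(b - A @ x):.2e}")
```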
Paper
Algorithms
BSP
Livestreamed
Recorded
TP
DescriptionSparse Tensor Cores offer exceptional performance gains for AI workloads by exploiting structured 2:4 sparsity. However, their potential remains untapped for core scientific workloads such as stencil computations, which exhibit irregular sparsity patterns.
This paper presents SparStencil, the first system to retarget sparse TCUs for scientific stencil computations through structured sparsity transformation. SparStencil introduces three key techniques:
(1) Adaptive Layout Morphing, which restructures stencil patterns into staircase-aligned sparse matrices via a flatten-and-crush pipeline;
(2) Structured Sparsity Conversion, which formulates transformation as a graph matching problem to ensure compatibility with 2:4 sparsity constraints;
(3) Automatic Kernel Generation, which compiles transformed stencils into optimized sparse MMA kernels via layout search and table-driven memory mapping.
Evaluated on 79 stencil kernels spanning diverse scientific domains, SparStencil achieves up to 7.1x speedup (3.1x on average) over state-of-the-art frameworks, while reducing code complexity and matching or exceeding expert-tuned performance in both compute throughput and memory efficiency.
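As a reference point for the 2:4 constraint mentioned above, the sketch below checks and enforces that pattern (at most two nonzeros in every aligned group of four) using the common magnitude-based pruning recipe; it is not SparStencil's graph-matching-based transformation.

```python
# Hedged illustration of the 2:4 structured-sparsity constraint required by
# sparse Tensor Cores: in every group of 4 consecutive values, at most 2 may be
# nonzero. Magnitude-based pruning is the common generic recipe, shown here only
# for illustration.
import numpy as np

def satisfies_2_4(row: np.ndarray) -> bool:
    """True if every aligned group of 4 has at most 2 nonzeros."""
    groups = row.reshape(-1, 4)
    return bool(((groups != 0).sum(axis=1) <= 2).all())

def prune_to_2_4(row: np.ndarray) -> np.ndarray:
    """Keep the 2 largest-magnitude entries in each aligned group of 4."""
    out = row.reshape(-1, 4).copy()
    for g in out:
        drop = np.argsort(np.abs(g))[:2]     # indices of the 2 smallest magnitudes
        g[drop] = 0.0
    return out.reshape(-1)

if __name__ == "__main__":
    row = np.array([0.9, -0.1, 0.0, 0.4,   0.2, 0.7, -0.8, 0.05])
    pruned = prune_to_2_4(row)
    print("original :", row, satisfies_2_4(row))
    print("2:4 form :", pruned, satisfies_2_4(pruned))
```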
Exhibitor Forum
Networking
Livestreamed
Recorded
TP
XO/EX
DescriptionThe University of Texas at Dallas (UT Dallas) launched the High Performance Computing for Research and Education (HPCRE) initiative to support cutting-edge research with advanced computing capabilities. As HPC demands increased, legacy network infrastructure struggled with performance, visibility, and operational efficiency. In response, UT Dallas partnered with Juniper Networks to modernize its HPC network, replacing outdated Cisco and Dell systems with Juniper QFX series switches and Apstra automation. This transformation delivered reduced latency, proactive fault detection, and a unified network view, streamlining collaboration between IT and research teams. The session highlights how automation and predictive analytics improved capacity planning and decreased resolution times. Attendees will learn how a modern network architecture enhanced speed, agility, and reliability in HPC workloads—accelerating research outcomes and fostering innovation. This case study offers valuable insights into reimagining infrastructure to meet the growing demands of high performance computing in academia.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe utilized high-performance computing techniques, including GPU accelerators, to speed up calculations of the phonon dynamic structure factor used to model spectroscopy data measured at neutron and x-ray scattering facilities. This faster workflow is a first step toward experimental steering, enabling facility users to make informed decisions during beam time rather than after returning to their home institutions. A collection of functions in Phonopy, a mostly serial Python+C code, was identified as a bottleneck for high-fidelity calculations utilizing hundreds of thousands of points in reciprocal space. We created a proxy application replicating Phonopy’s run_dynamic_structure_factor function and used the Numba and CuPy libraries to run on the latest NVIDIA GH200 and AMD MI300A GPU accelerators. Two representative use cases, CaHgO2 and CsSnBr3, showed speedups of up to 10× and 15×, respectively. The utilization of high-performance computing accelerators showcases the potential use of Oak Ridge Leadership Computing Facility resources to rapidly analyze experimental data from the Spallation Neutron Source at Oak Ridge National Laboratory.
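The porting pattern referred to above can be sketched generically: batch a per-Q-point loop into array operations and swap the array module between NumPy and CuPy. The toy kernel below computes a simple structure-factor-like sum and is not Phonopy's run_dynamic_structure_factor.

```python
# Hedged sketch of the NumPy -> CuPy drop-in pattern used for this kind of port:
# batch a per-Q-point computation into array operations, then run the same code
# on GPU by swapping the array module. The toy kernel is a simple
# |sum_j exp(i Q.r_j)|^2 sum, not Phonopy's dynamic structure factor.
import numpy as np
try:
    import cupy as xp          # GPU path, if CuPy is installed
except ImportError:
    xp = np                    # CPU fallback; same array API

def structure_factor_like(q_points, positions):
    """q_points: (NQ, 3), positions: (NA, 3) -> (NQ,) toy intensity."""
    phases = xp.exp(1j * q_points @ positions.T)     # (NQ, NA), all Q points at once
    amplitudes = phases.sum(axis=1)                  # coherent sum over atoms
    return xp.abs(amplitudes) ** 2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = xp.asarray(rng.uniform(-np.pi, np.pi, size=(100_000, 3)))
    r = xp.asarray(rng.uniform(0.0, 10.0, size=(64, 3)))
    intensity = structure_factor_like(q, r)
    print("computed", intensity.shape[0], "Q points using", xp.__name__)
```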
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionAchieving 400 Gbps requires aggregating multiple flows across cores rather than pushing single-flow limits. However, with 16×25 Gbps TCP flows, random ephemeral ports cause receive-side scaling (RSS) to concentrate flows on single CPU cores, degrading throughput from 25 Gbps to below 5 Gbps.
Unlike receiver-side solutions that suffer from cache misses and state migration overhead, SRAP transparently controls source ports at senders, ensuring collision-free mapping without runtime remapping costs or application modification.
With 16×25 Gbps flows, random assignment achieves optimal distribution with probability 1.04%, causing throughput to vary from 44.8 to 395 Gbps, while our approach consistently delivers 23.3-25 Gbps per flow with guaranteed 1:1 flow-to-core mapping.
Paper
Algorithms
Applications
State of the Practice
Livestreamed
Recorded
TP
DescriptionCheckpoint/Restart (C/R) strategies are vital for fault tolerance in PDE-based scientific simulations, yet traditional checkpointing incurs significant I/O overhead. Lossy compression offers a scalable solution by reducing checkpoint data size, but conventional methods often lack control over physical invariants (e.g., energy), leading to instability such as oscillations or divergence in partial differential equation (PDE) systems. This paper introduces a stability-preserving compression approach tailored for PDE simulations by explicitly controlling kinetic and potential energy perturbations to ensure stable restarts. Extensive experiments conducted across diverse PDE configurations demonstrate that our method maintains numerical stability with minimal error magnification—even across multiple checkpoint-restart cycles—outperforming state-of-the-art lossy compressors. Parallel evaluations on the Frontier supercomputer show up to 8.4× improvement in checkpoint write performance and 6.3× in read performance, while maintaining relative L2 errors of ~2e-6 throughout continued simulation. These results provide practical guidance for balancing compression accuracy, stability, and computational efficiency in large-scale PDE applications.
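A hedged sketch of the control idea: quantize a velocity field under an absolute error bound (a stand-in for a real error-bounded compressor), measure the resulting kinetic-energy perturbation, and tighten the bound until the perturbation falls below a target fraction. The quantizer and tolerances below are illustrative, not the paper's method.

```python
# Hedged sketch of an energy-controlled lossy checkpoint step: quantize a field
# with a uniform error bound (stand-in for a real error-bounded compressor),
# then tighten the bound until the kinetic-energy perturbation is acceptable.
import numpy as np

def quantize(field, abs_error):
    """Uniform scalar quantization with |x - x'| <= abs_error."""
    step = 2.0 * abs_error
    return np.round(field / step) * step

def kinetic_energy(velocity, mass=1.0):
    return 0.5 * mass * np.sum(velocity ** 2)

def compress_with_energy_control(velocity, abs_error, max_rel_energy_err=1e-4):
    e_ref = kinetic_energy(velocity)
    while True:
        approx = quantize(velocity, abs_error)
        rel_err = abs(kinetic_energy(approx) - e_ref) / e_ref
        if rel_err <= max_rel_energy_err:
            return approx, abs_error, rel_err
        abs_error /= 2.0                     # tighten the bound and retry

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    v = rng.normal(size=(256, 256, 3))
    approx, used_bound, rel_err = compress_with_energy_control(v, abs_error=0.1)
    print(f"accepted error bound {used_bound:g}, "
          f"kinetic-energy perturbation {rel_err:.2e}")
```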
Workshop
Livestreamed
Recorded
TP
W
DescriptionAsynchronous Many-Task (AMT) runtimes manage parallelism by suspending and migrating tasks between processes, with their state captured in continuations. The efficiency of suspending, migrating, and resuming these continuations is critical to application performance.
This work directly compares stackful and stackless coroutines as continuation implementations in a cluster environment using RDMA-based coordinated work stealing. We implement and evaluate two functionally equivalent AMT runtimes for a fine-grained, recursive workload: one using traditional stackful coroutines, and another using C++20 stackless coroutines.
Our results show that both approaches yield nearly identical overall performance for small-state tasks. Stackful coroutines are created 2.4x faster, while stackless coroutines switch context 3.5x faster and have smaller frames. However, the smaller frame size of stackless coroutines does not significantly reduce communication time, which is dominated by network latency. We conclude that both coroutine types are viable, with stackless coroutines offering advantages as task state increases.
SCinet Network Research Exhibition
Not Livestreamed
Not Recorded
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe High Performance Computing (HPC) community is facing a period of change, where access to resources is uncertain and workflows must move between available environments. The economic and innovative power of cloud presents an opportunity by offering state-of-the-art orchestration frameworks like Kubernetes. However, the existence of environments does not guarantee access to them, and porting HPC applications to cloud is non-trivial. In this work, we redesign the orchestration of an ensemble-based workflow, the Multiscale Machine-Learned Modeling Infrastructure (MuMMI), for Kubernetes. We perform experiments representing a progression from a traditional MuMMI run on HPC to a fully portable variant running in cloud to assess the relative contributions of cloud-native features to workflow performance improvement. Moving from a traditional design based on service and filesystem components to an event-driven design, we demonstrate 62.24% and 40.29% faster workflow completion times for CPU and GPU setups, respectively, resulting in 45.0% and 38.3% lower costs.
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionWriting high performance parallel code takes a long time and has a steep learning curve. Today's LLMs are helpful, but not quite up to the task. We create an agent architecture with a fine-tuned model that achieves state-of-the-art results, allowing anyone to write code for GPUs effectively.
Paper
Data Analytics, Visualization & Storage
Livestreamed
Recorded
TP
DescriptionI/O performance is crucial to efficiency in data-intensive scientific computing, but tuning large-scale storage systems is complex, costly, and notoriously manpower-intensive, making it inaccessible for most domain scientists. In this study, we propose STELLAR, an autonomous tuner for high-performance parallel file systems. Our evaluations show that STELLAR always selects near-optimal configurations for the parallel file systems within the first five attempts, even for previously unseen applications. STELLAR’s human-like efficiency is fundamentally different from existing auto-tuning methods, which often require hundreds of thousands of iterations to converge. STELLAR achieves this through Retrieval-Augmented Generation, external tool execution, LLM-based reasoning, and a multi-step agent design to stabilize reasoning and combat hallucinations. STELLAR's architecture opens new avenues for addressing complex system optimization problems, especially those characterized by vast search spaces and high exploration costs. Its extremely efficient autonomous tuning will broaden access to I/O performance optimizations for domain scientists with minimal additional resource investment.
Paper
Applications
Performance Measurement, Modeling, & Tools
State of the Practice
Livestreamed
Recorded
TP
DescriptionThis study characterizes GPU resilience in Delta, a large-scale AI system that consists of 1,056 A100 and H100 GPUs, with over 1,300 petaflops of peak throughput. We used 2.5 years of operational data (11.7 million GPU hours) on GPU errors. Our major findings include:
1) H100 GPU memory resilience is worse than A100 GPU memory, with 3.2x lower per-GPU MTBE for memory errors.
2) The GPU memory error-recovery mechanisms on H100 GPUs are insufficient to handle the increased memory capacity.
3) H100 GPUs demonstrate significantly improved GPU hardware resilience over A100 GPUs with respect to critical hardware components.
4) GPU errors on both A100 and H100 GPUs frequently result in job failures due to the lack of robust recovery mechanisms at the application level.
5) We project the impact of GPU node availability on larger scales and find that significant overprovisioning of 5% is necessary to handle GPU failures.
Paper
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
Programming Frameworks
Livestreamed
Recorded
TP
DescriptionGraph convolutional networks (GCNs) are a fundamental approach to deep learning on graph-structured data. However, they face a significant challenge in training efficiency due to the high computational cost of Sparse-Dense Matrix Multiplication (SpMM). This paper presents StraGCN, the first GPU-accelerated SpMM implementation based on Strassen’s algorithm specifically designed for GCN training. First, we propose a horizontal fusion model for GPU kernels as an alternative to commonly-used multi-stream CUDA models, significantly improving data locality of on-chip shared memory for Strassen’s SpMM. Second, StraGCN exploits the immutability of the adjacency matrix in GCNs to reuse intermediate results from submatrix operations, substantially reducing redundant computations. Third, we propose two-stage matrix partitioning to mitigate load imbalance caused by the irregular distribution of non-zero elements. We evaluate StraGCN with 15 benchmark datasets. Experimental results show that StraGCN achieves performance speedups of 2.1×, 2.6×, and 3.3× compared with state-of-the-art GCN frameworks—GNNA, PyG, and DGL, respectively.
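For readers unfamiliar with the underlying scheme, the sketch below is the classic dense Strassen recursion, written to make the seven intermediate products M1-M7 explicit; these are the submatrix products whose reuse StraGCN exploits when the adjacency matrix is fixed. The sparse, GPU-fused kernels of StraGCN are not reproduced here.

```python
# Classic Strassen recursion on dense NumPy blocks (power-of-two sizes), shown to
# make the seven intermediate products explicit. Illustrative only; StraGCN's
# contribution is a sparse, GPU-fused realization of this scheme.
import numpy as np

def strassen(A, B, cutoff=64):
    n = A.shape[0]
    if n <= cutoff:
        return A @ B                      # fall back to the ordinary product
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((256, 256)), rng.standard_normal((256, 256))
    print("max |strassen - numpy| =", np.max(np.abs(strassen(A, B) - A @ B)))
```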
Workshop
Livestreamed
Recorded
TP
W
DescriptionScientific data streaming enables the real-time transfer, processing, and analysis of high-throughput experimental data, providing low-latency insights that are essential for adaptive experiment control. We present SciStream-as-a-Service (StreamHub), a secure, high-performance, and scalable framework that integrates Globus Compute with SciStream to deliver continuous memory-to-memory data transfer, in-transit processing, and robust zero-trust security while orchestrating the entire streaming setup across distributed facilities with approximately 4.3 s of overhead. We describe StreamHub's design, core components, and deployment model, and evaluate its performance under diverse configurations. Our results demonstrate that StreamHub can be deployed without privileged access, ensures end-to-end encryption and authentication across institutions, and achieves near line-rate throughput with under 2% overhead compared to unencrypted transfers, making it a practical solution for real-time scientific discovery and experiment steering.
Workshop
Livestreamed
Recorded
TP
W
DescriptionPropelled by the increasing need for near real-time feedback for user experiments on its X-ray beamlines, the Advanced Photon Source continues to investigate the use of streaming workflows, several of which have been successfully deployed on its local computing infrastructure. With ever-growing data volumes and compute resource needs, the ability to analyze beamline data at remote facilities is becoming increasingly important.
In this paper we investigate the possibility of using ESnet JLab FPGA Accelerated Transport (EJFAT) project infrastructure to bring X-ray detector data directly from the instrument into an analysis application running at a remote high performance computing center. To that end, we describe successful integration of PvaPy, a Python API for the EPICS PV Access protocol, with the EJFAT software library. We also discuss potential use cases, as well as illustrate system performance in terms of maximum achievable frame and data rates in a test environment.
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
DescriptionHeterogeneous platforms introduce new complexities into performance modeling and prediction. The intrinsic performance asymmetries found in these platforms require radically different approaches to manage the compute diversity of this polymorphic architectural design space. Core heterogeneity complicates our traditional notion of compute semantics: simply specifying the number of compute units of a given speed is no longer sufficient. Workload adaptation and complexity mitigation are no longer simply a question of volume (number of cores) but also of constitution (types of cores). Ensuring performance portability of workloads on these platforms requires understanding and investigating the interactions of workloads, compute units, and parameters, which together create a spectrum of performance opportunities extending across classes of platforms.
Using the Orange Pi 5, a heterogeneous asymmetric multiprocessing platform (AMP), to examine this prolific domain, we embark on a principled analysis journey into the performance implications of classical workloads (i.e., matrix-matrix multiply) on these platforms. We demonstrate techniques enabling complexity mitigation and performance portability across compute unit groups. Finally, by applying structural equation modeling (SEM) to this reference platform, we discover the most critical components impacting performance for these classical workloads, revealing component interactions affecting platform performance, and articulating the impact of parameter effects on platform performance using a novel and unprecedented approach in computer engineering.
Students@SC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionFuel your future with HPC at SC25! Join this dynamic and interactive career panel, moderated by Andrekka “AJ” Lanier from Lawrence Livermore National Laboratory, where innovation meets inspiration. In this 60-minute session, students will connect with professionals from government agencies, national labs, and industry who are driving breakthroughs in high performance computing. Discover how HPC ignites possibilities and empowers the next generation to blaze their own trails toward impactful careers. Don’t miss this opportunity to spark your journey in HPC and shape the future!
Students@SC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionIgnite your professional potential with our HPC Portfolio Course at SC25! Designed to showcase your skills and achievements in high performance computing, this course will guide you through creating a standout portfolio that sparks interest and opens doors to new opportunities. Whether you're advancing your career or preparing for your next big step, this is your chance to fuel your future in HPC.
Paper
Data Analytics, Visualization & Storage
Livestreamed
Recorded
TP
DescriptionError-bounded lossy compression is one of the most efficient solutions to reduce the volume of scientific data. For lossy compression, progressive decompression and random-access decompression are critical features that enable on-demand data access and flexible analysis workflows. However, these features can severely degrade compression quality and speed. To address these limitations, we propose a novel streaming compression framework that supports both progressive decompression and random-access decompression while maintaining high compression quality and speed. Our contributions are three-fold: (1) we design the first compression framework that simultaneously enables both progressive decompression and random-access decompression; (2) we introduce a hierarchical partitioning strategy to enable both streaming features, along with a hierarchical prediction mechanism that mitigates the impact of partitioning and achieves high compression quality—even comparable to state-of-the-art (SOTA) non-streaming compressor SZ3; and (3) our framework delivers high compression and decompression speeds, up to 6.7x faster than SZ3.
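The generic mechanism behind the two streaming features can be sketched as follows: independent blocks provide random access, and per-block base-plus-residual levels provide progressive refinement. The toy codec below omits prediction and entropy coding and is not the paper's hierarchical scheme.

```python
# Hedged sketch of the generic mechanism behind the two streaming features:
# independent blocks give random access, and per-block "base + residual" levels
# give progressive refinement. Real compressors add prediction and entropy
# coding; neither (nor the paper's hierarchical scheme) is modeled here.
import numpy as np

def encode(field, block=64, levels=(1.0, 0.1, 0.01)):
    """Split a 1-D field into blocks; store per-block quantized residual levels."""
    blocks = []
    for start in range(0, field.size, block):
        chunk = field[start:start + block]
        approx = np.zeros_like(chunk)
        stored = []
        for eb in levels:                       # coarsest error bound first
            step = 2.0 * eb
            resid = np.round((chunk - approx) / step) * step
            stored.append(resid)                # a real codec would entropy-code this
            approx = approx + resid
        blocks.append(stored)
    return blocks

def decode_block(blocks, index, upto_level):
    """Random access: rebuild one block, progressively, using levels [0..upto_level]."""
    return sum(blocks[index][: upto_level + 1])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    field = np.cumsum(rng.normal(size=1024))        # smooth-ish test signal
    blocks = encode(field)
    truth = field[128:192]                          # block index 2
    for lvl in range(3):
        err = np.max(np.abs(decode_block(blocks, 2, lvl) - truth))
        print(f"levels 0..{lvl}: max error {err:.3g}")
```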
Workshop
Livestreamed
Recorded
TP
W
DescriptionThere is a growing need to support high-volume, concurrent transaction processing on shared data in both high-performance computing and data center environments. A recent innovation in server architectures is the use of disaggregated memory organizations based on the Compute eXpress Link (CXL) interconnect protocol. While CXL memory architectures alleviate many concerns in data centers, enforcing ACID semantics for transactions in CXL memory faces many challenges.
This paper is a summary of a full paper at MEMSYS25, where we describe a novel solution for supporting ACID (Atomicity, Consistency, Isolation, Durability) transactions in a CXL-based disaggregated shared-memory architecture. We call this solution HTCXL for Hierarchical Transactional CXL.
HTCXL is implemented in a software library that enforces transaction semantics within a host, along with a back-end controller to detect conflicts across hosts. HTCXL is a modular solution allowing different combinations of HTM or software-based transaction management to be mixed as needed.
Birds of a Feather
Community Meetings
Livestreamed
Recorded
TP
XO/EX
DescriptionMembers of underrepresented groups often lack access to role models within their minority. The HPC community is still predominantly male, making it difficult for young women to find female "superheroes" to identify with. Such role models are crucial for career planning and guidance. This session aims to provide especially women with the opportunity to meet influential, well-recognized female HPC "superheroes" from academia, research labs, HPC centers, and industry. Join us to be inspired and find relatable role models as we work together to build a more inclusive and connected HPC community.
Birds of a Feather
Applications
Livestreamed
Recorded
TP
XO/EX
DescriptionWith the increasing prevalence of electric vehicles and Net Zero targets, battery simulation is crucial. HPC facilitates accurate simulations of battery electrochemistry and thermal behavior, allowing improved predictions of cell performance, cyclic life, and safety to inform R&D. This BoF will address the challenges of bringing a mature software ecosystem into the exascale and AI era. The BoF will consist of a combined talk from an international group who are engaged in developing battery simulations for exascale applications, followed by an audience discussion and panel session sharing insights, obstacles, and solutions to accelerate the collaborative development of batteries.
Invited Talk
Livestreamed
Recorded
TP
DescriptionThe aviation industry is entering a new era of innovation, with aircraft and engine manufacturers advancing technologies that will shape the future of flying.
GE Aerospace, a leading jet engine maker, is leveraging U.S. Department of Energy exascale supercomputers—cutting-edge tools that enable the rapid development of new jet engine technologies, such as the Open Fan engine architecture.
The Open Fan design eliminates the traditional outer casing, allowing for a larger fan size with reduced drag, which significantly improves fuel efficiency. Historically, designing, developing, and testing a new type of jet engine could take decades. However, high performance computing now accelerates technology development, enabling faster and more efficient designs. This breakthrough allows GE Aerospace to develop the Open Fan engine—targeting over 20% fuel efficiency improvement compared to current engines—within a single generation. This level of efficiency represents an unprecedented milestone for the industry.
Using advanced simulations, GE Aerospace engineers analyze the aerodynamics of an Open Fan mounted on an aircraft wing under simulated flight conditions. These simulations optimize hardware designs for enhanced efficiency, reduced noise, and improved overall performance.
Workshop
Livestreamed
Recorded
TP
W
DescriptionSustainable supercomputing is a pressing topic for our community, industry, and governments. Supercomputing has an ever-increasing need for computational cycles while facing the increasing challenges of delivering performance/Watt advances within the context of climate change, the drive towards net-zero, and geo-political-economic pressures.
Improving supercomputing sustainability provides many opportunities considering an end-to-end, holistic view of the HPC system, facility, site, and broader environment. All elements of the HPC system must be considered, from low-level circuits, up the software stack and beyond to power/cooling systems. The drive towards more sustainable supercomputing requires measurements, metrics, goals, and improvement processes.
This workshop will gather users, researchers, and developers to address the opportunities and challenges of supercomputing sustainability. Topics include, but are not limited to:
• Deployment of supercomputing systems
• Data center efficiency
• Software tools for measuring energy efficiency throughout the supercomputing system
• Standardization of measurement/reporting of key sustainability metrics and emissions
Birds of a Feather
Programming Frameworks
Livestreamed
Recorded
TP
XO/EX
DescriptionFor the past decade, the open-standard SYCL programming model has provided a portable way to program heterogeneous systems across application domains such as fusion energy, molecular dynamics, aerospace, and AI, running on some of the largest GPU-accelerated machines, such as Aurora.
In this BoF, we will bring together the community of everyone using and developing SYCL applications and implementations. We will showcase new features due for release at SC25, and discuss feedback and priorities for SYCL Next. A panel of SYCL experts, runtime/compiler implementers, and application specialists will lead an audience discussion and Q&A.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionSparse matrix kernels such as SpMV, SpTRSV, and Gauss-Seidel are critical in scientific computing, AI, and engineering, but they remain difficult to parallelize due to irregular memory access patterns. Traditional compiler techniques assume affine array accesses, which do not hold in sparse formats like CSR and CSC. As a result, existing compilers often leave sparse code under-optimized, missing significant opportunities for parallelism.
We present a sync-free, runtime-based transformation that automates loop parallelization for sparse kernels with loop-carried dependencies. Our approach traces memory reads and writes to construct dependence sets, then generates Triton kernels that use flag arrays to enforce correctness without global synchronization. This method generalizes across sparse kernels by leveraging properties such as associativity and affine simplifications, enabling efficient parallel execution.
We demonstrate our work with sparse triangular solves and related kernels, and will present performance results, methodology, and case studies in the poster session.
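For illustration, the flag-array idea behind the sync-free execution can be emulated serially in plain Python. The sketch below assumes a CSR lower-triangular solve and merely stands in for the Triton GPU kernels that the transformation actually generates; on a GPU, each row would spin on the flags of its dependencies instead of looping over a worklist.

    import numpy as np
    from scipy.sparse import csr_matrix

    def sptrsv_flag_based(L, b):
        """Serial emulation of a sync-free, flag-based lower-triangular solve.

        Row i may be computed only after the flags of all columns j < i that
        appear in row i are set; done[] plays the role of the flag array.
        """
        n = L.shape[0]
        x = np.zeros(n)
        done = np.zeros(n, dtype=bool)      # the "flag array"
        pending = list(range(n))
        while pending:
            still_pending = []
            for i in pending:
                cols = L.indices[L.indptr[i]:L.indptr[i + 1]]
                vals = L.data[L.indptr[i]:L.indptr[i + 1]]
                deps = cols[cols < i]
                if not done[deps].all():     # dependencies not ready yet
                    still_pending.append(i)
                    continue
                acc, diag = b[i], 1.0
                for j, v in zip(cols, vals):
                    if j < i:
                        acc -= v * x[j]
                    elif j == i:
                        diag = v
                x[i] = acc / diag
                done[i] = True               # publish the result via the flag
            pending = still_pending
        return x

    # Tiny usage example with a 3x3 lower-triangular system.
    L = csr_matrix(np.array([[2.0, 0, 0], [1.0, 3.0, 0], [0, 1.0, 4.0]]))
    b = np.array([2.0, 5.0, 9.0])
    print(sptrsv_flag_based(L, b))   # expect [1.0, 1.333..., 1.9166...]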
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe integration of quantum computing and classical high-performance computing (HPC) has powered recent examples of utility-scale calculations with practical applications. This talk presents two workflows implementing synergies between quantum and classical computing. First, the Quantum Resource Management Interface and its Slurm SPANK plugin expose quantum processing units as HPC resources for seamless scheduling and execution. Second, Qiskit Machine Learning stands out as a robust library for large-scale experimentation with quantum machine learning workflows across heterogeneous architectures. Together, these case studies illustrate how open-source innovation, supported by public-private initiatives like HNCDI (STFC-IBM), is transforming hybrid quantum-classical computing from proofs of concept to scalable assets.
Panel
Debugging & Correctness Tools
HPC Software & Runtime Systems
Systems Administration and/or Resource Management of HPC Systems
Livestreamed
Recorded
TP
DescriptionHigh performance computing (HPC) and its application software face unique testing challenges due to extreme concurrency, complex system architectures, and a variety of correctness requirements associated with floating-point numbers. Ensuring correctness, performance stability, and resilience at scale requires testing not only application codes and libraries but also system software and hardware-software interactions. Despite its critical importance, HPC testing methodologies often lag behind the rapid growth in system and application complexity. This panel brings together experts from system software, programming models, numerical libraries, and large-scale applications to discuss evolving challenges and emerging opportunities in HPC testing. Panelists will share lessons from real-world failures, explore new approaches—including formal verification, large-scale fault injection, and property-based testing—and debate whether testing should become a first-class research and operational priority for future HPC systems. Audience engagement will be encouraged to collaboratively envision how testing practices must evolve to meet the demands of exascale computing and beyond.
Exhibitor Forum
Software Tools
Livestreamed
Recorded
TP
XO/EX
DescriptionAs HPC systems scale to meet AI workloads, processor power densities are increasing beyond 1,000 W. Two-phase (2P) direct-to-chip (DTC) cooling offers a high-efficiency, reliable, and future-proof alternative to traditional air or single-phase liquid cooling. This talk presents a comprehensive system-level thermal analysis for 2P DTC cooling, breaking down the end-to-end temperature difference (from processor case temperature to facility water supply temperature) into three contributing thermal resistances: cold plate, vapor line pressure drop, and condenser. Each contribution is analyzed quantitatively for two different refrigerants: R1233zd(E) and R515B. The thermal stack-up for 2P DTC cooling is compared with single-phase DTC cooling, showing that 2P systems support higher allowable facility water temperatures with high power/heat flux processors, enabling improved energy efficiency. This presentation offers HPC system architects and thermal engineers insights into benchmarking, optimizing, and deploying 2P cooling technologies to meet the thermal challenges of AI and exascale computing.
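As a worked example of such a thermal stack-up, the short calculation below sums hypothetical thermal resistances to recover a case temperature from a facility water supply temperature; every number is an illustrative assumption, not data from the presentation.

    # Illustrative thermal stack-up for two-phase direct-to-chip cooling.
    # All resistance and temperature values are hypothetical examples.
    q = 1000.0          # processor heat load [W]
    t_facility = 32.0   # facility water supply temperature [degC]

    # Contributing thermal resistances [K/W] (made-up example values)
    r_cold_plate = 0.010
    r_vapor_line = 0.003   # effective resistance from vapor-line pressure drop
    r_condenser  = 0.012

    dt_total = q * (r_cold_plate + r_vapor_line + r_condenser)
    t_case = t_facility + dt_total
    print(f"end-to-end delta-T = {dt_total:.1f} K, case temperature = {t_case:.1f} degC")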
Paper
Applications
Livestreamed
Recorded
TP
DescriptionCryo-electron microscopy (cryo-EM) is a key technique for structural biology, but its computational efficiency, particularly during 3D reconstruction, remains a bottleneck. We introduce T2-RELION, a highly optimized version of RELION for cryo-EM 3D reconstruction on CPU-GPU platforms. RELION is a widely used open-source package in the cryo-EM community. We identify and resolve key inefficiencies in RELION’s parallelization strategy and memory management by proposing task parallelism and a three-phase GPU memory management strategy.
Furthermore, we leverage Tensor Cores to accelerate the hot-spot kernel for difference calculation, employing an advanced pipelining strategy to hide latency and enable thread block-level data reuse. On a quad-A100 GPU machine, performance evaluations demonstrate that T2-RELION outperforms RELION 4.0. For the hot-spot kernel, our optimizations achieve 1.90-23.7 times speedup. For the whole application using CNG and Trpv1 datasets, we observe 3.86 times and 2.68 times speedups, respectively.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe present an approach to add native pulse-level control to heterogeneous HPCQC stacks, using the Munich Quantum Software Stack (MQSS) as a case study. Pulse programs are captured by three low-level abstractions: ports (I/O channels), frames (reference signals), and waveforms (pulse envelopes). We identify representation challenges at the user-interface, compiler (IR), backend-interface, and exchange-format layers, and propose specific solutions: 1) a compiled C/C++ pulse API to avoid Python overhead, 2) LLVM extensions for pulse instructions, 3) a C-based backend interface to query hardware constraints, and 4) a portable pulse-sequence exchange format. The design provides an end-to-end pulse-aware compilation and runtime path for HPC environments and an architectural blueprint to integrate pulse-level operations without disrupting classical workflows.
Paper
HPC for Machine Learning
System Software and Cloud Computing
Not Livestreamed
Not Recorded
Partially Livestreamed
Partially Recorded
TP
DescriptionExisting dynamic graph neural network (DGNN) solutions still suffer from low data parallelism. To address this problem, we propose the topology-aware DGNN accelerator TaGNN. It presents a topology-aware concurrent execution approach in the accelerator design that calculates the final features of affected vertices while ensuring that unaffected vertices are loaded and computed only once per layer across multiple snapshots, maximizing data parallelism while minimizing memory usage. TaGNN develops a similarity-aware cell skipping strategy to selectively reuse the RNN results from the previous snapshot to bypass RNN operations, further improving data parallelism with minimal accuracy loss. TaGNN on a Xilinx Alveo U280 FPGA shows average speedups of 535.2x and 84.3x, and energy savings of 742.6x and 104.9x over state-of-the-art software DGNNs on Intel Xeon CPUs and NVIDIA A100 GPUs, respectively. TaGNN also outperforms DGNN-Booster, E-DGCN, and Cambricon-DG by average speedups of 13.5x, 10.2x, and 6.5x and energy savings of 15.9x, 11.7x, and 7.8x, respectively.
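A rough sketch of the similarity-aware cell skipping idea follows, with an assumed cosine-similarity test and a placeholder RNN cell; it only illustrates the reuse decision, not TaGNN's FPGA logic.

    import numpy as np

    def rnn_cell(h_prev, x):
        """Stand-in for an RNN cell update (illustrative only)."""
        return np.tanh(h_prev + x)

    def update_with_skipping(h_prev, x_prev, x_curr, threshold=0.99):
        """Reuse the previous snapshot's RNN output when the vertex feature is
        sufficiently similar to its previous value, bypassing the RNN work."""
        cos = np.dot(x_prev, x_curr) / (np.linalg.norm(x_prev) * np.linalg.norm(x_curr) + 1e-12)
        if cos >= threshold:
            return h_prev, True          # skipped: reuse cached result
        return rnn_cell(h_prev, x_curr), False

    h = np.zeros(4)
    x_prev = np.array([1.0, 0.0, 0.0, 0.0])
    x_curr = np.array([0.999, 0.01, 0.0, 0.0])   # nearly unchanged vertex
    h, skipped = update_with_skipping(h, x_prev, x_curr)
    print("skipped RNN update:", skipped)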
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
DescriptionDynamic resource management (DRM) enables the resources assigned to a job to be adjusted during execution. From a system perspective, DRM adds flexibility to resource allocation and job scheduling, with the potential to improve utilization, throughput, energy efficiency, and responsiveness. From an application perspective, it allows users to match resource requests to evolving needs, potentially reducing queue times and costs.
Despite these benefits and a decade of research, DRM remains largely an academic concept in HPC rather than a production feature. This is due to the need for coordinated changes across the entire software stack—applications, programming models, process managers, and resource managers—along with a holistic co-design effort to develop new scheduling and optimization policies.
We present a novel, end-to-end approach to DRM in HPC, introducing generic design principles for parallel programming models that integrate applications’ dynamic process management with the resource managers’ optimization capabilities. We apply these principles across the HPC stack, incorporating standards such as MPI and PMIx, to create a fully dynamic environment supporting diverse applications. This is paired with a performance-aware scheduling strategy based on steepest-ascent optimization.
Experiments on up to 100 nodes show moderate overheads for application process reconfiguration while delivering substantial gains in system throughput and average job turnaround time compared to static scheduling under high-load conditions.
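The flavor of a steepest-ascent reallocation step can be sketched as follows; the job structure and the saturating throughput model are invented placeholders rather than the showcased implementation.

    def steepest_ascent_step(alloc, free_nodes, throughput):
        """One steepest-ascent move over node allocations of malleable jobs.

        alloc      : dict mapping job name -> currently assigned node count
        free_nodes : number of idle nodes available
        throughput : callable (job, nodes) -> estimated throughput
                     (a hypothetical performance model)
        Returns (job, +1/-1) for the best single-node grow/shrink, or None.
        """
        best_gain, best_move = 0.0, None
        for job, n in alloc.items():
            if free_nodes > 0:                 # candidate: grow this job by one node
                gain = throughput(job, n + 1) - throughput(job, n)
                if gain > best_gain:
                    best_gain, best_move = gain, (job, +1)
            if n > 1:                          # candidate: shrink this job by one node
                gain = throughput(job, n - 1) - throughput(job, n)
                if gain > best_gain:
                    best_gain, best_move = gain, (job, -1)
        return best_move

    # Toy usage with a saturating speedup model (purely illustrative).
    model = lambda job, n: n / (1.0 + 0.1 * n)
    print(steepest_ascent_step({"jobA": 4, "jobB": 8}, free_nodes=2, throughput=model))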
Workshop
Education & Workforce Development
Livestreamed
Recorded
TP
W
DescriptionBy centering the workshop on a single, richly structured dataset, participants gained technical skills while developing deeper data intuition and problem-solving abilities across the AI/ML pipeline. Moving seamlessly from data familiarization to feature engineering, model selection, and evaluation, learners explored how algorithmic choices interact with dataset characteristics and research questions. The integrated hackathon reinforced these concepts, allowing teams to pose their own questions, identify necessary features, select appropriate models, and iterate on solutions within a realistic, end-to-end workflow. This continuity reduced cognitive load, encouraged reflection on successes and failures, and highlighted the trade-offs inherent in different analytical approaches. Together, these outcomes demonstrate how a project-based, single-dataset framework fosters holistic understanding, preparing participants to apply AI/ML methods thoughtfully and effectively. This approach sets the stage for discussing the broader novelty and pedagogical impact of the workshop.
Workshop
Livestreamed
Recorded
TP
W
DescriptionConventional parallel programming using explicit multithreading over modern multicore processors imposes significant complexity in organizing and balancing work across threads. Task-based models simplify parallel programming using runtimes that handle task scheduling and resource management, improving scalability and reducing developer effort.
This paper presents the structure and experiences of teaching the Parallel Runtimes for Modern Processors course (PRMP) at IIIT Delhi. The course introduces a basic task-based parallel programming model in the async–finish style. Students implement this programming model together with a general-purpose dynamic load-balancing runtime system. As the course advances, students gradually improve both the parallel programming model and the runtime to overcome limitations and challenges of modern processor architectures. We conclude with a qualitative and quantitative evaluation of the three offerings of PRMP to date, showing that the course has significantly improved students' understanding of how to write and execute parallel programs effectively.
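For readers unfamiliar with the async–finish style, here is a minimal Python illustration of the pattern; the course has students build their own model and runtime, so this is only a conceptual stand-in.

    from concurrent.futures import ThreadPoolExecutor

    _pool = ThreadPoolExecutor()

    class finish:
        """Scope that waits for every task spawned inside it (async-finish style)."""
        def __enter__(self):
            self.tasks = []
            return self
        def async_(self, fn, *args):
            self.tasks.append(_pool.submit(fn, *args))
        def __exit__(self, *exc):
            for t in self.tasks:
                t.result()            # join all spawned tasks before leaving the scope

    def partial_sum(data, out, idx, lo, hi):
        out[idx] = sum(data[lo:hi])

    data = list(range(1_000_000))
    out = [0, 0]
    with finish() as f:               # both halves run as asynchronous tasks
        f.async_(partial_sum, data, out, 0, 0, 500_000)
        f.async_(partial_sum, data, out, 1, 500_000, 1_000_000)
    print(sum(out))                   # 499999500000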
Workshop
Livestreamed
Recorded
TP
W
DescriptionGraph processing workloads continue to grow in scale and complexity, demanding architectures that can adapt to diverse compute and memory requirements. Traditional scale-out accelerators couple compute and memory resources, resulting in resource underutilization when executing workloads with varying compute-to-memory intensities. In this paper, we present TEGRA, a composable, scale-up architecture for large-scale graph processing. TEGRA leverages disaggregated memory via CXL and a message-passing communication model to decouple compute and memory, enabling independent scaling of each. Through detailed evaluation using the gem5 simulator, we show that TEGRA improves memory bandwidth utilization by up to 15% over state-of-the-art accelerators by dynamically provisioning compute based on workload demands. Our results demonstrate that TEGRA provides a flexible and efficient foundation for supporting emerging graph analytics workloads across a wide range of arithmetic intensities.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionWe present a framework that implements multiresolution analysis (MRA) on top of Template Task Graph (TTG), a distributed, task-based data-flow programming model. MRA is broadly applied across scientific domains for its ability to capture both local and global features with high accuracy, and its adaptive tree-based structure maps naturally onto data-flow execution models. TTG addresses central challenges in modern high performance computing by improving programmer productivity and enabling performance portability across heterogeneous architectures. To the best of our knowledge, this ongoing work is the first demonstration of a multiwavelet-based MRA achieving substantial performance gains on GPUs. We will present our work using visual artifacts in the poster to demonstrate the challenges and proposed solution.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionThe evaluation of long-range pairwise electrostatic forces is the most computationally intensive component of molecular dynamics (MD) simulations. The fast multipole method (FMM) is an alternative to reduce the computational complexity. In this work, we implemented a hybrid parallel FMM with MPI and GPU acceleration. We further utilized the Tensor Core to accelerate the expensive multipole-to-local (M2L) kernel, and optimized the M2L kernel by reducing redundant computation. Meanwhile, we integrated our FMM into GROMACS. The experimental results show that the use of Tensor Core improves the performance of the M2L kernel, and our FMM implementation, integrated with GROMACS, is effective.
Paper
Applications
Livestreamed
Recorded
TP
DescriptionAI has been integrated into HPC across various scientific fields, significantly enhancing performance. In molecular dynamics simulations, HPC+AI facilitates the investigation of atomic-scale physical properties using machine-learning interatomic potentials (MLIPs). However, general-purpose ML tools (e.g., TensorFlow) used in MLIPs are not optimally matched, leading to missed optimization opportunities due to the higher computational complexity and greater diversity of HPC+AI applications compared to pure AI scenarios. To address this, we introduce TENSORMD, an MLIP independent of existing ML tools, enabling flexible optimizations that standard ML frameworks cannot support. TENSORMD outperforms a state-of-the-art MLIP—winner of the 2020 Gordon Bell Prize and built on an ML tool—by 1.88x on NVIDIA A100 GPU. Additionally, TENSORMD was evaluated on two supercomputers with different architectures, achieving significantly reduced time-to-solution and supporting molecular dynamics simulations at scales beyond 50 billion atoms.
Workshop
Livestreamed
Recorded
TP
W
DescriptionDeploying new supercomputers requires testing and evaluation via application codes. Portable, user-friendly tools enable evaluation, and the Multicomponent Flow Code (MFC), a computational fluid dynamics (CFD) code, addresses this need. MFC provides a toolchain that automates input generation, compilation, batch job submission, regression testing, and benchmarking. The toolchain design enables users to evaluate compiler-hardware combinations for correctness and performance with limited software engineering experience. As with other PDE solvers, wall time per spatially discretized grid point serves as a figure of merit. We present MFC benchmarking results for five generations of NVIDIA GPUs, three generations of AMD GPUs, and various CPU architectures, utilizing Intel, Cray, NVIDIA, AMD, and GNU compilers. These tests have revealed compiler bugs and regressions on recent machines such as Frontier and El Capitan. MFC has benchmarked approximately 50 compute devices and 5 flagship supercomputers.
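The figure of merit is a simple ratio, as the toy calculation below shows; the numbers are hypothetical, not MFC benchmark results.

    # Wall time per spatially discretized grid point (here further
    # normalized per time step); example numbers are hypothetical.
    wall_time_s = 120.0
    time_steps = 1_000
    grid_points = 64 * 64 * 64

    fom = wall_time_s / (time_steps * grid_points)
    print(f"{fom * 1e9:.2f} ns per grid point per step")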
Workshop
Livestreamed
Recorded
TP
W
DescriptionModern scientific discovery increasingly requires coordinating distributed facilities and heterogeneous resources, forcing researchers to act as manual workflow coordinators rather than scientists. Advances in AI leading to AI agents open exciting new opportunities to accelerate scientific discovery by providing intelligence as a component in the ecosystem. However, it is unclear how this new capability would materialize and integrate in the real world. To address this, we propose a conceptual framework where workflows evolve along two dimensions, intelligence (from static to intelligent) and composition (from single to swarm), to chart an evolutionary path from current workflow management systems to fully autonomous, distributed scientific laboratories. With these trajectories in mind, we present an architectural blueprint that can help the community take the next steps towards harnessing the opportunities in autonomous science with the potential for 100x discovery acceleration and transformational scientific workflows.
Workshop
Livestreamed
Recorded
TP
W
DescriptionIn this new exascale computing era, applications must increasingly perform online data analysis and reduction — tasks that introduce algorithmic, implementation, and programming model challenges unfamiliar to many scientists and with major implications for the design and use of various elements of exascale systems. There are at least three important topics that this workshop is striving to address: (1) whether several orders of magnitude of data reduction is possible for exascale sciences; (2) understanding the performance and accuracy trade-off of data reduction; and (3) solutions to effectively reduce data while preserving the information hidden in large scientific datasets. Tackling these challenges requires expertise from computer science, mathematics, and application domains to study the problem holistically and develop solutions and robust software tools.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis workshop aims to enhance the computer networking research track within the SC and HPC communities, highlighting innovations in high-performance networking, networking testbeds, and integrated research infrastructure. A particular focus for INDIS 2025 is the emergence of intelligent networking within advanced cyber-infrastructure, including the use of AI, automation, and in-network telemetry and edge services. Moreover, the workshop aims to foster dialogue among experimental demonstrators and developers of production-quality services, while promoting the reproducibility and wide adoption of the showcased research. This workshop also brings together participants in SCinet’s Network Research Exhibitions (NRE) and Experimental Networks of the Future (XNet) teams to present papers on their latest innovations, designs, and solutions, and to showcase the next generation of networking challenges and solutions for HPC.
Exhibitor Forum
Data Analytics
Livestreamed
Recorded
TP
XO/EX
DescriptionThis presentation provides a blueprint for designing and building an "AI-ready" data center. We will dissect the modern AI and HPC data stack, from the physical infrastructure to the application layer. Learn about the key hardware and architectural considerations for supporting large-scale AI workloads, the role of a multi-vendor ecosystem, and the critical decision between cloud and on-premise deployments.
Workshop
Livestreamed
Recorded
TP
W
DescriptionDuring this talk, Wes and Kush will discuss experiences with creative integrations of containers and orchestrators such as Kubernetes with traditional batch schedulers such as Slurm (via Slinky), as well as with newer schedulers that have integration built in, such as Flux from LLNL.
We will also highlight some of the ways that LLMs can assist scientists and users with both containerization and running across a variety of schedulers, inclusive of MCP and IDE integrations.
The overall theme of our presentation is to highlight the benefits of containerization with HPC, but to also discuss some of the ideas and implementations around how to best add generative AI into both of these converging domains.
Panel
Cloud, Data Center, & Distributed Computing
SC Community Hot Topics
Livestreamed
Recorded
TP
W
TUT
XO/EX
DescriptionThis panel discussion will focus on how both the legacy and the next-gen tech of supercomputing pioneers enables future applications (such as cloud HPC, GenAI, etc.).
The interactive panel will recount the evolution of liquid cooling in supercomputing and the decades of research, expertise, and lessons learned that have now, in turn, enabled the rapid deployment of liquid cooling for AI.
Leveraging industry knowledge and hands-on experience, the panel will share best practices for those looking to introduce liquid cooling into their applications, and views on risks and opportunities for the future.
Panelists will provide insight into their learnings; "hot takes" on where technology is heading—including future requirements in terms of performance and reliability; and advice for attendees on how to overcome tomorrow's challenges for success at scale.
Workshop
Livestreamed
Recorded
TP
W
DescriptionOperational machine learning (ML) requires skills beyond model development, including infrastructure provisioning, large-scale training across clusters, model deployment with consideration of operational performance, monitoring, and automation: capabilities grounded in high-performance computing and distributed systems. This paper presents the design and infrastructure requirements of a graduate-level course on ML Systems Engineering and Operations, aimed at equipping students with these skills. Using 186,692 total compute instance hours on the Chameleon Cloud testbed, students built end-to-end ML pipelines incorporating distributed training, reproducible experiment tracking, automated re-training and re-deployment, and continuous monitoring. We analyze compute usage across assignments, compare expected versus actual resource consumption, and estimate that replicating the course on commercial cloud platforms would cost approximately $250 per student (almost $50,000 for our course with an enrollment of 191 students).
All course materials are publicly available for reuse.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe importance of high performance computing is ever increasing as a critical component of cancer research and clinical applications. The current global cancer ecosystem includes new scientific methods, AI, ever expanding sources of data, and use of simulations. These dynamic changes have set the stage for tremendous growth in HPC for cancer research and clinical application, particularly with the U.S. Cancer Moonshot 2.0 initiative, which aims to reduce mortality of cancer by 50% in 25 years. Originally established as part of SC15 during the advent of the Precision Medicine Initiative and the U.S. National Strategic Computing Initiative, this workshop provides a key venue for multiple disciplines and interests to converge, share insights, and develop collaborations in which HPC and computational approaches will advance the frontiers of cancer research and cancer care.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis workshop focuses on the development of software frameworks and workload management strategies that are crucial for quantum-HPC (Q-HPC) ecosystems. As quantum computing progresses, integrating quantum processors with HPC systems presents significant opportunities to tackle complex, large-scale problems. Experts from academia, industry, and national labs will discuss the challenges of managing hybrid resources, along with cutting-edge research on middleware, scheduling algorithms, decomposition strategies, and benchmarking methodologies for Q-HPC systems. The workshop will include keynote talks, paper presentations, panel discussions, and interactive demos to foster collaboration and advance the state of hybrid computing. By the end of the workshop, attendees will have gained valuable insights into best practices, emerging technologies, and future directions in Q-HPC integration, contributing to the broader goal of making quantum computing a practical extension of HPC environments.
Paper
Applications
GBC
Livestreamed
Recorded
TP
DescriptionA major goal of computational astrophysics is to simulate the Milky Way galaxy with sufficient resolution, down to individual stars. However, scaling such simulations is hampered by small-scale, short-timescale phenomena such as supernova explosions. We have developed a novel integration scheme for N-body/hydrodynamics simulations that works with machine learning. This approach bypasses the short timesteps caused by supernova explosions using a surrogate model, thereby improving scalability. With this method, we reached 300 billion particles using 148,900 nodes, equivalent to 7,147,200 CPU cores, breaking through the billion-particle barrier currently faced by state-of-the-art simulations. This resolution allows us to perform the first star-by-star galaxy simulation, which resolves individual stars in the Milky Way galaxy. The performance scales over 10^4 CPU cores, an upper limit of current state-of-the-art simulations, using both A64FX and x86-64 processors and NVIDIA CUDA GPUs.
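Conceptually, the surrogate bypass amounts to a per-region branch inside the global time step; the sketch below is a placeholder illustration with toy stand-ins for the physics, not the authors' code.

    def advance_galaxy(regions, dt_global, hydro_step, surrogate_step, has_supernova):
        """One global step of a hybrid scheme (conceptual sketch).

        Regions hosting a supernova would normally force very short time steps;
        here they are advanced by a surrogate model over the full dt_global,
        while all other regions take the ordinary hydrodynamics step.
        """
        for region in regions:
            if has_supernova(region):
                surrogate_step(region, dt_global)   # bypass short-timescale sub-cycling
            else:
                hydro_step(region, dt_global)

    # Toy usage with placeholder physics (illustrative only).
    regions = [{"id": 0, "sn": False, "t": 0.0}, {"id": 1, "sn": True, "t": 0.0}]
    advance_galaxy(
        regions, dt_global=1.0,
        hydro_step=lambda r, dt: r.update(t=r["t"] + dt),
        surrogate_step=lambda r, dt: r.update(t=r["t"] + dt),
        has_supernova=lambda r: r["sn"],
    )
    print(regions)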
Birds of a Feather
Democratization of HPC
Livestreamed
Recorded
TP
XO/EX
DescriptionThe U.S. National Science Foundation's vision and investment plans for cyberinfrastructure (CI) are designed to address the evolving needs of the science and engineering research community. Senior leadership and program staff from NSF’s Office of Advanced Cyberinfrastructure (OAC) will discuss OAC's vision, strategic and national priorities, as well as highlights from the latest funding opportunities across all aspects of the research cyberinfrastructure ecosystem. Substantial time will be devoted to audience Q&A between attendees and NSF staff and unstructured time to meet informally with NSF staff.
Birds of a Feather
Architectures & Networks
Livestreamed
Recorded
TP
XO/EX
DescriptionThe rapid evolution of AI workloads is pushing the boundaries of interconnect design, requiring innovative approaches to balance performance, scalability, and efficiency. This BoF will explore cutting-edge advancements in AI interconnects and their impact on performance optimization.
Discussion topics include the evolving role of open standards in AI infrastructure, scale-out vs. scale-up networking solutions for AI workloads, innovative interconnect designs that redefine performance limits, and future-proofing AI networks: what’s next?
Attendees will engage in a forward-looking discussion with industry leaders and pioneering AI infrastructure startups that will tackle the most pressing questions shaping the future of AI interconnects.
Birds of a Feather
Standards
Livestreamed
Recorded
TP
XO/EX
DescriptionPython is the most widely used programming language today—but on HPC systems, it’s often more pain than power. From slow loading on shared filesystems, to packaging headaches and incompatible frameworks, these hurdles block new users and frustrate seasoned developers alike. Meanwhile, major changes to Python’s core—Array APIs, packaging, JIT compilation, and threading—are happening without HPC at the table. This BoF brings together language leaders and HPC practitioners to change that. Join us to shape Python’s future, learn what’s coming, connect with peers, and help launch a new HPC working group to influence Python from the inside out.
Birds of a Feather
Community Meetings
Livestreamed
Recorded
TP
XO/EX
DescriptionWith power being a first-order design constraint on par with performance, it is important to measure and analyze energy-efficiency trends in supercomputing. To raise the awareness of greenness as a first-order design constraint, the Green500 seeks to characterize the energy efficiency of supercomputers for different metrics, workloads, and methodologies. This BoF discusses trends across the Green500 and highlights from the current Green500 list. In addition, the Green500, Top500, and Energy Efficient HPC Working Group have been working together on improving power-measurement methodology, and this BoF presents recommendations for changes that will improve ease of submission without compromising accuracy.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionIn recent years, the RISC-V vector extension (RVV) has attracted increasing attention. The RVV allows programs to be executed on processors with various maximum vector lengths (MVLs). Consequently, even when running the same program, the memory access pattern may vary depending on the MVL of the processor, potentially leading to changes in the optimal cache management technique. In this poster, we focus on replacement policies and the use of non-temporal hints. We execute the same program on processors with different MVLs and compare multiple cache management techniques. The results demonstrate that the optimal cache management technique can vary with the MVL. This finding highlights the necessity of selecting cache management techniques while taking the MVL into account when utilizing RVV. In the poster session, we will explain these research findings using charts that illustrate the performance results.
Birds of a Feather
State of the Practice
Livestreamed
Recorded
TP
XO/EX
DescriptionThe convergence of traditional HPC simulations and large-scale AI is reshaping data center infrastructure. This session addresses the unique challenges of supporting both modeling and machine learning workloads on a unified platform. Key topics include navigating the demands of specialized hardware like GPUs and custom accelerators, managing complex software stacks with containers and workflow managers, and optimizing high-performance storage for diverse I/O patterns. We'll also discuss scheduling strategies for fair and efficient resource allocation. Join this interactive discussion to share your experiences, challenges, and solutions for building and managing a truly converged HPC and AI infrastructure.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe describe a new end-to-end experimental data streaming framework designed from the ground up to support new types of applications – AI training, extremely high-rate X-ray time-of-flight analysis, crystal structure determination with distributed processing, and custom data science applications and visualizers yet to be created. Throughout, we use design choices merging cloud microservices with traditional HPC batch execution models for security and flexibility. This project makes a unique contribution to the DOE Integrated Research Infrastructure (IRI) landscape. By creating a flexible, API-driven data request service, we address a significant need for high-speed data streaming sources for the X-ray science data analysis community. With the combination of data request API, mutual authentication web security framework, job queue system, high-rate data buffer, and complementary nature to facility infrastructure, the LCLStreamer framework has prototyped and implemented several new paradigms critical for future generation experiments.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis paper presents an analysis of memory hierarchy latency across AMD Instinct™ MI300A, MI300X, and MI250X GPUs using a fine-grained pointer-chasing microbenchmark. We characterize the scalar L1 (sL1), L2, AMD Infinity Cache™ referred to as the MALL (Memory Attached Last Level), and HBM (High Bandwidth Memory), revealing distinct latency levels and architectural trade-offs. MI300A and MI300X, based on the CDNA3 architecture, exhibit nearly identical latency profiles, while MI250X lacks a MALL, resulting in different performance characteristics. Memory latency remains consistent across compute partitioning modes, but NUMA Partitioning per Socket (NPS) significantly impacts performance. In NPS4 mode, partitioning improves locality, reducing latency by up to 1.42× in MALL and 1.31× in HBM. We further analyze MALL contention and Translation Lookaside Buffer (TLB) behavior under varying parallelism levels, identifying conditions where MALL performance degrades. These findings provide actionable insights for optimizing memory access patterns and improving performance on AMD’s latest GPU architectures.
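The access pattern of such a microbenchmark is easy to sketch: build a random cyclic permutation and follow it so that every load depends on the previous one. The Python below only illustrates that pattern; measuring actual cache and HBM latencies requires the native GPU kernel described above, and the timing here is interpreter-dominated.

    import numpy as np
    import time

    def make_chain(n_elems, seed=0):
        """Random cyclic permutation: chain[i] holds the index of the next hop."""
        rng = np.random.default_rng(seed)
        order = rng.permutation(n_elems)
        chain = np.empty(n_elems, dtype=np.int64)
        chain[order[:-1]] = order[1:]
        chain[order[-1]] = order[0]
        return chain

    def chase(chain, hops):
        idx = 0
        t0 = time.perf_counter()
        for _ in range(hops):
            idx = chain[idx]          # each load depends on the previous one
        dt = time.perf_counter() - t0
        return dt / hops, idx

    per_hop, _ = chase(make_chain(1 << 20), hops=100_000)
    print(f"{per_hop * 1e9:.1f} ns per dependent access (interpreter-dominated)")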
SCinet Network Research Exhibition
Not Livestreamed
Not Recorded
DescriptionNRI103,NRI104,NRI106
Birds of a Feather
Artificial Intelligence & Machine Learning
Livestreamed
Recorded
TP
XO/EX
DescriptionThe confluence of HPC, AI, and cloud is entering a new phase, catalyzed by the integration of AI and its influence on HPC hardware. Scientific workflows are evolving to treat simulation, AI, and analytics as a deeply connected continuum. In this session, we’ll explore how the need to incorporate large-scale AI models and agentic systems—from training on scientific data to real-time inferencing—is creating new patterns of hybrid HPC-cloud usage. We will also discuss the architectural, software, and policy challenges and opportunities facing HPC professionals in a world where AI and Cloud are now driving the evolution of high performance computing infrastructure.
Panel
Architectures
Performance Evaluation, Scalability, & Portability
SC Community Hot Topics
Livestreamed
Recorded
TP
DescriptionWith Moore's law nearing its end, hardware specialization is becoming crucial for continued HPC advancement. Quantum computing (QC) is gaining significant attention due to its potential to dramatically accelerate diverse HPC workloads, from materials science and chemistry to high-energy physics and cryptography, to name a few. However, with hardware and respective software infrastructure under development, when and how to integrate QC into the HPC ecosystem poses fundamental challenges—from logistics to support. Realizing quantum-accelerated HPC requires closely integrating QC with classical CPU- and GPU-accelerated computing and carefully considering this new computing paradigm's unique advantages and limitations. Together, HPC centers, quantum hardware providers, and quantum software specialists are collaborating to solve integration challenges in middleware, job schedulers, and co-processing models to develop tools for workflow management, and also address the growing need for user awareness and training.
Exhibitor Forum
Quantum & Other Post Moore Computing Technologies
Livestreamed
Recorded
TP
XO/EX
DescriptionNeutral-atom quantum computing presents a compelling path toward scalable, fault-tolerant systems, with advantages in qubit count, reconfigurable connectivity, and manufacturability. This presentation details QuEra’s roadmap from today’s processors toward large-scale error-corrected architectures capable of deep integration with high performance computing (HPC) environments. We outline progress in increasing physical qubit numbers, implementing mid-circuit measurement and feed-forward, and developing modular architectures for millions of qubits. Emphasis will be placed on resource estimates for logical qubits, error-correction overheads, and co-design approaches that align quantum hardware evolution with HPC workflows. By exploring use cases in quantum simulation, optimization, and machine learning, we highlight how fault-tolerant quantum systems can augment classical supercomputers, accelerating solutions to problems beyond classical reach. The talk will map a technically credible path from today’s systems to tomorrow’s exascale-class quantum-HPC hybrid platforms.
Exhibitor Forum
Data Analytics
Livestreamed
Recorded
TP
XO/EX
DescriptionIndustrial engineering is undergoing a significant transformation, propelled by advancements in accelerated computing, AI physics, and digital twins. This presentation will detail how these hardware and software innovations are helping to compress engineering design cycles and accelerate research and development.
We will take a deep dive into mixed-precision techniques, new ML architectures, emerging quantum computing applications, and the evolution of GPU hardware. Through real-world use cases from the automotive, aerospace, and semiconductor industries, this session will provide the HPC community with actionable insights into how AI-driven simulation and accelerated computing are fundamentally reshaping the industrial engineering landscape.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis workshop brings together HPC researchers, operators, and vendors from around the globe to present and discuss state-of-the-art HPC system testing methodologies, tools, benchmarks, procedures, and best practices. The increasing complexity of HPC architectures and growing need to leverage HPC for integrated workflows necessitates more extensive testing than ever in order to thoroughly evaluate the status of the system after installation or a software upgrade and ensure proper operation before it is transitioned to production. Different methodologies are used to evaluate systems during their lifetime, not only at the beginning during the installation, but also during maintenance windows and alongside regular operations. This workshop provides a venue to present and discuss the latest HPC system testing technologies and methodologies. The event will include an opening talk focused on current HPC system testing topics, followed by a series of paper presentations from peer-reviewed accepted submissions, and concluding with a panel discussion.
Paper
Energy Efficiency
Performance Measurement, Modeling, & Tools
Power Use Monitoring & Optimization
State of the Practice
Livestreamed
Recorded
TP
DescriptionHigh performance computing (HPC) systems are becoming increasingly water-intensive due to their reliance on water-based cooling and the energy used in power generation. However, the water footprint of HPC remains relatively underexplored—especially in contrast to the growing focus on carbon emissions. In this paper, we present ThirstyFLOPS, a comprehensive water footprint analysis framework for HPC systems. Our approach incorporates region-specific metrics, including water usage effectiveness, power usage effectiveness, and energy water factor, to quantify water consumption using real-world data. Using four representative HPC systems—Marconi, Fugaku, Polaris, and Frontier—as examples, we provide implications for HPC system planning and management. We explore the impact of regional water scarcity and nuclear-based energy strategies on HPC sustainability. Our findings aim to advance the development of water-aware, environmentally responsible computing infrastructures.
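One common way to combine such metrics is to charge on-site water through WUE and off-site water through EWIF applied to total facility energy via PUE; the sketch below uses that decomposition with invented numbers, and the paper's exact model may differ.

    # Illustrative water-footprint estimate for an HPC system.
    # The decomposition (on-site via WUE, off-site via EWIF x PUE) and all
    # numeric factors are example assumptions, not results from the paper.
    it_energy_mwh = 10_000.0   # annual IT energy [MWh]
    pue  = 1.2                 # power usage effectiveness
    wue  = 0.4                 # on-site water usage effectiveness [L per kWh of IT energy]
    ewif = 1.8                 # energy water intensity factor of the grid [L per kWh]

    it_energy_kwh = it_energy_mwh * 1_000
    onsite_l  = wue * it_energy_kwh                 # cooling water consumed on site
    offsite_l = ewif * pue * it_energy_kwh          # water embedded in electricity generation
    print(f"on-site: {onsite_l/1e6:.1f} ML, off-site: {offsite_l/1e6:.1f} ML, "
          f"total: {(onsite_l + offsite_l)/1e6:.1f} ML per year")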
Workshop
Livestreamed
Recorded
TP
W
DescriptionModern computing systems face significant security challenges. While vulnerabilities in CPUs have been extensively studied, GPUs, an increasingly important component of today's computing platforms, have received much less attention. In this talk, I will present our recent studies that aim to bridge this gap. In the first part, I will discuss our findings on GPU memory management systems and demonstrate how weaknesses in their design can be exploited to compromise GPU applications and, in some cases, even CPU applications. In the second part, I will introduce hardware side channels on modern GPUs and show that, despite the adoption of hardware isolation mechanisms, powerful side-channel attacks can still be launched, which pose serious privacy risks to applications such as video games. Finally, I will conclude the talk with a brief discussion of potential countermeasures and directions for future research in GPU security.
Paper
HPC for Machine Learning
State of the Practice
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionGraph partitioning is essential for effectively processing large-scale graphs in distributed computing systems. However, traditional graph partitioning strategies frequently lead to elevated communication costs, particularly within distributed computing systems that utilize thousands of computing nodes. This is because prior partitioning methods fail to consider the variations in communication costs across the communication hierarchies. We propose TianheEngine for leveraging the communication hierarchy among distributed computing systems containing thousands of computing nodes. TianheEngine introduces an adaptive, communication hierarchy-aware methodology to partition and distribute large graphs across computing nodes. It exploits the communication hierarchy of the underlying distributed computing system and the sparsity characteristics of the input graphs to improve communication efficiency. We evaluated TianheEngine on fundamental graph operations using both synthetic and real-world datasets. Experimental results show that TianheEngine is superior to state-of-the-art graph partitioning methods and parallel graph systems and outperforms top-ranked systems on the latest Graph 500 list.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionAccurate forecasting of water levels is essential for flood mitigation. Traditionally, predictions have been based on harmonic analysis and sensor networks maintained by the National Oceanic and Atmospheric Administration. However, these methods struggle with high-variance events that shift water levels away from the long-term tidal baseline. TidalMark evaluates the ability of a variety of deep learning models to capture these high-variance events. Through extensive hyperparameter sweeps and comparisons across model variants, we have evaluated trade-offs in accuracy, generalization, and scalability. Our results show that properly tuned machine learning models consistently outperform the standard harmonic approaches by 2.1x to 4.7x for one- to seven-day predictions, moving toward adaptive, scalable, and accurate forecasting of coastal water levels.
Early Career
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionThis session will cover systems and tools for time and task management in a fast-paced workplace. Participants will learn from others and share what works for them.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionThis work presents a time-stepping Hamiltonian simulation framework for nonlinear PDEs based on a hybrid quantum–classical approach. Using warped phase transform (WPT)–based Schrödingerization, spatial discretizations are reformulated as Hermitian/anti-Hermitian operators for Schrödinger-type equations, enabling unitary propagation even for dissipative systems. In contrast to traditional linearizations (Carleman, KvN) that cause exponential statevector growth and truncation errors on NISQ hardware, we update the nonlinear terms classically at each step, incorporate them into the Hamiltonian, and propagate the state by unitary evolution on the quantum circuit. This local linear approximation over small time intervals prevents dimensional inflation while preserving accuracy. We implement the framework in Qiskit and evaluate it with the Qiskit Aer statevector simulator on linear advection–diffusion and nonlinear problems including Burgers and Allen–Cahn phase field models. The results show good agreement with classical solutions, highlighting its potential for efficiently simulating nonlinear dynamics without dimensional inflation.
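A purely classical emulation of the loop (re-evaluate the nonlinear term, rebuild the Hamiltonian, take one unitary step) can be sketched with SciPy's matrix exponential; the operators below are toy placeholders rather than the Schrödingerized systems studied in this work.

    import numpy as np
    from scipy.linalg import expm

    def step(psi, build_hamiltonian, dt):
        """One hybrid step: rebuild H from the classically evaluated nonlinear
        term, then propagate unitarily with exp(-i H dt)."""
        H = build_hamiltonian(psi)                 # Hermitian in this sketch
        return expm(-1j * H * dt) @ psi

    # Toy nonlinear model: diagonal potential depends on the current probabilities.
    def build_hamiltonian(psi, coupling=0.5):
        n = len(psi)
        kinetic = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
        potential = np.diag(coupling * np.abs(psi) ** 2)   # nonlinearity refreshed each step
        return kinetic + potential

    psi = np.zeros(8, dtype=complex)
    psi[4] = 1.0
    for _ in range(100):
        psi = step(psi, build_hamiltonian, dt=0.05)
    print("norm preserved:", np.isclose(np.linalg.norm(psi), 1.0))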
Workshop
short paper
Livestreamed
Recorded
TP
W
DescriptionModern scientific instruments generate data at rates that increasingly outpace local compute capabilities, making traditional file-based workflows inadequate for time-sensitive analysis and experimental steering. Real-time streaming frameworks promise lower latency and improved efficiency, but lack a principled feasibility assessment. We introduce a quantitative framework and accompanying Streaming Speed Score to evaluate if remote high-performance computing (HPC) resources can provide timely data processing compared to local alternatives. Our model incorporates key parameters including data generation rate, transfer efficiency, remote processing power, and file I/O overhead to compute total processing completion time (Tpct) and identify regimes where streaming is beneficial. We validate our approach through case studies from facilities such as APS, FRIB, LCLS-II, and the LHC. Our measurements show streaming can achieve up to 97% lower end-to-end completion time than file-based methods under high data rates, while worst-case congestion can increase transfer times by over an order of magnitude.
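A hypothetical instantiation of such a comparison is sketched below; the parameter names follow the abstract, but the formulas are illustrative assumptions rather than the paper's model.

    def tpct_file_based(data_gb, gen_rate, disk_bw, wan_bw, compute_rate, io_overhead_s):
        """Sequential file workflow: write locally, transfer, read, then process."""
        return (data_gb / gen_rate + data_gb / disk_bw + io_overhead_s
                + data_gb / wan_bw + data_gb / disk_bw + data_gb / compute_rate)

    def tpct_streaming(data_gb, gen_rate, wan_bw, transfer_eff, compute_rate):
        """Streaming workflow: generation, transfer, and remote compute overlap,
        so completion time is dominated by the slowest stage."""
        effective_bw = wan_bw * transfer_eff
        return data_gb / min(gen_rate, effective_bw, compute_rate)

    # Example numbers (GB and GB/s) are made up for illustration.
    args = dict(data_gb=500.0, gen_rate=5.0, compute_rate=20.0)
    t_file = tpct_file_based(disk_bw=2.0, wan_bw=10.0, io_overhead_s=120.0, **args)
    t_stream = tpct_streaming(wan_bw=10.0, transfer_eff=0.8, **args)
    print(f"file-based: {t_file:.0f} s, streaming: {t_stream:.0f} s, "
          f"speed ratio ~ {t_file / t_stream:.1f}x")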
Workshop
Livestreamed
Recorded
TP
W
DescriptionLarge Language Model (LLM) training workloads share computational characteristics with high-performance computing applications, requiring intensive parallel processing, complex matrix operations, and distributed computing with frequent synchronization, all of which call for specialized hardware to deliver optimal performance.
This talk presents insights from Vela, a cloud-native system architecture introduced in 2021 for LLM training using commercial hardware and open-source software. The Vela architecture combines off-the-shelf hardware, Linux KVM virtualization with PCIe passthrough, and virtualized RDMA over Converged Ethernet networks. The system employs software-defined networking with SRIOV technology for GPU Direct RDMA, achieving near-bare-metal performance while maintaining virtualization benefits.
Based on multiple data center deployments and iterations, we present two case studies examining what it takes for virtualization-based systems to deliver (a) bare-metal RoCE-like performance and (b) bare-metal InfiniBand-like performance for LLM training workloads. The discussion focuses on virtualization challenges, experiences, and runtime optimizations required for optimal performance in cloud-native training infrastructure.
Tutorial
Livestreamed
Recorded
TUT
DescriptionHigh performance computing and machine learning applications increasingly rely on mixed-precision arithmetic on CPUs and GPUs for superior performance. However, this shift introduces several challenging numerical issues such as increased round-off errors, and INF and NaN exceptions that can render the computed solutions useless. At present, this places a heavy burden on developers, interrupting their work while they diagnose these problems manually. This tutorial presents three tools that target specific issues leading to floating-point bugs. First, we present FPChecker, which not only detects and reports INF/NaN exceptions in parallel and distributed CPU codes, but also tells programmers about the exponent value ranges for avoiding exceptions while also minimizing rounding errors. Second, we present GPU-FPX, which detects floating-point exceptions generated by NVIDIA GPUs, including their Tensor Cores via a "nixnan" extension to GPU-FPX. Third, we present FloatGuard, a unique tool that detects exceptions in AMD GPUs. The tutorial is aimed at helping programmers avoid exception bugs; for this, we will demonstrate our tools on simple examples with seeded bugs. Attendees may optionally install and run our tools. The tutorial also allocates question/answer time to address real situations faced by the attendees.
Birds of a Feather
Community Meetings
Livestreamed
Recorded
TP
XO/EX
DescriptionThe TOP500 list of supercomputers serves as a “Who’s Who” in the field of high performance computing (HPC). It started as a list of the most powerful supercomputers in the world and has evolved to a major source of information about trends in HPC. The 66th TOP500 list will be published in November 2025, just in time for SC25.
This BoF will present detailed analyses of the TOP500 and discuss the changes in the HPC marketplace during the past years. The BoF is meant as an open forum for discussion and feedback between the TOP500 authors and the user community.
Workshop
Livestreamed
Recorded
TP
W
DescriptionIn higher education, there is a growing need for reproducible exercise environments that facilitate teaching of HPC and data science. While JupyterHub can provide students with a consistent environment, challenges remain in operating multiple exercises simultaneously and in building and maintaining the JupyterHub system itself. To address these issues, we have developed MCJ-CloudHub, which is designed to support the concurrent operation of multiple exercises through integration with Moodle and JupyterHub and to enable exercises that use GPU computing. Using Virtual Cloud Provider (VCP) technology, MCJ-CloudHub can be deployed flexibly across both on-premises and cloud environments.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionWe present a WebGPU-based framework for real-time visualization of large-scale protein–protein interaction (PPI) networks directly in standard browsers. Built on GraphWaGu, our extended graph rendering API integrates GPU-accelerated force-directed layout computation with dynamic edge filtering and degree-based visual encoding. Users can adjust parameters such as confidence thresholds, iterations, and cooling factors, with layout updates. This approach sustains high frame rates for networks with millions of edges, mitigates the hairball effect, and enables biologists to explore complex PPI networks efficiently, intuitively, and without specialized software installation.
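The per-frame computation that such a layout engine offloads to the GPU is essentially a force-directed iteration; below is a NumPy sketch of one Fruchterman–Reingold-style step with illustrative parameters (the actual framework runs this in WebGPU compute shaders).

    import numpy as np

    def fr_step(pos, edges, area=1.0, cooling=0.95, step=0.05):
        """One Fruchterman-Reingold-style layout iteration."""
        n = len(pos)
        k = np.sqrt(area / n)                       # ideal edge length
        delta = pos[:, None, :] - pos[None, :, :]   # pairwise displacements
        dist = np.linalg.norm(delta, axis=-1) + 1e-9
        repulse = (k * k / dist)[:, :, None] * delta / dist[:, :, None]
        disp = repulse.sum(axis=1)
        for i, j in edges:                          # attraction along edges
            d = pos[i] - pos[j]
            f = np.linalg.norm(d) / k
            disp[i] -= f * d
            disp[j] += f * d
        return pos + step * cooling * disp

    pos = np.random.default_rng(0).random((5, 2))
    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
    for _ in range(50):
        pos = fr_step(pos, edges)
    print(pos.round(3))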
Workshop
Livestreamed
Recorded
TP
W
DescriptionRobust execution environments for quantum computing can aid the industry with key challenges like application development, portability, and reproducibility, and help unlock the development of more modular quantum programs, driving forward hybrid quantum workflows.
In this work, we show progress towards a basic, but portable, runtime environment for developing and executing hybrid quantum-classical programs running in High Performance Computing environments enhanced with Quantum Processing Units (QPUs). The middleware includes a second layer of scheduling after the main HPC resource manager in order to improve the utilization of the QPU, and extra functionality for observability, monitoring, and admin access.
We show how this allows us to manage several programming Software Development Kits as first-class citizens in the environment by building on a recently proposed vendor-neutral Quantum Resource Management Interface (QRMI). Lastly, we discuss and show a solution for the monitoring and observability stack, completing our description of the hybrid system architecture.
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
DescriptionScientific data management requires researchers to navigate fragmented toolchains spanning data gathering, resource allocation, application deployment, and analysis. While Large Language Models offer natural language interfaces for HPC tasks, existing approaches suffer from system-specific training dependencies and lack standardized tool integration. We present IOWarp-mcps, a comprehensive suite of Model Context Protocol (MCP) tools enabling AI-driven scientific data management across complete workflows. Our framework addresses large-scale scientific datasets through two core principles: chunked I/O access for memory-efficient data partitioning and label-based filtering for selective data reduction before model ingestion. We evaluate IOWarp-mcps across three scenarios: automated dataset discovery from the National Data Platform, molecular dynamics trajectory analysis from LAMMPS simulations, and parallel I/O benchmark deployment. Results demonstrate significant productivity improvements, with configuration tasks reduced from 5-10 minutes to approximately one minute. IOWarp-mcps bridges the gap between conversational AI and scientific computing, providing intuitive interfaces for complex data management operations.
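A minimal sketch of the two principles named above — chunked I/O and label-based filtering before ingestion — using plain pandas rather than the IOWarp-mcps tools; the file name and label column are hypothetical.

```python
# Chunked I/O keeps memory bounded; label-based filtering reduces the data
# before anything is handed to a model or analysis step.
import pandas as pd

selected = []
for chunk in pd.read_csv("simulation_metadata.csv", chunksize=100_000):
    selected.append(chunk[chunk["label"] == "trajectory_of_interest"])
reduced = pd.concat(selected, ignore_index=True)
print(len(reduced), "rows retained for ingestion")
```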
Workshop
Debugging & Correctness Tools
HPC Software & Runtime Systems
Livestreamed
Recorded
TP
W
DescriptionWe present early work towards establishing an automated workflow for numerical error analysis of SYCL kernels. The method leverages the OpenCL Intercept Layer to record the execution of GPU-targeted SYCL kernels and replay them with CPU runtimes, enabling detailed floating-point error evaluation without modifying the original application. We analyze the force kernel from a large-scale cosmology application code (HACC) using PoCL and Verificarlo, exploring both IEEE-compliant configurations and reduced-precision modes (e.g., FP16, TF32, BF16), as well as the effects of the -ffast-math compiler optimization using stochastic arithmetic via MCA and PRISM. By leveraging open tools and standards, this work contributes a reusable path toward broader adoption of numerical accuracy evaluation in HPC.
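The effect being measured can be previewed with a toy example in plain NumPy (not the PoCL/Verificarlo workflow): accumulating many small contributions in FP16 versus FP64 gives visibly different sums once the increments fall below the accumulator's ulp.

```python
# Toy illustration of precision-dependent accumulation error; the real study
# replays SYCL kernels under stochastic arithmetic (MCA) instead.
import numpy as np

contribs = np.full(10_000, 1e-3)
reference = contribs.sum(dtype=np.float64)      # ~10.0 in double precision

acc = np.float16(0.0)
for c in contribs.astype(np.float16):
    acc += c                                    # stalls once 1e-3 < ulp(acc)
print(reference, float(acc))                    # float16 sum stagnates well below the reference
```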
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionModern HPC systems generate large amounts of GPU and network telemetry, typically used for system health monitoring. At NERSC, we are developing a Performance API/UI that generates a job report card from this telemetry, providing an overview of performance characteristics. Using DCGM counters, we report GPU memory, compute, and power usage, and present preliminary investigations of job-level network activity. Without traditional profiling tools, this application-agnostic approach helps identify resource utilization imbalances, detect anomalies such as memory leaks, and assess overall performance for the user without additional effort.
Paper
Applications
Architectures & Networks
HPC for Machine Learning
Livestreamed
Recorded
TP
DescriptionEfficient LLM inference remains challenging due to the autoregressive decoding process, which generates only one token at a time. Speculative decoding has been introduced to address the limitation by using small speculative models (SSMs) to speed up LLM inference. However, the low acceptance rate of SSMs and the high verification cost of LLMs prohibit further performance improvement. In this paper, we present Smurfs, an LLM inference system designed to accelerate LLM inference through collective and adaptive speculative decoding. Smurfs adopts a majority-voted mechanism that harnesses multiple SSMs to collaboratively predict LLM outputs in multi-task scenarios. It also decouples SSM speculation from LLM verification and uses a pipelined execution to reduce the latency of SSM speculation. Additionally, Smurfs proposes a mechanism to dynamically determine the optimal speculation length of SSM at runtime. The experimental results demonstrate the superiority of Smurfs in terms of inference throughput and latency compared to state-of-the-art systems.
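A toy sketch (not the Smurfs implementation) of the two ideas in the abstract: majority voting across several small draft models, then accepting the longest draft prefix that agrees with the large model's verification.

```python
# Majority-voted drafting plus prefix acceptance, with token IDs as plain ints.
from collections import Counter

def majority_draft(ssm_outputs):
    """ssm_outputs: list of token lists, one per small speculative model."""
    length = min(len(t) for t in ssm_outputs)
    return [Counter(t[i] for t in ssm_outputs).most_common(1)[0][0]
            for i in range(length)]

def accepted_prefix(draft, verified):
    """Keep draft tokens until the first disagreement with the LLM."""
    out = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        out.append(d)
    return out

draft = majority_draft([[5, 9, 2, 7], [5, 9, 3, 7], [5, 9, 2, 1]])
print(accepted_prefix(draft, verified=[5, 9, 2, 4]))   # -> [5, 9, 2]
```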
Workshop
Livestreamed
Recorded
TP
W
DescriptionEfficient graph processing is essential for a wide range of applications.
Scalability and memory access patterns are still a challenge, especially with the Breadth-First Search algorithm. This work focuses on leveraging multi-GPU HPC nodes with peer-to-peer support of the Intel oneAPI implementation of SYCL.
We propose three GPU-based load-balancing methods: work-group localisation for efficient data access, even workload distribution for higher GPU occupancy, and a hybrid strided-access approach for heuristic balancing. These methods ensure performance, portability, and productivity with a unified codebase.
Our proposed methodologies outperform state-of-the-art single-GPU CUDA implementations on synthetic RMAT graphs. We analysed BFS performance across NVIDIA A100, Intel Max 1550, and AMD MI300X GPUs, achieving a peak performance of 153.27 GTEPS on an RMAT25-64 graph using 8 NVIDIA A100 GPUs. Furthermore, our work handles RMAT graphs up to scale 29, achieving superior performance on synthetic graphs and competitive results on real-world datasets.
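For reference, the level-synchronous BFS that the load-balancing schemes distribute is sketched serially below; on GPUs, each frontier expansion becomes one parallel kernel launch whose work the paper balances across devices.

```python
# Serial reference for level-synchronous BFS: each level expands the current
# frontier; the per-node loop is what gets parallelised and load-balanced.
def bfs_levels(adj, source):
    dist = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        for u in frontier:                 # distributed across GPUs in the paper
            for v in adj.get(u, []):
                if v not in dist:
                    dist[v] = level
                    next_frontier.append(v)
        frontier = next_frontier
    return dist

print(bfs_levels({0: [1, 2], 1: [3], 2: [3], 3: []}, source=0))
```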
Workshop
Livestreamed
Recorded
TP
W
DescriptionKubernetes is the de facto standard for container orchestration but was not designed for hostile multi-tenancy. Native constructs such as namespaces, role-based access control, and admission controllers provide logical separation but lack the strong isolation required in adversarial environments. This paper presents a Kubernetes-compatible architecture that integrates per-tenant virtual control planes, hypervisor-backed sandboxes, and automated policy enforcement to achieve secure multi-tenancy. Each tenant receives a dedicated virtual control plane (via vCluster) linked to a virtual node that schedules workloads into VM-based sandboxes (Azure Container Instances), preserving the Kubernetes API experience. A policy engine (Kyverno) hardens namespaces by enforcing network segmentation, resource limits, and strict security contexts at admission time. Evaluation demonstrates that this approach delivers strong inter-tenant isolation with negligible performance overhead, providing a practical model for zero-trust container orchestration in hostile cloud and edge environments.
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
DescriptionWell calibrated mathematical and computational models enable the prediction and control of complex systems. These models can be utilized to design engineering systems or to develop treatment protocols. In contrast to one-size-fits-all approaches that seek to mitigate risk at the population level, digital twins enable personalized modeling that seeks to improve decisions at the level of the individual to improve cohort outcomes. This tailored approach is crucial in applications such as precision oncology. In particular, high-grade gliomas exhibit significant heterogeneity in physiology and response to treatment that results in low median survival rates despite an aggressive standard of care.
We develop a computational pipeline that utilizes longitudinally collected MRI data to generate a patient-specific computational geometry and estimate the tumor cellularity. The data are then used to inform the spatially varying parameters of mathematical models for tumor growth through the solution of an inverse problem. The high-consequence nature of downstream decisions prompts a rigorous approach to uncertainty quantification. We utilize a Bayesian framework with a focus on scalable and efficient methods to characterize the uncertainty in the model inputs from the sparse, noisy imaging data. Furthermore, we show promising results for therapy planning using a risk-based formulation for optimization under uncertainty.
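For context, the calibration step follows the standard Bayesian form, stated generically here rather than as the paper's specific likelihood or prior: the imaging data update a prior over the spatially varying model parameters.

```latex
% Generic Bayesian update used for model calibration: \theta denotes the
% tumor-growth model parameters and d the (sparse, noisy) imaging data.
\[
\pi_{\text{post}}(\theta \mid d) \;\propto\; \pi_{\text{like}}(d \mid \theta)\,\pi_{\text{prior}}(\theta)
\]
```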
Workshop
Livestreamed
Recorded
TP
W
DescriptionAs quantum computers grow in qubit count and fidelity, translating applications into hardware-specific instructions becomes essential. Intermediate representations (IRs) help optimize this process. One such IR is Microsoft’s Quantum Intermediate Representation (QIR), built on the LLVM compiler framework. This article explores various ways QIR can be integrated into quantum computing workflows. It demonstrates how to convert an existing quantum circuit simulator into a QIR runtime, showing that the transition is straightforward and does not compromise performance. In fact, adopting QIR enables advanced features like classical control flow, which is crucial for testing quantum error correction protocols. The implementation is open-source and available at https://github.com/cda-tum/mqt-core, and the article concludes with future directions for QIR development.
Paper
Performance Measurement, Modeling, & Tools
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionTrace analysis of large-scale parallel applications is crucial for understanding and optimizing performance. It primarily focuses on the interaction behaviors between different parallel processes, such as synchronization waits and asynchronous overlaps. The trace size explodes as the parallel scale of applications grows; thus, current methods analyze traces in parallel to ensure analysis speed. However, due to the interaction pattern-agnostic trace distribution, they often introduce inter-process communications to fetch non-local event data during interaction analysis, leading to excessively long trace analysis time.
To address this issue, we propose TraceFlow, a trace analysis tool for large-scale parallel applications, which achieves a nearly communication-free analysis through an interaction pattern-aware trace distribution strategy. We evaluate the efficiency of TraceFlow on widely used benchmarks and several real-world applications with up to 8,192 processes. Experimental results show that TraceFlow achieves an average speedup of 13.49× in the analysis time compared to the state-of-the-art approaches.
Workshop
Livestreamed
Recorded
TP
W
DescriptionGlobal file systems, whose access spans multiple systems/sub-systems within a High-Performance Computing (HPC) center, are common at many institutions due to a range of benefits they provide. In the vast majority of cases, however, they operate under a single authentication domain or, in more complex cases, support multiple domains with each getting siloed data access. Recently at the National Center for Supercomputing Applications (NCSA), we have integrated a cluster that pushed us to engineer a solution that provides users the ability to seamlessly access their data, regardless of which authentication domain the system is tied to.
This paper describes the design, technologies, and processes NCSA architected to deliver this capability to researchers and shares the suggested practices we’ve discovered while operating it. Additionally, we lay out the benefits researchers gained by us providing this level of integration between the different authentication domains at the global file system layer.
Paper
Applications
Livestreamed
Recorded
TP
DescriptionStructure-based virtual screening confronts a grand challenge in scaling to trillion-ligand libraries for drug discovery. We present SWDOCKP², a performance-portable virtual screening framework achieving 1.9 trillion ligand-receptor pairs daily across eight targets on the Sunway OceanLight supercomputer with 40 million cores—10× faster than the prior state of the art. Key innovations combine (1) a ligand database optimizer with conformational sorting and merging, (2) multi-receptor grid alignment enabling parallel target screening and SIMD-accelerated trilinear interpolation, and (3) a Sunway architecture emulator for cross-platform efficiency. These advancements bridge computational scalability with novel drug discovery demands, offering a blueprint for next-generation supercomputing in structure-based drug design. Additionally, SWDOCKP² will generate an unprecedented dataset of predicted protein-ligand interactions, creating a transformative resource for machine learning applications. By addressing experimental data scarcity, this dataset empowers accurate ligand prediction, generative chemistry, and AI-driven drug discovery.
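As background for the grid-based scoring step mentioned above, the sketch below shows standalone trilinear interpolation of a precomputed receptor energy grid at a ligand atom's fractional coordinates; this is the standard textbook operation, not the SIMD-accelerated SWDOCKP² code, and the grid values are random placeholders.

```python
# Trilinear interpolation on a 3D grid at fractional coordinates (x, y, z).
import numpy as np

def trilinear(grid, x, y, z):
    x0, y0, z0 = int(x), int(y), int(z)
    dx, dy, dz = x - x0, y - y0, z - z0
    value = 0.0
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                w = ((dx if i else 1 - dx) *
                     (dy if j else 1 - dy) *
                     (dz if k else 1 - dz))
                value += w * grid[x0 + i, y0 + j, z0 + k]
    return value

grid = np.random.rand(8, 8, 8)      # placeholder receptor energy grid
print(trilinear(grid, 2.3, 4.7, 1.5))
```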
Panel
AI, Machine Learning, & Deep Learning
SC Community Hot Topics
Scientific & Information Visualization
Livestreamed
Recorded
TP
DescriptionThis panel will discuss the role of uncertainty in HPC from the perspective of predictive simulation and data-driven modeling, with a focus on future scientific workloads and interpretable AI for science. Why is treatment of uncertainty a necessity for robust prediction? What are the particular challenges and opportunities for probabilistic methods in ModSim at exascale? How can uncertainty quantification be a scaffold for scientific AI/ML, and what are the pitfalls? This discussion will lay the foundation for future work in HPC co-design at the interface of theory, software, and hardware optimization as we prepare for new paradigms of predictive modeling and simulation in the era of AI.
Birds of a Feather
Architectures & Networks
Security & Privacy
Livestreamed
Recorded
TP
XO/EX
DescriptionWe explore the multifaceted aspects of managing risk for research projects involving sensitive data and AI models which depend deeply on supercomputing infrastructures. This risk spectrum spans traditional technical cyber controls as well as policy and sociological (including human factors) risks. In the context of multi-facility, multi-institutional workflows such as IRI and the American Science Cloud (AmSC), our goal is to advance progress in developing secure and trustworthy infrastructures for AI and integrated science.
Three intertwined challenges emerge as we advance this vision for IRI and AmSC: technological, policy, and sociological. With the rise of AI and the increasing use of sensitive data for training models, our goal is to leverage this BoF to build a community of practice that will advance a secure and trusted research environment (TRE) that addresses challenges in all three domains. How do we best achieve a TRE that is transparent, reproducible, ethical, secure, worthwhile, and collaborative, with clear data provenance and assurance? How might trust be rightfully earned and retained across modern workflows through managed risk and secure governance?
The intended outcomes of this BoF are to: (1) explore TRE challenges in the age of AI and science integration; (2) identify alignment and divergence in TRE practices; (3) learn from complementary efforts across institutions around the globe; and (4) build a community of practice committed to trustworthy integrated science in the age of AI. We invite an audience with interests in these topics to participate in advancing these outcomes.
Paper
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
Livestreamed
Recorded
TP
DescriptionWe propose Tensor-Trained Low-Rank Adaptation Mixture-of-Experts (TT-LoRA MoE), a novel computational framework integrating parameter-efficient fine-tuning (PEFT) with sparse MoE routing to address scalability challenges in large model deployments. Unlike traditional MoE approaches, which face substantial computational overhead as expert counts grow, TT-LoRA MoE decomposes training into two distinct, optimized stages. First, we independently train lightweight, tensorized low-rank adapters (TT-LoRA experts), each specialized for specific tasks. Subsequently, these expert adapters remain frozen, eliminating inter-task interference and catastrophic forgetting in multi-task settings. A sparse MoE router, trained separately, dynamically leverages base model representations to select exactly one specialized adapter per input at inference time, automating expert selection without explicit task specification. Comprehensive experiments confirm our architecture retains the memory efficiency of low-rank adapters, seamlessly scales to large expert pools, and achieves robust task-level optimization. This structured decoupling significantly enhances computational efficiency and flexibility, enabling practical and scalable multi-task inference deployments.
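A toy sketch of the routing idea (not the TT-LoRA MoE code): a separately trained router scores the base-model representation and picks exactly one frozen task adapter per input, so no explicit task label is needed at inference.

```python
# Top-1 routing over frozen expert adapters; all shapes and adapters are toys.
import numpy as np

def route(hidden, router_weights, adapters):
    scores = router_weights @ hidden            # one score per expert adapter
    expert = int(np.argmax(scores))             # top-1: exactly one adapter fires
    return adapters[expert](hidden), expert

adapters = [lambda h: h + 1.0, lambda h: h * 2.0, lambda h: -h]  # stand-ins for frozen experts
hidden = np.random.rand(16)                     # stand-in base-model representation
router_weights = np.random.rand(3, 16)
output, chosen = route(hidden, router_weights, adapters)
print("routed to expert", chosen)
```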
Paper
Applications
Architectures & Networks
HPC for Machine Learning
Livestreamed
Recorded
TP
DescriptionFourier neural operators (FNOs) are widely used for learning partial differential equation solution operators. However, FNOs lack architecture-aware optimizations, with their Fourier layers executing FFT, filtering, GEMM, zero padding, and iFFT as separate stages, incurring multiple kernel launches and significant global memory traffic. We propose TurboFNO, the first fully fused FFT-GEMM-iFFT GPU kernel with built-in FFT optimizations. We first develop FFT and GEMM kernels from scratch, achieving performance comparable to cuBLAS and cuFFT. Additionally, our FFT integrates a built-in high-frequency truncation, input zero-padding, and pruning feature to avoid additional memory copy kernels. To fuse FFT and GEMM, we propose an FFT variant where a threadblock iterates over the hidden dimension to align with GEMM’s k-loop, along with two shared memory swizzling patterns that ensure 100% bank utilization when forwarding FFT output to GEMM and retrieving results for iFFT. Experimental results show TurboFNO outperforms PyTorch, cuBLAS, and cuFFT by up to 150%.
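For orientation, the unfused reference below spells out the five stages that TurboFNO fuses into a single GPU kernel; it is a NumPy sketch with hypothetical shapes, not the paper's CUDA implementation.

```python
# Reference (unfused) Fourier layer: FFT -> truncate -> per-mode GEMM ->
# zero-pad -> iFFT, each stage a separate kernel in baseline FNO stacks.
import numpy as np

def fourier_layer(u, weights, modes):
    # u: (channels, n) real signal; weights: (modes, in_channels, out_channels)
    u_hat = np.fft.rfft(u, axis=-1)                  # 1) FFT
    kept = u_hat[:, :modes]                          # 2) high-frequency truncation
    mixed = np.einsum("kio,ik->ok", weights, kept)   # 3) channel-mixing GEMM per mode
    out_hat = np.zeros_like(u_hat)                   # 4) zero padding back to full spectrum
    out_hat[:, :modes] = mixed
    return np.fft.irfft(out_hat, n=u.shape[-1], axis=-1)  # 5) iFFT

u = np.random.rand(4, 64)
w = np.random.rand(16, 4, 4) + 0j
print(fourier_layer(u, w, modes=16).shape)           # (4, 64)
```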
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionWe here visualize a snapshot in time from one of the largest numerical simulations of stratified turbulence performed to date, generated by directly simulating the Navier–Stokes equations on the Frontier supercomputer (Oak Ridge National Laboratory) through an INCITE award. Stratified turbulence refers to chaotic motion arising in a fluid of variable density, where buoyancy forces strongly influence the flow dynamics. This type of turbulence plays a central role in a variety of natural and industrial processes, such as influencing the dispersion of heat and pollutants in the ocean and atmosphere, but remains poorly understood due to the vast range of interacting length scales that must be resolved.
A zoomed-in vertical slice of the 6 trillion grid point simulation is shown here, yielding unprecedented resolution into the rich variety of flow structures underpinning the turbulence. Energy is injected at large scales to drive the turbulence, which cascades down through progressively smaller structures until eventually being dissipated by viscosity. The colors in the visualization show perturbations to the fluid’s density relative to the stable background gradient: red and blue indicate lighter and denser fluid, respectively.
This simulation is one of the first to fully resolve stratified turbulence at high Prandtl number, meaning that momentum diffuses significantly faster than density, as is characteristic of ocean flows. A challenge in simulating high Prandtl flows is that the density field develops extremely fine structures that require immense resolution to capture, necessitating the use of Frontier.
Workshop
Partially Livestreamed
Partially Recorded
TP
W
DescriptionQuantum computers have grown rapidly, enabling execution of complex quantum circuits. However, for most researchers, access to compute time on quantum hardware is limited. Thus we need to build simulators that mimic execution of quantum circuits on noisy quantum hardware efficiently.
Here, we propose TUSQ, which can perform noisy simulation of up to 30-qubit Adder circuits on a single A100 GPU in less than 820 seconds. To represent stochastic noise channels, we average the output of multiple quantum circuits with fixed noisy gates sampled from the channels. This increases the circuit overhead, which slows down the simulation. To eliminate this overhead, TUSQ uses two modules: Error Characterization (ECM) and Tree-based Execution (TEM).
The ECM tracks the number of unique circuit executions needed to accurately represent the noise. This is followed by the TEM, which reuses computation across the reduced circuits. We evaluate TUSQ and report average speedups of 52.5× and 12.53× over Qiskit and CUDA-Q, respectively.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe inherent wide distribution, heterogeneity, and dynamism of the current and emerging high-performance computing and software environments increasingly challenge cyberinfrastructure facilitators, trainers, and educators. The challenge is how to support and train the current multidisciplinary users and prepare the future educators, researchers, developers, and policymakers to keep pace with the rapidly evolving HPC environments to advance discovery and economic competitiveness for many generations.
The twelfth annual full-day workshop on HPC training and education is an ACM SIGHPC Education Chapter coordinated effort, aimed at fostering more collaborations among the practitioners from traditional and emerging fields to explore educational needs in HPC, to develop and deploy HPC training, and to identify new challenges and opportunities for the latest HPC platforms. The workshop will also be a platform for disseminating results and lessons learned in these areas and will be captured in a special edition of the Journal of Computational Science Education.
Workshop
Livestreamed
Recorded
TP
W
DescriptionHeterogeneous node architectures are ubiquitous in today’s HPC landscape. Exploiting the compute capability, while maintaining code portability and maintainability, necessitates effective accelerator programming approaches. The use of these programming approaches remains a research activity, and there are many possible trade-offs between performance, portability, maintainability, and ease of use that must be considered. Additionally, new heterogeneous computing concepts are being deployed, like ML/AI chips and QPUs, introducing challenges related to algorithms, portability, and standardization of programming models.
The WACCPD workshop highlights the improvements over state-of-the-art through accepted papers and talks. The event will also foster discussion with invited talks and a panel to draw the community’s attention to key areas that will facilitate the transition to accelerator-based HPC, including AI, and quantum computing. The workshop aims to showcase all aspects of innovative language features, lessons learned while using directives/abstractions to migrate scientific code, and experiences using novel accelerator architectures, among others.
Birds of a Feather
Artificial Intelligence & Machine Learning
Livestreamed
Recorded
TP
XO/EX
DescriptionTraditional interconnects no longer meet the performance, latency, and scalability demands of AI systems, prompting the need for new approaches to data movement at the node and data center levels. In response, two new open industry organizations have emerged: UALink and the Ultra Ethernet Consortium (UEC). This BoF will discuss the two organizations' approaches to meeting the growing network demands for AI applications.
This presentation will also explore the architectural challenges driving the need for these new fabrics and explain how UALink and UEC are addressing AI networking challenges.
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionThe Ultra Ethernet Consortium is reimagining Ethernet to surpass InfiniBand—while preserving its legacy of ubiquity and cost-effectiveness. HPE Slingshot is not just aligned with this vision—it’s helping lead it. Slingshot engineers are contributing extensively to the Ultra Ethernet specification, with several foundational components drawn directly from Slingshot’s proven innovations in fabric switching and RDMA NICs. But Slingshot goes further. While Ultra Ethernet raises the industry floor—establishing a genuine path to multi-vendor interoperability for RDMA for the first time—HPE Slingshot pushes the ceiling. It delivers advanced capabilities that far exceed the specification to deliver unmatched scalability, effective congestion control, and price/performance. From exascale systems like El Capitan to mainstream HPC in engineering, energy, and academia, HPE Slingshot is making the vision of Ultra Ethernet a reality today.
Paper
HPC for Machine Learning
State of the Practice
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionLong-context comprehension is a crucial capability for LLMs. Context parallelism and irregular block-sparse attention are two key technologies for accelerating long-context training and inference. Existing context parallelism techniques for attention suffer from poor scalability owing to a common characteristic: a striped-like partition pattern. This pattern causes high communication traffic and inflexible kernel granularity, which in turn result in low single-kernel device utilization.
To address these problems, we propose UltraAttn, a novel context parallelism solution for irregular attention. UltraAttn hierarchically tiles the context to reduce communication cost. UltraAttn also performs context-tiling at the kernel level to adjust the granularity of kernels to trade off between kernel overlap and single-kernel device utilization. UltraAttn executes distributed attention with an ILP-based runtime to optimize latency. We evaluate UltraAttn on 64 GPUs. UltraAttn achieves 5.5× speedup on average in different types of irregular attention over the state-of-the-art context parallelism techniques.
Workshop
Livestreamed
Recorded
TP
W
DescriptionModern high-performance computing (HPC) systems present application developers with increasingly complex memory hierarchies that include multiple types of memory with varying access patterns, capacities, and performance characteristics. Managing these resources efficiently while maintaining code portability across different architectures remains a significant challenge. To address these challenges, Umpire was developed at Lawrence Livermore National Laboratory (LLNL) as an open-source library that provides a unified, portable memory management API for modern HPC platforms with multiple memory devices like NUMA and GPUs. This paper explores Umpire’s design principles, outlines Umpire’s primary performance advantages, and examines how its memory pools can provide speedups of 15x or greater. Next, it demonstrates how its integration with the RAJA Portability Suite enables the development of portable and performant HPC applications. With real-world examples from LLNL’s production codes, Umpire provides a comprehensive solution for managing the challenges of performance portability in modern HPC environments.
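Umpire itself exposes a C++ (and C/Fortran) API; the toy Python sketch below only illustrates the pooling strategy behind the reported speedups — serving many small requests from one up-front allocation instead of repeatedly calling the underlying allocator — and is not Umpire's interface.

```python
# Toy bump-pointer pool: one large allocation up front, cheap sub-allocations after.
import numpy as np

class SimplePool:
    def __init__(self, capacity_bytes):
        self.buffer = np.empty(capacity_bytes, dtype=np.uint8)  # single big allocation
        self.offset = 0

    def allocate(self, nbytes):
        view = self.buffer[self.offset:self.offset + nbytes]    # no new system allocation
        self.offset += nbytes
        return view

pool = SimplePool(1 << 20)
a = pool.allocate(4096)
b = pool.allocate(8192)
print(len(a), len(b), pool.offset)
```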
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionAs large language models (LLMs) grow in parameter count, efficient generation requires inference to scale beyond a single node. Current approaches use tensor parallelism (TP) or pipeline parallelism (PP), but TP incurs high communication volume, while PP suffers from pipeline bubbles and is unsuitable for latency-critical scenarios. We present Yalis (Yet Another LLM Inference System), a lightweight and modular distributed inference framework that performs comparably to existing state-of-the-art systems for offline inference, while enabling rapid prototyping. Using Yalis, we study strong scaling of LLM inference on the Alps and Perlmutter supercomputers, revealing the poor scaling performance of existing parallelism strategies due to high communication overheads. We further compare the all-reduce performance of NCCL and MPI in the small-message regime, finding that while NCCL is efficient intra-node, MPI can outperform it cross-node for messages between 256 KB and 1024 KB. These results motivate the need for communication-efficient parallelism strategies for multi-node LLM inference.
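The cross-node comparison can be reproduced in spirit with a small mpi4py benchmark like the sketch below; the message size is one point in the 256–1024 KB regime discussed above, and the launch command and node placement are left to the system (e.g., `mpirun -n 4 python allreduce_bench.py`).

```python
# Time one small-message all-reduce with MPI; NCCL timing would be done
# separately (e.g., via nccl-tests or a framework collective).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
nbytes = 512 * 1024                         # 512 KB message
send = np.ones(nbytes // 4, dtype=np.float32)
recv = np.empty_like(send)

comm.Barrier()
t0 = MPI.Wtime()
comm.Allreduce(send, recv, op=MPI.SUM)      # the collective being compared
t1 = MPI.Wtime()
if comm.rank == 0:
    print(f"allreduce of {nbytes} bytes took {(t1 - t0) * 1e6:.1f} us")
```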
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionGPGPU-based clusters and supercomputers have grown significantly in popularity over the past decade. While numerous GPGPU hardware counters are available to users, their potential for workload characterization remains underexplored. In this work, we analyze previously overlooked GPU hardware counters collected via the Lightweight Distributed Metric Service on Perlmutter. We examine spatial imbalance, defined as uneven GPU usage within the same job, and perform a temporal analysis of how counter values change during execution. Using temporal imbalance, we capture deviations from average usage over time. Our findings reveal inefficiencies and imbalances that can guide workload optimization and inform future HPC system design.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionLarge language models (LLMs) are increasingly used in HPC for tasks like code generation and analysis, but their internal reasoning remains opaque. To address this, we study three tasks—OpenMP code completion, data race detection, and OMP code generation—using mechanistic interpretability. Sparse autoencoder ablations reveal causal features; function vector injection improves zero-shot predictions; and direction vectors shift the model's output toward a desired behavior or style, even when it is not explicitly stated in the prompt. These methods expose and influence LLM behavior in HPC contexts.
Birds of a Feather
Architectures & Networks
Community Meetings
Livestreamed
Recorded
TP
XO/EX
DescriptionIn order to exploit the capabilities of new HPC systems and to meet their demands in scalability, communication software needs to scale to millions of cores and support applications with adequate functionality. UCX is a collaboration between industry, national labs, and academia that consolidates and provides a unified open-source framework.
The UCX project is managed by the UCF Consortium (http://www.ucfconsortium.org/) and includes members from Los Alamos National Laboratory, Argonne National Laboratory, Ohio State University, AMD, NVIDIA, and more. The session will serve as the UCX community meeting and will introduce the latest developments to HPC developers and the broader user community.
Best Poster Presentations (Research, ACM SRC Grad/Undergrad)
Research & ACM SRC Posters
TP
DescriptionModern HPC applications increasingly use GPUs to solve larger problems with higher accuracy and speed. However, committing resources to these large-scale systems is often costly and time-consuming. Hence, performance modeling enables developers to estimate runtime, analyze scalability, and identify resource bottlenecks in advance. In this work, we propose a unified software ecosystem for end-to-end performance modeling of distributed GPU applications. To this end, we propose a combination of analytical and machine learning-based modeling methodology, and design a comprehensive software stack to combine the various components for implementing such an approach. We validate the proposed framework using two real-life applications and provide performance estimations for the GPU kernel and inter-GPU MPI communications.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionPerformance variability is often a critical issue on GPU-accelerated systems, undermining efficiency and reproducibility. Since large-scale investigations of performance variability on GPU clusters are lacking, we set up a longitudinal experiment on Perlmutter and Frontier. We benchmark representative HPC and AI applications and collect detailed performance data to assess the impact of compute variability, allocated node topology, and network conditions on overall runtime. We also use an ML-based approach to identify potential correlations between these factors and to forecast the execution time. Our analysis identifies network performance as the dominant source of runtime variability. These findings provide crucial insights that can inform the development of future mitigation strategies.
Paper
Architectures & Networks
Livestreamed
Recorded
TP
DescriptionCloud computing and AI workloads are driving unprecedented demand for efficient communication within and across datacenters. However, the coexistence of intra- and inter-datacenter traffic within datacenters, plus the disparity between the RTTs of intra- and inter-datacenter networks, complicates congestion management and traffic routing. Particularly, faster congestion responses of intra-datacenter traffic causes rate unfairness when competing with slower inter-datacenter flows. Additionally, inter-datacenter messages suffer from slow loss recovery and, thus, require reliability. Existing solutions overlook these challenges and handle inter- and intra-datacenter congestion with separate control loops or at different granularities. We propose Uno, a unified system for both inter- and intra-datacenter environments that integrates a transport protocol for rapid congestion reaction and fair rate control with a load-balancing scheme that combines erasure coding and adaptive routing. Our findings show that Uno significantly improves the completion times of both inter- and intra-datacenter flows compared to state-of-the-art methods such as Gemini.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionIntegral field spectroscopy is a powerful technique in observational astrophysics enabling the study of spatially-complex objects like distant strongly-lensed galaxies. Integral field units (IFUs) are an increasingly common addition to many powerful ground- and space-based observatories, which motivates the need for efficient and accurate data reduction pipelines. Using Parsl, we developed a scalable processing pipeline to obtain spatially-resolved calibrated spectra for the integral field fiber head at Magellan (IFU-M) and the Magellan/Michigan Fiber System (M2FS) at Las Campanas Observatory, Chile. To enable fast filtering of cosmic rays, we integrated an Academy agent into the pipeline that can learn the time-evolving parameters of the instrument and accelerate that step by 1.5x while reducing the noise in the output spectra. We scaled the pipeline to 32 nodes, allowing one night of data to be processed in 25 minutes, a 16x speedup.
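The basic Parsl pattern the pipeline builds on is shown below: each reduction step becomes a Python app whose futures express task dependencies. The function names and filtering logic are hypothetical placeholders, not the actual pipeline stages.

```python
# Minimal Parsl workflow: apps return futures, and passing a future into
# another app expresses the dependency; Parsl schedules the tasks in parallel.
import parsl
from parsl import python_app
from parsl.configs.local_threads import config

parsl.load(config)

@python_app
def remove_cosmic_rays(frame):
    return [x for x in frame if x < 1e4]      # stand-in for the real filter

@python_app
def extract_spectrum(cleaned):
    return sum(cleaned) / len(cleaned)        # stand-in for calibration/extraction

frames = [[10.0, 20.0, 5e4], [12.0, 18.0]]
futures = [extract_spectrum(remove_cosmic_rays(f)) for f in frames]
print([f.result() for f in futures])
```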
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe up-scaling of Python workflows from execution on a local workstation to parallel execution on an HPC system typically faces three challenges: (1) the management of inter-process communication, (2) data storage, and (3) the management of task dependencies during execution. These challenges commonly lead to a rewrite of major parts of the reference serial Python workflow to improve computational efficiency. Executorlib addresses these challenges by extending Python’s ProcessPoolExecutor interface to distribute Python functions on HPC systems. It interfaces with the job scheduler directly without the need for a database or daemon process, leading to seamless up-scaling.
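Since the abstract describes Executorlib as extending Python's ProcessPoolExecutor interface, the baseline pattern is shown below with the standard library only; under that design, up-scaling amounts to swapping the executor class (Executorlib's own class names are not shown here).

```python
# Standard-library executor pattern that Executorlib extends to HPC schedulers.
from concurrent.futures import ProcessPoolExecutor

def energy(structure_id):
    return structure_id ** 2          # stand-in for an expensive calculation

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(energy, i) for i in range(8)]
        print([f.result() for f in futures])
```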
Paper
Applications
Architectures & Networks
Livestreamed
Recorded
TP
DescriptionApproximate Nearest Neighbor Search (ANNS) is a critical component of modern AI systems, such as recommendation engines and retrieval-augmented large language models (RAG-LLMs). However, scaling ANNS to billion-entry datasets exposes critical inefficiencies: CPU-based solutions are bottlenecked by memory bandwidth limitations, while GPU implementations underutilize hardware resources, leading to suboptimal performance and energy consumption. We introduce UpANNS, a novel framework leveraging Processing-in-Memory (PIM) architecture to accelerate billion-scale ANNS. UpANNS integrates four key innovations: architecture-aware data placement to minimize latency through workload balancing; dynamic resource management for optimal PIM utilization; co-occurrence optimized encoding to reduce redundant computations; and an early-pruning strategy for efficient top-k selection. Evaluation on commercial UPMEM hardware demonstrates that UpANNS achieves 4.3x higher QPS than CPU-based Faiss, while matching GPU performance with 2.3x greater energy efficiency. Its near-linear scalability ensures practicality for growing datasets, making it ideal for applications like real-time LLM serving and large-scale retrieval systems.
Workshop
Livestreamed
Recorded
TP
W
DescriptionThe rise of AI and the economic dominance of cloud computing have created a new nexus of innovation for high performance computing (HPC), which has a long history of driving scientific discovery. Beyond performance needs, scientific workflows increasingly demand capabilities of cloud environments: portability, reproducibility, dynamism, and automation. As converged cloud-HPC environments emerge, there is growing need to study their suitability for HPC use cases. Here we present a cross-platform usability study that assesses 11 different HPC proxy applications and benchmarks across three clouds (Microsoft Azure, Amazon Web Services, and Google Cloud), six environments, and two compute configurations (CPU and GPU) against on-premises HPC clusters at the Lawrence Livermore National Laboratory. We perform application scaling tests in all environments up to 28,672 CPUs and 256 GPUs. We present methodology and results to guide future study and provide a foundation to define best practices for running HPC workloads in the cloud.
Workshop
Debugging & Correctness Tools
HPC Software & Runtime Systems
Livestreamed
Recorded
TP
W
In-person
DescriptionWe examine the code generator-based MPI correctness benchmark MPI-BugBench (MBB) by analyzing the code coverage it triggers in three tools: MUST, PARCOACH, and clang-tidy. Our analysis complements MBB’s design, which prunes potentially exhaustive test sets based on real-world MPI usage. Our assessment identifies two key limitations in the generated tests: incomplete coverage of MPI features, such as varying-count collectives, and limited structural diversity of the generated tests, such as a lack of loops and a lack of array-based MPI handles. We find that increasing test volume alone offers limited benefit for exercising the tools' analysis code in our assessment.
To address these gaps, we propose a new generation level with the missing features and more varied code structures. To that end, we implemented 34 additional tests to exercise previously uncovered analysis code, adding as many as 770 lines of code coverage in MUST with a single test for varying-count collectives.
Tutorial
Livestreamed
Recorded
TUT
DescriptionThe use of containers has revolutionized the way in which industries and enterprises have developed and deployed computational software and distributed systems. This containerization model has gained traction within the HPC community as well, with the promise of improved reliability, reproducibility, portability, and levels of customization that were not previously possible on supercomputers. This adoption has been enabled by a number of HPC container runtimes that have emerged, including Singularity, Shifter, Sarus, Podman, and others. This hands-on tutorial looks to train users on the use of containers for HPC use cases. We will provide a detailed background on Linux containers, along with an introductory hands-on experience building a container image, sharing the container, and running it on an HPC cluster. Furthermore, the tutorial will provide more advanced information on how to run MPI-based and GPU-enabled HPC applications, how to optimize I/O intensive workflows, and how to set up GUI-enabled interactive sessions. Cutting-edge examples will include machine learning and bioinformatics. Users will leave the tutorial with a solid foundational understanding of how to utilize containers on HPC resources using Podman, Shifter, and Singularity, and in-depth knowledge to deploy custom containers on their own resources.
Best Poster Presentations (Research, ACM SRC Grad/Undergrad)
Research & ACM SRC Posters
TP
DescriptionModern GPUs play a crucial role in accelerating a wide range of computational workloads. However, their performance is often limited by the memory access patterns of the kernels they execute. AMD’s MI300A APU supports multiple logical GPU partitioning modes to optimize compute resource allocation, offering new opportunities for performance tuning. In this work, we evaluate how different GPU kernels from the RAJA Performance Suite perform in various partitioning modes. Using hardware counters, we compare two kernels with identical computational complexity but different data layouts, highlighting how memory organization can influence performance outcomes. The results demonstrate that data layout and access patterns have a significant impact on runtime performance across different partitioning modes, even when computational complexity and problem size remain constant.
Birds of a Feather
Artificial Intelligence & Machine Learning
Livestreamed
Recorded
TP
XO/EX
DescriptionThis BoF explores the emerging role of AI inference services in the academic HPC community. Participants will share use cases, best practices, and strategies for deploying inference across research, education, and operations, including specialized applications such as indigenous language models. Through lightning talks, live polls, and interactive panel discussions, the session aims to identify shared challenges, opportunities for service sharing, and the academic value proposition of inference in science. The ultimate goal is to foster a cross-institutional “community of practice” among global academic and governmental HPC centers.
Paper
Algorithms
Livestreamed
Recorded
TP
DescriptionSchur complement matrices emerge in many domain decomposition methods that can utilize supercomputers to solve complex engineering problems. As most of the performance of today's high-performance clusters lies in their GPUs, these methods should also be accelerated.
Typically, the offloaded components are the explicitly assembled dense Schur complement matrices used later in the iterative solver for multiplication with a vector. As the explicit assembly is expensive, it adds a significant overhead to this approach of acceleration. It has already been shown that the overhead can be minimized by assembling the Schur complements directly on the GPU.
This paper shows that the GPU assembly can be further improved by wisely utilizing the matrix sparsity. In the context of FETI, we achieved a speedup of 5.1 in the GPU section of the code and 3.3 for the whole assembly, making the acceleration beneficial from as few as 10 iterations for subdomains with 1,000-70,000 unknowns.
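For readers unfamiliar with the object being assembled, the standard definition is given below: with each subdomain system partitioned into interior (i) and interface (b) unknowns, the Schur complement S is the dense interface operator applied repeatedly inside the iterative FETI solver.

```latex
% Schur complement of the interior block in a 2x2-partitioned subdomain system.
\[
K = \begin{pmatrix} K_{ii} & K_{ib} \\ K_{bi} & K_{bb} \end{pmatrix},
\qquad
S = K_{bb} - K_{bi}\, K_{ii}^{-1}\, K_{ib}
\]
```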
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionProof-of-work blockchains, like Bitcoin, consume substantial energy, motivating greener alternatives such as proof-of-space (PoSp), which relies on storage rather than computation. Existing PoSp implementations face scalability challenges due to high memory and I/O requirements, especially when generating large plots needed for fast lookups.
We present VaultX Merge, a novel out-of-memory PoSp plot generation method. VaultX Merge first creates multiple small in-memory subplots and then merges them into large plots, reducing redundant storage I/O, minimizing data size, and improving lookup latency. Our approach enables efficient operation across a range of devices, from small nodes like Raspberry Pis to high-end servers.
In our poster, we will demonstrate VaultX Merge’s performance across different hardware and storage configurations, showing up to 50% faster plot generation compared to previous out-of-RAM implementations, and highlight how this approach facilitates scalable, energy-efficient blockchain participation.
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
DescriptionPneumonia is a dreadful condition that is the primary cause of death globally for individuals of all ages, but it is especially dangerous for small children who are younger than five. The radiological results obtained from an X-ray could lead to mistakes, incorrect diagnoses, and unnecessary delays. Datasets with chest X-ray were acquired from a Hopkins Diagnostic Center and 10 classifiers were applied. This work aims to develop ensemble machine and transfer learning models to classify viral pneumonia disease and apply ensemble techniques to the models. The models incorporate a variety of machine learning approaches, including k-nearest neighbors (KNN), decision tree (DT), random forest (RF), logistic regression (LR), and support vector machine (SVM). Furthermore, the transfer learning approach is used on the deep learning architectures VGG-19, DenseNet-121, GoogLeNet, AlexNet, and MobileNet-V2. The Keras code backend was implemented using TensorFlow.
On the general models' performance: on the local dataset (n = 1,113), SVM, KNN, RF, LR, AlexNet, and GoogLeNet performed best, at 97%, 98%, 95%, 94%, 99%, and 100%, respectively, while DT, MobileNet, VGG-19, and DenseNet performed lowest, at 89%, 80%, 77%, and 70%. The max-voting ensemble yielded 97% and the weighted-average ensemble 98%. The analysis revealed the strong classification ability of the seven best-performing models. These classification models for viral pneumonia enhance clinical practice by enabling improved interpretation of results, early prediction and detection, and life-saving interventions.
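As an illustration only (not the authors' code), the sketch below builds a max-voting and a weighted-average ensemble over a few of the named classifiers with scikit-learn; the synthetic features stand in for whatever representation was extracted from the chest X-rays, and the ensemble weights are arbitrary placeholders.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    # Stand-in features; in the study these would come from the chest X-ray images.
    X, y = make_classification(n_samples=1113, n_features=64, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    members = [("svm", SVC(probability=True)),
               ("knn", KNeighborsClassifier()),
               ("rf", RandomForestClassifier()),
               ("lr", LogisticRegression(max_iter=1000))]

    max_vote = VotingClassifier(members, voting="hard")      # max-voting ensemble
    weighted = VotingClassifier(members, voting="soft",
                                weights=[3, 3, 2, 2])        # weighted-average ensemble
    for name, model in [("max voting", max_vote), ("weighted average", weighted)]:
        model.fit(X_tr, y_tr)
        print(name, model.score(X_te, y_te))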
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionThe image depicts the flow of wind through a wind farm comprising four five-megawatt wind turbines. The flow and turbine motions are shown in real time. The blades and tower are in a deflected state due to fluid-structure interaction. Vortical flow structures are colored by velocity magnitude to indicate turbulent high-speed flow regions.
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionLocated at the Texas A&M University System RELLIS Campus, the Detonation Research Test Facility (DRTF) exemplifies pioneering research, led by Dr. Elaine Oran, a world-renowned authority on the physics of explosions. The DRTF is dedicated to understanding the intricate dynamics of detonations, where flammable gases and materials interact under extreme conditions to produce powerful and potentially devastating outcomes.
At the heart of this facility lies a massive steel tube, stretching 150 meters in length with a diameter of 2 meters and reinforced by 3/4-inch thick walls. This immense structure is designed to safely contain and replicate the magnitude of real-world detonation events, providing researchers with the rare opportunity to observe and analyze explosive phenomena at scale.
Complementing this experimental infrastructure, Dr. Jian Tao leads a groundbreaking initiative to create a digital twin of the DRTF. This virtual replica is being developed at the Texas A&M RELLIS Campus to enable near-real-time simulation, predictive analysis, and operational optimization. By integrating high performance computing and advanced visualization, the digital twin promises to transform the way researchers model, test, and refine their understanding of detonation processes, enhancing both safety and scientific insight.
The 3D model showcased here was created by Sina Alidoust Salimi and Britain Thomas, master’s students in the College of Performance, Visualization and Fine Arts. Their contribution highlights the power of interdisciplinary collaboration at Texas A&M, where art, science, and technology converge to advance research capabilities and safeguard future innovations.
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Description3D volumetric clouds (VDBs) are rendered from particles. The clouds are then composited into a still frame from NVIDIA's Omniverse Flight example USD project. When animated, the camera flies through the 3D cloud formation, revealing the internal structure of the clouds.
Early Career
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionThis session will feature a diverse, moderated panel of individuals from academia, government, and industry, who will share their career stories and answer questions from participants.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionWithin evolving microbial populations, genes that elevate mutation rate impose a fundamental trade-off: on one hand, increasing harmful mutations among offspring, but, on the other, allowing more opportunities for rare beneficial mutations. Existing single-CPU agent-based simulation work suggests that increased population size should generally favor the proliferation of mutator alleles due to "hitch-hiking" effects associated with beneficial mutation discovery. However, contrary to this expectation, mutator takeover is often not observed in the large asexual populations found in nature. To address this knowledge gap, we leveraged the 850,000-processor Cerebras Wafer-Scale Engine (WSE) to increase simulation scale to populations of up to 1.5 billion agents. In benchmarks, the WSE provided a 294× speedup over GPU and a 111,091× speedup over single-core CPU execution. Among other results, our experiments indicate that limitation of adaptive potential (i.e., few beneficial mutations available) can produce a third regime in which complete mutator allele takeover becomes disfavored at very large population sizes.
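For readers who want a concrete picture of the model class (this is not the Cerebras implementation), here is a toy Wright-Fisher-style sketch of a population carrying a mutator allele: mutators mutate more often, most mutations are deleterious, and rare beneficial mutations let mutator lineages hitch-hike. All rates, fitness effects, and sizes are arbitrary placeholders.

    import random

    def step(pop, base_mu=1e-3, mutator_boost=100.0, p_beneficial=0.01):
        # One generation of a toy Wright-Fisher model. Each agent is a pair
        # (fitness, is_mutator); mutators mutate far more often, and most
        # mutations are deleterious while a few are beneficial.
        offspring = random.choices(pop, weights=[f for f, _ in pop], k=len(pop))
        next_pop = []
        for fitness, is_mutator in offspring:
            mu = base_mu * (mutator_boost if is_mutator else 1.0)
            if random.random() < mu:
                fitness *= 1.1 if random.random() < p_beneficial else 0.95
            next_pop.append((fitness, is_mutator))
        return next_pop

    pop = [(1.0, i < 50) for i in range(1000)]       # start with 5% mutators
    for _ in range(200):
        pop = step(pop)
    print("mutator frequency:", sum(m for _, m in pop) / len(pop))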
Workshop
Partially Livestreamed
Partially Recorded
TP
W
DescriptionServerless LLM serving lowers costs by elastically provisioning GPUs and charging only for usage. However, current systems mostly target cold-start latency, overlooking inefficiencies: (i) static, exclusive GPU allocation that wastes compute resources and increases costs, and (ii) fixed hardware-controlled clock speeds that waste energy. Our analysis shows many LLM workloads can meet SLOs with partial SM allocations and reduced clock speeds, enabling GPU multiplexing and dynamic clock scaling. We present WAGES, a workload-aware GPU sharing system that uses NVIDIA MPS to co-locate LLMs, dynamically adjusting SM partitions and clock speeds to workload needs while meeting SLOs. A two-tier scheduler coordinates global GPU consolidation and local SLO-aware tuning, overlapping model/KV migration with execution to reduce reconfiguration overhead. On real LLM traces, WAGES improves SLO attainment by up to 4% over prior GPU sharing approaches and reduces energy use by up to 26%.
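A hedged sketch of the two knobs the abstract refers to, using only standard NVIDIA interfaces: the MPS environment variable CUDA_MPS_ACTIVE_THREAD_PERCENTAGE caps a client's SM share, and nvidia-smi --lock-gpu-clocks pins the clock. The serve_llm.py command and the specific values are hypothetical, and WAGES's scheduler and SLO policies are not reproduced here.

    import os
    import subprocess

    def launch_with_sm_share(cmd, sm_percent):
        # Launch a serving process under MPS with a capped SM share; the standard
        # MPS knob for this is the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE variable.
        env = dict(os.environ, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(sm_percent))
        return subprocess.Popen(cmd, env=env)

    def lock_gpu_clock(mhz, gpu=0):
        # Pin the GPU graphics clock via nvidia-smi (requires admin privileges).
        subprocess.run(["nvidia-smi", "-i", str(gpu),
                        "--lock-gpu-clocks", f"{mhz},{mhz}"], check=True)

    # Example: co-locate two (hypothetical) LLM servers at 60%/40% of the SMs
    # and a reduced, fixed clock instead of the hardware-controlled default.
    lock_gpu_clock(1200)
    p1 = launch_with_sm_share(["python", "serve_llm.py", "--model", "a"], 60)
    p2 = launch_with_sm_share(["python", "serve_llm.py", "--model", "b"], 40)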
Paper
Algorithms
Livestreamed
Recorded
TP
DescriptionThe Single-Source Shortest Path (SSSP) problem is a fundamental graph problem with an extensive set of real-world applications. State-of-the-art parallel algorithms for SSSP, such as the ∆-stepping algorithm, create parallelism through priority coarsening. Priority coarsening results in redundant computations that diminish the benefits of parallelization and limit parallel scalability.
This paper introduces Wasp, a novel SSSP algorithm that reduces parallelism-induced redundant work by utilizing asynchrony and an efficient priority-aware work stealing scheme. Contrary to previous work, Wasp introduces redundant computations only when threads have no high-priority work locally available to execute. This is achieved by a novel priority-aware work stealing mechanism that controls the inefficiencies of indiscriminate priority coarsening.
Experimental evaluation shows competitive or better performance compared to GAP, GBBS, MultiQueues, Galois, ∆*-stepping, and ρ-stepping on 13 diverse graphs with geometric mean speedups of 2.26x on AMD Zen 3 and 2.16x on Intel Sapphire Rapids using 128 threads.
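For context, a minimal sequential Delta-stepping sketch follows; it illustrates the priority-coarsening baseline (bucketed relaxation of light and heavy edges) that Wasp's asynchronous, priority-aware work stealing improves on. It is not Wasp itself, and the example graph and Delta value are toy placeholders.

    import math

    def delta_stepping(graph, source, delta):
        # Sequential Delta-stepping SSSP on graph = {u: [(v, weight), ...]}.
        # Vertices are grouped into distance buckets of width delta; this is
        # the priority coarsening that parallel variants build on.
        dist = {u: math.inf for u in graph}
        buckets = {}

        def relax(v, d):
            if d < dist.get(v, math.inf):
                old = dist.get(v, math.inf)
                if old < math.inf and int(old // delta) in buckets:
                    buckets[int(old // delta)].discard(v)
                dist[v] = d
                buckets.setdefault(int(d // delta), set()).add(v)

        relax(source, 0.0)
        i = 0
        while any(buckets.values()):
            while not buckets.get(i):
                i += 1
            settled = set()
            while buckets.get(i):                    # light edges may re-enter bucket i
                frontier = buckets.pop(i)
                settled |= frontier
                for u in frontier:
                    for v, w in graph.get(u, []):
                        if w <= delta:
                            relax(v, dist[u] + w)
            for u in settled:                        # heavy edges go to later buckets
                for v, w in graph.get(u, []):
                    if w > delta:
                        relax(v, dist[u] + w)
            i += 1
        return dist

    print(delta_stepping({0: [(1, 1.0), (2, 4.0)], 1: [(2, 1.0)], 2: []}, 0, 2.0))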
Workshop
Livestreamed
Recorded
TP
W
DescriptionWe present a parallel, hash-based software library, HashBrick, for sparse, block-structured applications on CPUs and NVIDIA GPUs. We use a brick-based layout, where data is aggregated into small regular bricks, and hashes hide the complexity of neighbor indexing. This exposes an extra level of flexibility for irregular data and avoids packing communication buffers for ghost zones. One-sided NVSHMEM is used for GPU-GPU communication of an irregular distribution of ghost bricks. Weak scaling experiments using a high-order CFD application and a Jacobi iteration benchmark were run on an NVIDIA GH200 cluster. For constant problem size per node, the computation time is constant, and communication scales well despite load imbalance. We find that the variation in the distribution of ghost bricks broadly correlates with scaling efficiency. Our results show that MPI and NVSHMEM have similar scaling, but MPI wall-clock times are 64-84% higher for these experiments.
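A toy analogue of the brick-plus-hash layout (illustrative only, not the HashBrick API): bricks are small dense blocks stored in a hash map keyed by their brick coordinates, so a stencil reaches neighboring values, including values that live in other bricks, through plain hash lookups, and bricks that do not exist simply fall back to a default value.

    import numpy as np

    BRICK = 4  # 4x4x4 bricks

    class BrickGrid:
        # Sparse block-structured field: only bricks that exist are stored,
        # keyed by their brick coordinates in a hash map.
        def __init__(self):
            self.bricks = {}

        def brick_of(self, i, j, k):
            return (i // BRICK, j // BRICK, k // BRICK)

        def set(self, i, j, k, value):
            key = self.brick_of(i, j, k)
            brick = self.bricks.setdefault(key, np.zeros((BRICK, BRICK, BRICK)))
            brick[i % BRICK, j % BRICK, k % BRICK] = value

        def get(self, i, j, k, default=0.0):
            brick = self.bricks.get(self.brick_of(i, j, k))
            if brick is None:
                return default                       # missing brick (or ghost not yet received)
            return brick[i % BRICK, j % BRICK, k % BRICK]

        def laplacian(self, i, j, k):
            # 7-point stencil; neighbor bricks are found through the hash map.
            c = self.get(i, j, k)
            return (self.get(i + 1, j, k) + self.get(i - 1, j, k) +
                    self.get(i, j + 1, k) + self.get(i, j - 1, k) +
                    self.get(i, j, k + 1) + self.get(i, j, k - 1) - 6 * c)

    g = BrickGrid()
    g.set(3, 3, 3, 1.0)
    print(g.laplacian(4, 3, 3))   # the neighbor lookup crosses a brick boundary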
Early Career
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionWelcome
Committee introductions
Icebreaker
Program overview
Q&A
Workshop
Livestreamed
Recorded
TP
W
DescriptionThis session will explore how quantum computing is rapidly evolving from a specialized research domain into an integral part of the high-performance computing (HPC) landscape. We will examine the convergence of classical supercomputing and quantum architectures, highlighting systems such as Fugaku, Frontier, and IBM Quantum System Two, and discuss how hybrid orchestration across CPUs, GPUs, and QPUs is redefining scientific computing. The session will trace the engineering milestones on the path toward fault-tolerant quantum computing by 2029, emphasizing the role of open-source frameworks like Qiskit and new workload-management integrations that make quantum resources first-class citizens in HPC environments. By presenting a full-stack view, from hardware and middleware to compilers and algorithms, we’ll demonstrate how HPC-native quantum computing can accelerate breakthroughs in chemistry, optimization, materials science, and AI, ushering in a new era of quantum-centric supercomputing.
Workshop
Livestreamed
Recorded
TP
W
DescriptionUnderstanding how quantum and classical high-performance computing can work together is important to unlock the full potential of quantum computing. Together with a panel of experts, we discuss use cases for hybrid quantum-classical workflows, and how existing HPC centers can adopt the technology required to implement quantum-centric supercomputing at scale.
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionIncorporating quantum computers into high performance computing (HPC) environments (commonly referred to as HPC+QC integration) marks a pivotal step in advancing computational capabilities for scientific research. Here we report on the integration of a superconducting 20-qubit quantum computer into the HPC infrastructure at Leibniz Supercomputing Centre (LRZ), one of the first practical implementations of its kind. This yielded four key lessons: (1) quantum computers have stricter facility requirements than classical systems, yet their deployment in HPC environments is feasible when preceded by a rigorous site survey to ensure compliance; (2) quantum computers are inherently dynamic systems that require regular recalibration that is automatic and controllable by the HPC scheduler; (3) redundant power and cooling infrastructure is essential; and (4) effective hands-on onboarding should be provided for both quantum experts and new users. The identified conclusions provide a roadmap to guide future HPC center integrations.
Paper
Algorithms
Applications
State of the Practice
Livestreamed
Recorded
TP
DescriptionOver nearly the last two decades, lossy compression has become an essential aspect of HPC applications' data pipelines, allowing them to overcome limitations in storage capacity and bandwidth and, in some cases, to increase computational throughput and capacity. However, adopting lossy compression brings with it the requirement to assess and control its impact on scientific outcomes.
In this work, we take a major step forward in describing the state of practice and characterizing workloads. We examine applications' needs and compressors' capabilities across nine different supercomputing application domains. We present 25 takeaways that provide best practices for applications, operational impacts for facilities achieving compressed data, and gaps in application needs not addressed by production compressors that point towards opportunities for future compression research.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionWe tackle the challenge of breadth-first traversal (BFT) on sparse graphs with a high number of connected components. We propose a novel distributed-memory parallel algorithm that uses the label propagation (LP) algorithm to perform BFT on all connected components of the graph simultaneously. In synthetic benchmarks with RMAT-like graphs, we show that our LP-based algorithm can be up to 77x faster compared to the parallel direction-optimized BFS in the Combinatorial BLAS library, while scaling up to 1.5k CPU cores.
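A minimal sequential sketch of the label-propagation idea (not the distributed-memory, Combinatorial BLAS-scale implementation): every vertex starts with its own label and repeatedly adopts the minimum label in its neighborhood, so all connected components are traversed simultaneously.

    def label_propagation_components(adj):
        # adj: {u: iterable of neighbors}. Returns {u: component label}.
        # Each vertex repeatedly adopts the minimum label in its neighborhood,
        # so every connected component converges to one label in parallel.
        labels = {u: u for u in adj}
        changed = True
        while changed:
            changed = False
            for u, neighbors in adj.items():
                best = min([labels[u]] + [labels[v] for v in neighbors])
                if best < labels[u]:
                    labels[u] = best
                    changed = True
        return labels

    print(label_propagation_components({0: [1], 1: [0], 2: [3], 3: [2], 4: []}))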
Birds of a Feather
Emerging Hardware & Software Technologies
Livestreamed
Recorded
TP
XO/EX
DescriptionThanks to the stream of funding provided by the EuroHPC Joint Undertaking, Europe is ready to build its next-generation, world-class HPC and AI systems.
Its project portfolio offers solutions that pave the way towards global leadership and collaboration opportunities in the following complementary areas: DARE and EPI projects – HPC/AI processor development, NET4EXA – high-speed interconnects, and SEANERGYS – dynamic, energy-optimized operation.
Project representatives will showcase their results in order to identify related efforts and technologies overseas (U.S., Asia, Japan, India, other regions). The outcomes will be documented in an ETP4HPC white paper on international collaboration.
Workshop
Partially Livestreamed
Partially Recorded
TP
W
DescriptionIn this session, one of the WHPC Global Organization Chapters will be showcasing its activities for the community.
Workshop
Partially Livestreamed
Partially Recorded
TP
W
Workshop
Partially Livestreamed
Partially Recorded
TP
W
Workshop
Partially Livestreamed
Partially Recorded
TP
W
Workshop
Partially Livestreamed
Partially Recorded
TP
W
Workshop
Not Livestreamed
Not Recorded
Partially Livestreamed
Partially Recorded
TP
W
Workshop
Partially Livestreamed
Partially Recorded
TP
W
DescriptionIn this session, we will invite our attendees to engage in a dynamic activity where they will share visions and strategies for building community in groups of three people. Using the Troika Consulting technique, the attendees will be guided to provide feedback to others' practical or imaginative questions within building community and networking strategies. At the end of this session, we hope our attendees will get different perspectives on possible problems they face daily and create connections.
Workshop
Partially Livestreamed
Partially Recorded
TP
W
Workshop
Partially Livestreamed
Partially Recorded
TP
W
Doctoral Showcase
Interactive Research e-Poster
Research & ACM SRC Posters
Not Livestreamed
Not Recorded
TP
DescriptionAs HPC simulations generate ever-larger datasets, reducing the volume of data that must be loaded into compute node memory for analysis has become essential for unlocking insights efficiently. In-storage analysis achieves this by processing data directly at the storage servers, allowing them to return only compact results that match regions of interest instead of raw datasets that may be orders of magnitude larger—significantly reducing data footprint.
This e-poster showcases a novel compute-near-storage architecture based on pNFS that enables secure in-storage analysis of scientific data with industry-standard software, including Arrow, Parquet, Substrait, and DuckDB. Using a real-world asteroid impact dataset, we present a live demonstration of a VTK visualization pipeline modified to offload analysis to pNFS data servers, tracing the aftermath of the impact over time and rendering the results on the client as 3D visuals. We show substantial data reduction by pushing down analysis and transmitting only insight-relevant information.
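To make the pushdown idea concrete, here is a hedged client-side sketch using DuckDB over a Parquet file; the file name and column names are invented for illustration, and in the poster's architecture the equivalent query would be evaluated by the pNFS data servers rather than on the client.

    import duckdb

    # Hypothetical Parquet layout: one row per cell per timestep, with position
    # and field columns (all names here are illustrative only).
    query = """
        SELECT timestep, x, y, z, temperature
        FROM read_parquet('asteroid_impact.parquet')
        WHERE temperature > 5000 AND timestep BETWEEN 100 AND 120
    """
    result = duckdb.sql(query).arrow()   # compact Arrow table of matching cells only
    print(result.num_rows, "rows returned instead of the full dataset")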
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionHeterogeneous architectures integrating CPUs, GPUs, and memory controllers generate diverse traffic patterns that stress the on-chip network. Wireless networks-on-chip (WNoCs) provide fast, single-hop communication across distant nodes. However, their effectiveness is limited by congestion at wireless interfaces (WIs). In this work, we present WiCAT (Wireless Collate and Transfer), a lightweight two-stage framework to mitigate WI bottlenecks without modifying CPUs, GPUs, or memory controllers. WiCAT introduces Collate, a WI-level collation scheme that reduces redundant requests and corresponding reply traffic, and Transfer, a predictive medium access control protocol that dynamically allocates channel time based on both current buffer occupancy and anticipated traffic. Evaluation on Rodinia benchmarks shows that WiCAT reduces average delay by 17.8%, increases network throughput by 64%, and lowers energy consumption by 13.5%.
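A toy sketch of the Collate stage (a simplification, not the WiCAT hardware design): duplicate outstanding requests for the same address are merged at the wireless interface so only the first crosses the wireless channel, and the single reply is fanned out locally to every waiting core.

    from collections import defaultdict

    class WirelessInterface:
        # Toy request collation at a wireless interface (WI): identical
        # outstanding requests are merged so only one crosses the channel.
        def __init__(self):
            self.pending = defaultdict(list)   # address -> requesting cores

        def request(self, core, address):
            first = address not in self.pending
            self.pending[address].append(core)
            return first                        # only the first request is transmitted

        def reply(self, address, data):
            waiters = self.pending.pop(address, [])
            return [(core, data) for core in waiters]   # fan the reply out locally

    wi = WirelessInterface()
    print(wi.request(0, 0x1000))   # True  -> sent over the wireless channel
    print(wi.request(1, 0x1000))   # False -> collated, not re-sent
    print(wi.reply(0x1000, "cacheline"))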
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionThe Sculpting Vis Collaborative developed the software we call Artifact-Based Rendering (ABR), used here to visualize data from the HiGrad/Firetec wildfire model at LANL. We depart from traditional visualization methods by using statistical sampling to produce multiple point sets that represent fire and smoke. An evolving point set is produced by continuously emitting particles from the fire hot-spot on the ground and allowing them to be carried by the wind interpolated through time. The instantaneous velocity of the wind is represented by path lines seeded at two levels on the up-wind face of the simulation space as well as from the hot-spot on the ground. This results in several sets of points and lines that exist in the same space but represent different aspects of the data. ABR enables us to use artist-made glyphs, lines, and textures to differentiate between these geometrical sets, helping the viewer to understand the complex structure of the data.
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
DescriptionA look within the blue marble we call home. HPC simulation data from geophysics shows processes under the Earth's mantle that are used for fundamental research on our planet.
Workshop
Livestreamed
Recorded
TP
W
DescriptionInteractivity enables the exploitation of HPC in new and revolutionary ways, delivering many new and exciting opportunities for our community. Interactive HPC involves users being in the loop during job execution: a human monitors a job, steers the experiment, or visualizes results in order to make immediate decisions that influence the current or subsequent interactive jobs. Likewise, urgent computing combines interactive computational modeling with time-sensitive systems in the real world, such as the near-real-time analysis and detection of unfolding disasters to inform and take real-time decisions and actions. Supporting interactive and urgent workloads on HPC requires expertise in a wide range of areas and the solving of numerous technical and organizational challenges.
This workshop brings together stakeholders, researchers and practitioners from across interactive and urgent computing within the wider HPC community. We will share success stories, case studies and technologies to continue community building around leveraging interactive HPC as an important tool for scientific research, responding to disasters and addressing societal issues.
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
DescriptionRecent scientific workflow management systems, such as Nextflow, place a strong focus on workflow portability. Portability encompasses replacing both the target infrastructure and the input dataset. The more portable systems become, the more important automatic adaptation and optimization become. Strategies to optimize the execution of scientific workflows are often evaluated in simulation, and only for the individual strategy. Accordingly, it is unclear how different strategies affect each other.
In this work, we fuse three strategies to optimize workflow execution: first, WOW, an approach that focuses on location-aware scheduling; second, PONDER, an approach that predicts task memory consumption and sizes tasks accordingly; and third, SCALE, an approach that predicts task CPU usage and sizes tasks accordingly.
We test all three approaches together and investigate their synergies. Our results show that the whole is greater than the sum of its parts. We achieve makespan reductions of up to 67.4%.
Birds of a Feather
Community Meetings
Livestreamed
Recorded
TP
XO/EX
DescriptionThis BoF will convene the workflows community to discuss emerging directions in scientific workflow execution, including agentic workflows, integration of high performance and quantum computing workflows, and coordinated allocation and scheduling across experimental and computing facilities. A central focus will be on ensuring end-to-end resource availability when workflows depend on limited instrument time and distributed infrastructure. The session will also address the need for infrastructure and policy reforms to support intelligent, cross-facility execution. Through interactive discussions, participants will explore collaborative strategies to enable resilient, scalable, and adaptive workflows that meet the evolving demands of scientific discovery.
Flash Session
Not Livestreamed
Not Recorded
TP
XO/EX
DescriptionHigh performance computing (HPC), artificial intelligence (AI), and data analytics are converging in ways that are reshaping research, industry, and government operations. For decades, organizations optimized individual workloads—tuning a simulation, refining a model, or scaling a job. However, as complexity increases, this siloed approach creates bottlenecks, compliance risks, and escalating costs. The real breakthrough comes not from making single workloads faster, but from orchestrating workflows that span the entire value chain.
This flash session explores the transition from the “age of workloads” to the “age of workflows.” Drawing parallels to Henry Ford’s assembly line, we will highlight how workflow thinking accelerates insight, ensures reproducibility, and enables cost control across hybrid and multi-cloud environments. Topics include:
• Workflow orchestration in HPC/AI pipelines using tools such as NextFlow, Snakemake, WDL, Cromwell, and Airflow
• Hybrid flexibility: balancing on-premises HPC with cloud bursting for agility and scale
• Compliance and reproducibility: structured, auditable pipelines that reduce risk in regulated domains
• Knowledge preservation: encoding expert practices into workflows to mitigate the “Silver Tsunami” of retiring talent
• Case studies: accelerating drug discovery, optimizing energy production, and advancing automotive materials design
Attendees will leave with a clear understanding of how workflow optimization transforms HPC from a collection of isolated jobs into a strategic engine for innovation. The imperative is clear: organizations that shift to workflow-driven thinking will accelerate their time-to-market, reduce costs, and build a sustainable competitive advantage in the era of data-intensive science and AI.
Paper
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionToday, cloud workloads are largely opaque to the cloud platform. Typically, the only information the platform receives is the virtual machine (VM) type and possibly a decoration to the type (e.g., the VM is evictable). Similarly, workloads receive minimal information from the platform; generally, only telemetry from their VMs or occasional signals (e.g., just before a VM is evicted). The narrow interface between workloads and platforms has several drawbacks: (1) a surge in VM types and decorations in public cloud platforms complicates customer selection; (2) key workload characteristics (e.g., low availability requirements) are often unspecified, hindering platform customization for optimized resource usage and cost savings; and (3) workloads may be unaware of potential optimizations or lack sufficient time to react to platform events. To resolve these issues and improve cloud efficiency, we propose Workload Sage, a framework for enabling dynamic bi-directional communication between cloud workloads and cloud platform.
Workshop
Livestreamed
Recorded
TP
W
DescriptionWorkshop organizers will provide an overview of the objectives and program as well as the international Trillion Parameter Consortium and opportunities to engage.
Tutorial
Livestreamed
Recorded
TUT
DescriptionSYCL is a programming model that lets developers support a wide variety of devices (CPUs, GPUs, and more) from a single code base. Given the growing heterogeneity of processor roadmaps in both HPC and AI, moving to an open standard, platform-independent model such as SYCL is essential for modern software developers. SYCL has the further advantage of supporting a single-source style of programming using completely standard C++. In this tutorial, we will introduce SYCL and provide programmers with a solid foundation they can build on to gain mastery of this language. The main benefit of using SYCL over other heterogeneous programming models is the single programming language approach, which enables one to target multiple devices using the same programming model, and therefore to write cleaner, more portable, and more readable code. This is a hands-on tutorial. The real learning will happen as students write code. The format will be short presentations followed by hands-on exercises. Hence, attendees will need their own laptops to perform the hands-on exercises.
Paper
BSP
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
Livestreamed
Recorded
TP
DescriptionEmerging expert-specialized Mixture-of-Experts (MoE) architectures, such as DeepSeek-MoE, deliver strong model quality through fine-grained expert segmentation and large top-k routing. However, their scalability is limited by substantial activation memory overhead and costly all-to-all communication. Furthermore, current MoE training systems—primarily optimized for NVIDIA GPUs—perform suboptimally on non-NVIDIA platforms, leaving significant computational potential untapped.
In this work, we present X-MoE, a novel MoE training system designed to deliver scalable training performance for next-generation MoE architectures. X-MoE achieves this via several novel techniques, including efficient padding-free MoE training with cross-platform kernels, redundancy-bypassing dispatch, and hybrid parallelism with sequence-sharded MoE blocks. Our evaluation on the Frontier supercomputer, powered by AMD MI250X GPUs, shows that X-MoE scales DeepSeek-style MoEs up to 545 billion parameters across 1,024 GPUs—10x larger than the largest trainable model with existing methods under the same hardware budget, while maintaining high training throughput.
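For orientation, a small NumPy sketch of fine-grained top-k expert routing follows; it shows the dispatch step whose padding and all-to-all cost X-MoE targets, with made-up tensor sizes, and does not reproduce X-MoE's kernels or parallelism scheme.

    import numpy as np

    def topk_route(tokens, gate_w, k=2):
        # tokens: [n, d] activations; gate_w: [d, n_experts] router weights.
        # Returns per-token expert ids and normalized routing weights, plus the
        # per-expert token lists that a dispatch (all-to-all) step would send.
        logits = tokens @ gate_w                                   # [n, n_experts]
        topk = np.argsort(-logits, axis=1)[:, :k]                  # chosen experts
        scores = np.take_along_axis(logits, topk, axis=1)
        weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
        buckets = {e: np.where((topk == e).any(axis=1))[0]         # tokens per expert
                   for e in range(gate_w.shape[1])}
        return topk, weights, buckets

    rng = np.random.default_rng(0)
    ids, w, buckets = topk_route(rng.standard_normal((8, 16)),
                                 rng.standard_normal((16, 4)), k=2)
    print(ids, {e: len(t) for e, t in buckets.items()})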
Workshop
Livestreamed
Recorded
TP
W
DescriptionX-ray ptychography is becoming an indispensable tool for nanoscale imaging, driving innovation in functional materials, electronics, life sciences, etc. To retrieve sample images, the technique relies on advanced mathematical algorithms, making it computationally intensive. Recent advances in data acquisition have greatly increased the data generation rate, making it challenging to perform reconstruction in a timely manner to support decision making during an experiment. Here, we demonstrate how efficient GPU-based iterative reconstruction algorithms, deployed at the edge, enable real-time feedback during high-speed continuous data acquisition, allowing for a more informed experiment execution and thus increasing the quality and efficiency of the measurements. These developments represent a steppingstone towards augmentation of computationally intensive experiments with data-driven decision making, paving the way for autonomous experiments performed at machine speeds.
Paper
BP
State of the Practice
System Software and Cloud Computing
Livestreamed
Recorded
TP
DescriptionHPC systems and cloud data centers are converging, and containers are becoming the default software deployment method. While containers simplify software management, they face significant performance challenges: they must sacrifice hardware-specific optimizations to achieve portability. Although HPC containers can use runtime hooks to access optimized libraries and devices, they are limited by ABI compatibility and cannot reverse the effects of early-stage compilation decisions. XaaS containers proposed a vision of performance-portable containers, and we present a practical realization with Source and Intermediate Representation (IR) containers. We delay performance-critical decisions until the target system specification is known. We analyze specialization mechanisms in HPC software and propose a new LLM-assisted method for their automatic discovery. By examining the compilation pipeline, we develop a methodology to build containers optimized for target architectures at deployment time. Our prototype demonstrates that new XaaS containers combine the convenience of containerization with the performance benefits of system-specialized builds.
Workshop
Livestreamed
Recorded
TP
W
DescriptionAdvanced scientific applications require coupling distributed sensor networks with centralized high-performance computing facilities. Citrus Under Protective Screening (CUPS) exemplifies this need in digital agriculture, where citrus research facilities are instrumented with numerous sensors monitoring environmental conditions and detecting protective screening damage. CUPS demands access to computational fluid dynamics codes for modeling environmental conditions and guiding real-time interventions like water application or robotic repairs. These computing domains have contrasting properties: sensor networks provide low-performance, limited-capacity, unreliable data access, while high-performance facilities offer enormous computing power through high-latency batch processing. Private 5G networks present novel capabilities addressing this challenge by providing low latency, high throughput, and reliability necessary for near-real-time coupling of edge sensor networks with HPC simulations. This work presents xGFabric, an end-to-end system coupling sensor networks with HPC facilities through Private 5G networks. The prototype connects remote sensors via 5G network slicing to HPC systems, enabling real-time digital agriculture simulation.
Workshop
Livestreamed
Recorded
TP
W
DescriptionAdvancement in computational power and high-speed networking is enabling a new model of scientific experiment, experiment-in-the-loop computing (EILC). In this model, simulation and/or learning modules are run as data is collected from observational and experimental sources. Presently, the amount and complexity of data generated by simulations and by observational and experimental sources, such as sensor networks and large-scale scientific facilities, continues to increase. Several research challenges exist, many of which are independent of the scientific application domain. New algorithms, including artificial intelligence and machine learning algorithms, to merge simulation ensembles and experimental data sets must be developed. Data transfer techniques and workflows must be constructed to control the ensembles and integrate simulated and observed data sets. The Workshop on Extreme-Scale Experiment-in-the-Loop Computing (XLOOP 2025) will be a unique opportunity to promote this interdisciplinary topic area. We invite papers, presentations, and participants from the physical and computer sciences.
Birds of a Feather
Practitioners in HPC
Security & Privacy
Livestreamed
Recorded
TP
XO/EX
DescriptionThis session discusses the critical challenge of integrating Zero Trust (ZT) security into traditional supercomputing environments. The ZT model, based on a least-privilege, per-request architecture, has profound implications for HPC centers, application developers, and end-user workflows. We will explore the fundamentals of ZT, the purpose of NIST SP 800-207, and relevant U.S. Federal mandates. We will discuss current implementation approaches and challenges at major HPC centers. Join this interactive discussion to share your experiences, questions, and solutions.
Paper
Performance Measurement, Modeling, & Tools
Livestreamed
Recorded
TP
DescriptionZero-value propagation is a common phenomenon in modern programs, where redundant operations caused by zero values can severely impact performance. Since zero values are often generated dynamically at runtime, eliminating such redundancies through static analysis alone is challenging. In this paper, we propose an efficient static control data flow analysis algorithm to identify redundancies resulting from zero-value propagation. Based on this algorithm, we design and implement ZeroSpec, a fully automated profile-guided code optimizer that detects zero values at runtime and specializes fast paths for them. To maximize performance gains, ZeroSpec also employs a fine-grained cost model that evaluates the optimization potential of individual zero-value instructions to guide the construction of targeted optimization regions. Evaluation on SPEC CPU2017, NPB and real-world applications demonstrates the effectiveness of ZeroSpec, achieving a maximum performance speedup of 1.31x.
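A deliberately simple analogue of the optimization, in Python rather than at the compiler level where ZeroSpec operates: a fast path skips the multiply-accumulate whenever a runtime operand is zero, which is the kind of zero-value redundancy the profile-guided specializer removes automatically.

    def dot(a, b):
        # Baseline: every element is multiplied, even when one factor is zero.
        return sum(x * y for x, y in zip(a, b))

    def dot_zero_specialized(a, b):
        # Zero-value fast path: when a value observed at runtime is zero, the
        # multiply-accumulate is skipped entirely, mirroring the redundancy
        # removed by profile-guided zero-value specialization.
        acc = 0.0
        for x, y in zip(a, b):
            if x == 0.0 or y == 0.0:     # fast path for propagated zeros
                continue
            acc += x * y
        return acc

    a = [0.0] * 900 + [1.0] * 100
    b = list(range(1000))
    assert dot(a, b) == dot_zero_specialized(a, b)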
Sessions
Workshop
Livestreamed
Recorded
TP
W
Workshop
Performance Evaluation, Scalability, & Portability
Livestreamed
Recorded
TP
W
Awards and Award Talks
Livestreamed
Recorded
TP
ACM Gordon Bell Climate Modeling Finalist
ACM Gordon Bell Finalist
Awards and Award Talks
Applications
Livestreamed
Recorded
TP
ACM Gordon Bell Finalist
Awards and Award Talks
Applications
Livestreamed
Recorded
TP
ACM Gordon Bell Climate Modeling Finalist
Awards and Award Talks
Applications
Livestreamed
Recorded
TP
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Paper
Algorithms
HPC for Machine Learning
Programming Frameworks
Livestreamed
Recorded
TP
Paper
Algorithms
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
State of the Practice
Livestreamed
Recorded
TP
Paper
Applications
Performance Measurement, Modeling, & Tools
State of the Practice
Livestreamed
Recorded
TP
Invited Talk
Livestreamed
Recorded
TP
Paper
Applications
Architectures & Networks
Livestreamed
Recorded
TP
Paper
Architectures & Networks
Livestreamed
Recorded
TP
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Art of HPC
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Art of HPC
Reception
Art of HPC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Paper
HPC for Machine Learning
Programming Frameworks
Livestreamed
Recorded
TP
Awards and Award Talks
Awards Luncheon (Invitation Only)
11:30am - 12:30pm CST Thursday, 20 November 2025 223-224-225-226
Best Poster Presentations (Research, ACM SRC Grad/Undergrad)
Research & ACM SRC Posters
TP
Best Poster Presentations (Research, ACM SRC Grad/Undergrad)
Research & ACM SRC Posters
TP
Invited Talk
Livestreamed
Recorded
TP
Panel
CANCELED: Industry Algorithms Panel
10:30am - 12:00pm CST Friday, 21 November 2025 240-241-242
Embedded and/or Reconfigurable Systems
Ethics & Societal Impact of HPC
Parallel Programming Methods, Models, Languages, & Environments
Livestreamed
Recorded
TP
W
TUT
XO/EX
Canceled
Community Engagement and Support
Childcare
8:00am - 7:00pm CST Wednesday, 19 November 2025 370
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Childcare
8:00am - 9:00pm CST Monday, 17 November 2025 370
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Childcare
8:00am - 7:00pm CST Tuesday, 18 November 2025 370
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Childcare
8:00am - 6:00pm CST Thursday, 20 November 2025 370
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Childcare
8:00am - 6:00pm CST Sunday, 16 November 2025 370
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Paper
Architectures & Networks
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
Programming Frameworks
Livestreamed
Recorded
TP
Community Engagement and Support
Community Engagement and Support Office
8:00am - 12:00pm CST Friday, 21 November 2025 360
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Community Engagement and Support Office
8:00am - 5:00pm CST Thursday, 20 November 2025 360
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Community Engagement and Support Office
8:00am - 5:00pm CST Wednesday, 19 November 2025 360
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Community Engagement and Support Office
8:00am - 5:00pm CST Monday, 17 November 2025 360
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Community Engagement and Support Office
8:00am - 5:00pm CST Tuesday, 18 November 2025 360
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Community Engagement and Support Office
8:00am - 5:00pm CST Sunday, 16 November 2025 360
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Paper
Algorithms
Applications
State of the Practice
Livestreamed
Recorded
TP
Paper
Data Analytics, Visualization & Storage
Livestreamed
Recorded
TP
Paper
State of the Practice
System Software and Cloud Computing
Livestreamed
Recorded
TP
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Invited Talk
Livestreamed
Recorded
TP
Exhibitor Forum
Data Analytics
Livestreamed
Recorded
TP
XO/EX
Exhibitor Forum
Data Analytics
Livestreamed
Recorded
TP
XO/EX
Exhibitor Forum
Data Analytics
Livestreamed
Recorded
TP
XO/EX
Paper
Data Analytics, Visualization & Storage
Livestreamed
Recorded
TP
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
Doctoral Showcase
Research & ACM SRC Posters
Livestreamed
Recorded
TP
Early Career
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Paper
Energy Efficiency
Performance Measurement, Modeling, & Tools
Power Use Monitoring & Optimization
State of the Practice
Livestreamed
Recorded
TP
Exhibitor Forum
Exascale
Livestreamed
Recorded
TP
XO/EX
Workshop
Partially Livestreamed
Partially Recorded
TP
W
Exhibits
Reception
Exhibit Floor Ribbon Cutting
6:45pm - 6:55pm CST Monday, 17 November 2025 Hall 4 Corner Entrance
TP
XO/EX
Exhibits
Reception
Exhibitor Pre-Gala Dinner
5:30pm - 7:00pm CST Monday, 17 November 2025
XO/EX
Reception
Exhibitor Reception
6:00pm - 9:00pm CST Sunday, 16 November 2025 FanDuel Sports Network Live
XO/EX
Community Engagement and Support
Family Day
3:00pm - 6:00pm CST Wednesday, 19 November 2025 Hall 1 Entrance
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Exhibits
Reception
Grand Opening Gala Reception
7:00pm - 9:00pm CST Monday, 17 November 2025 Exhibit Halls
TP
XO/EX
Paper
HPC for Machine Learning
System Software and Cloud Computing
Partially Livestreamed
Partially Recorded
TP
Paper
Algorithms
Applications
Data Analytics
Livestreamed
Recorded
TP
Exhibitor Forum
Hardware and Architecture
Livestreamed
Recorded
TP
XO/EX
HPC Ignites Plenary
Quantum & Other Post Moore Computing Technologies
Livestreamed
Recorded
TP
W
TUT
XO/EX
Workshop
Livestreamed
Recorded
TP
W
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
XO/EX
SCinet
Not Livestreamed
Not Recorded
IndySCC
IndySCC
10:00am - 6:00pm CST Wednesday, 19 November 2025 IndySCC Booth
Not Livestreamed
Not Recorded
TP
XO/EX
IndySCC
IndySCC
10:00am - 6:00pm CST Tuesday, 18 November 2025 IndySCC Booth
Not Livestreamed
Not Recorded
TP
XO/EX
IndySCC
IndySCC Poster Display
10:00am - 6:00pm CST Tuesday, 18 November 2025 IndySCC Booth
Not Livestreamed
Not Recorded
TP
XO/EX
IndySCC
IndySCC Poster Display
7:00pm - 9:00pm CST Monday, 17 November 2025 IndySCC Booth
Not Livestreamed
Not Recorded
TP
XO/EX
IndySCC
IndySCC Poster Display
10:00am - 6:00pm CST Wednesday, 19 November 2025 IndySCC Booth
Not Livestreamed
Not Recorded
TP
XO/EX
Workshop
Partially Livestreamed
Partially Recorded
TP
W
Job Fair
Job Fair
10:30am - 3:00pm CST Wednesday, 19 November 2025 Hall 6 Job Fair
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Invited Talk
Life Sciences
Societal Impact
Livestreamed
Recorded
TP
Paper
HPC for Machine Learning
System Software and Cloud Computing
Livestreamed
Recorded
TP
Paper
Applications
Architectures & Networks
HPC for Machine Learning
Livestreamed
Recorded
TP
Paper
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
Livestreamed
Recorded
TP
Paper
HPC for Machine Learning
Livestreamed
Recorded
TP
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Invited Talk
Livestreamed
Recorded
TP
SCinet Network Research Exhibition
Not Livestreamed
Not Recorded
Exhibitor Forum
Networking
Livestreamed
Recorded
TP
XO/EX
Community Engagement and Support
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Paper
Papers Overflow
10:30am - 12:00pm CST Thursday, 20 November 2025 274
Not Livestreamed
Not Recorded
TP
Paper
Papers Overflow
3:30pm - 5:00pm CST Tuesday, 18 November 2025 276
Not Livestreamed
Not Recorded
TP
Paper
Papers Overflow
10:30am - 12:00pm CST Tuesday, 18 November 2025 274
Not Livestreamed
Not Recorded
TP
Paper
Papers Overflow
3:30pm - 5:00pm CST Wednesday, 19 November 2025 274
Not Livestreamed
Not Recorded
TP
Paper
Papers Overflow
3:30pm - 5:00pm CST Wednesday, 19 November 2025 276
Not Livestreamed
Not Recorded
TP
Paper
Papers Overflow
1:30pm - 3:00pm CST Thursday, 20 November 2025 274
Not Livestreamed
Not Recorded
TP
Paper
Papers Overflow
10:30am - 12:00pm CST Wednesday, 19 November 2025 274
Not Livestreamed
Not Recorded
TP
Paper
Papers Overflow
1:30pm - 3:00pm CST Wednesday, 19 November 2025 274
Not Livestreamed
Not Recorded
TP
Paper
Papers Overflow
10:30am - 12:00pm CST Wednesday, 19 November 2025 276
Not Livestreamed
Not Recorded
TP
Paper
Papers Overflow
1:30pm - 3:00pm CST Tuesday, 18 November 2025 276
Not Livestreamed
Not Recorded
TP
Paper
Papers Overflow
3:30pm - 5:00pm CST Thursday, 20 November 2025 274
Not Livestreamed
Not Recorded
TP
Paper
Papers Overflow
3:30pm - 5:00pm CST Thursday, 20 November 2025 276
Not Livestreamed
Not Recorded
TP
Paper
Papers Overflow
1:30pm - 3:00pm CST Tuesday, 18 November 2025 274
Not Livestreamed
Not Recorded
TP
Paper
Papers Overflow
10:30am - 12:00pm CST Tuesday, 18 November 2025 276
Not Livestreamed
Not Recorded
TP
Paper
Papers Overflow
10:30am - 12:00pm CST Thursday, 20 November 2025 276
Not Livestreamed
Not Recorded
TP
Paper
Papers Overflow
1:30pm - 3:00pm CST Thursday, 20 November 2025 276
Not Livestreamed
Not Recorded
TP
Paper
Papers Overflow
1:30pm - 3:00pm CST Wednesday, 19 November 2025 276
Not Livestreamed
Not Recorded
TP
Community Engagement and Support
Parents Room
8:00am - 6:00pm CST Thursday, 20 November 2025 362
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Parents Room
8:00am - 12:00pm CST Friday, 21 November 2025 362
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Parents Room
8:00am - 7:00pm CST Tuesday, 18 November 2025 362
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Parents Room
8:00am - 7:00pm CST Wednesday, 19 November 2025 362
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Parents Room
8:00am - 9:00pm CST Monday, 17 November 2025 362
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Parents Room
8:00am - 6:00pm CST Sunday, 16 November 2025 362
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Workshop
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Livestreamed
Recorded
TP
W
Paper
Performance Measurement, Modeling, & Tools
System Software and Cloud Computing
Livestreamed
Recorded
TP
Paper
Performance Measurement, Modeling, & Tools
Livestreamed
Recorded
TP
Paper
HPC for Machine Learning
Performance Measurement, Modeling, & Tools
Programming Frameworks
Livestreamed
Recorded
TP
Doctoral Showcase
Interactive Research e-Poster
Not Livestreamed
Not Recorded
TP
Doctoral Showcase
Interactive Research e-Poster
Research & ACM SRC Posters
Not Livestreamed
Not Recorded
TP
Doctoral Showcase
Interactive Research e-Poster
Research & ACM SRC Posters
Not Livestreamed
Not Recorded
TP
Doctoral Showcase
Interactive Research e-Poster
Research & ACM SRC Posters
Not Livestreamed
Not Recorded
TP
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
Doctoral Showcase
Interactive Research e-Poster
Research & ACM SRC Posters
Not Livestreamed
Not Recorded
TP
Research and ACM SRC Posters
Research & ACM SRC Posters
TP
Community Engagement and Support
Prayer Room
8:00am - 6:00pm CST Thursday, 20 November 2025 372
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Prayer Room
8:00am - 12:00pm CST Friday, 21 November 2025 372
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Prayer Room
8:00am - 7:00pm CST Tuesday, 18 November 2025 372
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Prayer Room
8:00am - 6:00pm CST Sunday, 16 November 2025 372
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Prayer Room
8:00am - 7:00pm CST Wednesday, 19 November 2025 372
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Prayer Room
8:00am - 9:00pm CST Monday, 17 November 2025 372
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Paper
Algorithms
Applications
Architectures & Networks
Livestreamed
Recorded
TP
Paper
Programming Frameworks
Livestreamed
Recorded
TP
Paper
Post-Moore Computing
Livestreamed
Recorded
TP
Exhibitor Forum
Quantum & Other Post Moore Computing Technologies
Livestreamed
Recorded
TP
XO/EX
Exhibitor Forum
Quantum & Other Post Moore Computing Technologies
Livestreamed
Recorded
TP
XO/EX
Community Engagement and Support
Quiet Room
8:00am - 6:00pm CST Sunday, 16 November 2025 361
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Quiet Room
8:00am - 7:00pm CST Tuesday, 18 November 2025 361
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Quiet Room
8:00am - 9:00pm CST Monday, 17 November 2025 361
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Quiet Room
8:00am - 7:00pm CST Wednesday, 19 November 2025 361
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Quiet Room
8:00am - 12:00pm CST Friday, 21 November 2025 361
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Community Engagement and Support
Quiet Room
8:00am - 6:00pm CST Thursday, 20 November 2025 361
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Attendee Services
Awards and Award Talks
Livestreamed
Recorded
TP
XO/EX
Keynote
Keynote
Livestreamed
Recorded
TP
W
TUT
XO/EX
Paper
HPC for Machine Learning
State of the Practice
System Software and Cloud Computing
Livestreamed
Recorded
TP
SCinet
Not Livestreamed
Not Recorded
SCinet
Not Livestreamed
Not Recorded
Exhibitor Forum
Software Tools
Livestreamed
Recorded
TP
XO/EX
Students@SC
Speed Mentoring
11:00am - 1:30pm CST Wednesday, 19 November 2025 120-127
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Student Cluster Competition
Student Cluster Competition
9:00am - 5:30pm CST Tuesday, 18 November 2025 SCC Booth
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Student Cluster Competition
Student Cluster Competition
9:00am - 5:30pm CST Wednesday, 19 November 2025 SCC Booth
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Student Cluster Competition
Student Cluster Competition Poster Display
9:00am - 5:30pm CST Wednesday, 19 November 2025 SCC Booth
Not Livestreamed
Not Recorded
TP
XO/EX
Student Cluster Competition
Student Cluster Competition Poster Display
7:00pm - 9:00pm CST Monday, 17 November 2025 SCC Booth
Not Livestreamed
Not Recorded
TP
XO/EX
Student Cluster Competition
Student Cluster Competition Poster Display
9:00am - 5:30pm CST Tuesday, 18 November 2025 SCC Booth
Not Livestreamed
Not Recorded
TP
XO/EX
IndySCC
Student Cluster Competition
Student Cluster Competition/IndySCC Kickoff
7:00pm - 8:00pm CST Monday, 17 November 2025 IndySCC Booth
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Early Career
Students@SC
Students@SC & Early Career Program Alumni Event
5:00pm - 7:00pm CST Wednesday, 19 November 2025 The Party Zone
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Students@SC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Students@SC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Students@SC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Students@SC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Students@SC
Not Livestreamed
Not Recorded
TP
W
TUT
XO/EX
Paper
System Software and Cloud Computing
Livestreamed
Recorded
TP
Paper
Architectures & Networks
System Software and Cloud Computing
Livestreamed
Recorded
TP
Reception
Technical Program Reception
6:30pm - 9:30pm CST Thursday, 20 November 2025 St. Louis Science Center
TP
Attendee Services
Awards and Award Talks
Livestreamed
Recorded
TP
Birds of a Feather
Community Meetings
Livestreamed
Recorded
TP
XO/EX
Invited Talk
Power Use Monitoring & Optimization
Livestreamed
Recorded
TP
Workshop
Partially Livestreamed
Partially Recorded
TP
W