Search Program

12:30pm - 2:00pm CST Sunday, 16 November 2025 275

Livestreamed

Recorded

Workshop

12:30pm - 2:00pm CST Monday, 17 November 2025 267

Livestreamed

Recorded

Workshop

12:50pm - 2:20pm CST Sunday, 16 November 2025 266

Livestreamed

Recorded

Workshop

12:30pm - 2:00pm CST Sunday, 16 November 2025 231

Livestreamed

Recorded

Workshop

12:30pm - 2:00pm CST Sunday, 16 November 2025 230

Livestreamed

Recorded

Workshop

12:30pm - 2:00pm CST Monday, 17 November 2025 275

Livestreamed

Recorded

Workshop

12:30pm - 2:00pm CST Monday, 17 November 2025 265

Livestreamed

Recorded

Workshop

12:30pm - 2:00pm CST Sunday, 16 November 2025 267

Livestreamed

Recorded

Workshop

12:20pm - 2:00pm CST Sunday, 16 November 2025 232

Livestreamed

Recorded

Workshop

12:30pm - 2:00pm CST Monday, 17 November 2025 260

Livestreamed

Recorded

Workshop

12:30pm - 2:00pm CST Monday, 17 November 2025 264

Livestreamed

Recorded

Workshop

12:20pm - 2:00pm CST Sunday, 16 November 2025 240

Livestreamed

Recorded

Workshop

12:30pm - 2:00pm CST Sunday, 16 November 2025 241

Livestreamed

Recorded

Workshop

12:30pm - 2:00pm CST Sunday, 16 November 2025 261

Livestreamed

Recorded

Workshop

12:55pm - 2:25pm CST Monday, 17 November 2025 261

Livestreamed

Recorded

Workshop

12:30pm - 2:00pm CST Monday, 17 November 2025 276

Partially Livestreamed

Partially Recorded

Workshop

12:30pm - 2:00pm CST Monday, 17 November 2025 266

Livestreamed

Recorded

Workshop

12:30pm - 2:00pm CST Monday, 17 November 2025 274

Livestreamed

Recorded

Workshop

12:30pm - 2:00pm CST Monday, 17 November 2025 230

Data Analytics

High Performance I/O, Storage, Archive, & File Systems

Storage

Livestreamed

Recorded

Workshop

12:30pm - 2:00pm CST Monday, 17 November 2025 231

Performance Evaluation, Scalability, & Portability

Livestreamed

Recorded

Workshop

12:30pm - 2:00pm CST Monday, 17 November 2025 232

Livestreamed

Recorded

Workshop

Lustre Community BoF: Lustre in HPC, AI, and the Cloud

12:30pm - 2:00pm CST Sunday, 16 November 2025 260

Livestreamed

Recorded

Birds of a Feather

5:15pm - 6:45pm CST Tuesday, 18 November 2025 276

Storage

Livestreamed

Recorded

XO/EX

DescriptionLustre is the leading open-source and open-development file system for HPC. Eight of the top 10 and 64% of the top 100 systems on the most recent Top500 list use Lustre. It is a community-developed technology with contributors from around the world. Lustre supports many HPC infrastructures such as research, finance, energy, and manufacturing. Lustre clients are available for instruction set architectures such as x86, POWER, and ARM.

At this BoF, Lustre users, developers, administrators, and solution providers will gather to ask questions and discuss recent Lustre developments and challenges, including the role of Lustre in AI and its use in cloud environments. People new to Lustre will get a feel for the power of this HPC shared file system.

Best Poster Presentations (Research, ACM SRC Grad/Undergrad)

Luthier: A Dynamic Binary Instrumentation Framework Targeting AMD GPUs

4:30pm - 4:45pm CST Wednesday, 19 November 2025 230

Research & ACM SRC Posters

DescriptionIn this poster we present Luthier, the first open-source dynamic binary instrumentation framework targeting AMD GPUs. We highlight key features of our framework, including example use cases and runtime overhead comparison with NVIDIA’s NVBit. We also go over some major enhancements under development in the latest version of Luthier that support more of the growing family of AMD GPUs and additional instrumentation scenarios.

Art of HPC

Machine Dreaming

8:00am - 6:00pm CST Sunday, 16 November 2025 Art of HPC - Plaza Lobby

Art of HPC

Not Livestreamed

Not Recorded

TUT

XO/EX

DescriptionMachine Dreaming is an interactive installation that explores the perceptual and social dynamic between humans and AI, highlighting the tensions and intimacies that surface in the experience of being “seen” by these systems. The video depicts a viewer interacting with the installation, watching as their image gradually transforms, shifting into AI’s “perception” of them, inviting reflection on what it means to be seen and interpreted by an LLM.

By using real-time video and AI-generated visuals, Machine Dreaming initially reveals a recognizable reflection of the viewer themselves. As the viewer moves within the space they notice their image gradually morph into an alien-like intermediary state, to plant-like forms that move and transform in real time. Though semi-ambiguous forms, the work allows for both abstraction and coherence—as plants can appear both alien and organic, yet beautiful simultaneously—lending itself to exploring unfamiliar yet resonant representations of the self.

The work simultaneously allows the viewer to feel in control while also being subtly guided by the system as the projection shifts back and forth from them into alien plant-like forms based on their movements, gestures, and interaction with the installation.

By foregrounding this interplay between human presence and algorithmic interpretation, Machine Dreaming highlights the conditions of being seen and mediated by intelligent systems. The work prompts viewers to reflect on their relationship to AI and how perception, agency, and representation are negotiated with these systems.

Workshop

Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory

3:30pm - 3:54pm CST Monday, 17 November 2025 240

Livestreamed

Recorded

DescriptionDeep learning recommendation models (DLRMs) rely on massive embedding tables that often exceed GPU memory capacity. Tiered memory offers a cost-effective solution but creates challenges for managing irregular access patterns. We introduce RecMG, an ML-guided caching and prefetching system tailored for DLRM inference. RecMG uses separate models for short-term reuse and long-range prediction, with a novel differentiable loss to improve accuracy. In large-scale deployments, RecMG reduces on-demand fetches by up to 2.8× and cuts inference time by up to 43%.

Art of HPC

Magnetic Field of a Star Formation Simulation

8:00am - 6:00pm CST Sunday, 16 November 2025 Art of HPC - Plaza Lobby

Art of HPC

Not Livestreamed

Not Recorded

TUT

XO/EX

DescriptionThis star formation simulation data shows how the rotating gas winds up the magnetic field around a forming protostar. All protostellar systems have angular momentum/rotation—this is how accretion disks/planetary systems form. Various details of the magnetic field geometry visualized here in response to this rotation are interesting in protostar research.

Paper

Make Updates Faster: A Fast Multi-Stripe Updates Framework in Erasure-Coded Storage Clusters

4:37pm - 5:00pm CST Thursday, 20 November 2025 275

System Software and Cloud Computing

Livestreamed

Recorded

DescriptionErasure coding is widely adopted to maintain data reliability, yet it introduces a significant update penalty. We analyze real-world traces and observe several challenges that are not addressed by existing studies, which thereby restrict the performance gains. We propose FastUpdate, an efficient multi-stripe updates framework that assists existing update schemes for fast updates. FastUpdate comprises three key designs: (1) it perceives the update locality and carefully merges multiple update requests accessing the same stripe to reduce the incurred network traffic; (2) it abstracts the existing update schemes into collector selection and tree construction, greedily generates the update solution for each stripe to balance the transmission load across nodes; (3) it dynamically schedules appropriate stripes to update in heterogeneous and dynamic networks to fully saturate the bandwidth resources. Comprehensive evaluations verify the effectiveness of FastUpdate on Alibaba ECS. It can increase the update throughput by 16.15%-88.71% for various update schemes.

Workshop

Making the Self-Driving “Net” Work: Developing Production-Ready ML Models for Self-Driving Networks

9:10am - 10:00am CST Sunday, 16 November 2025 266

Livestreamed

Recorded

DescriptionThe vision of self-driving networks (AIOps) hinges on the ability to develop production-ready machine learning models—models that are not only performant but also generalizable, robust, and trustworthy. Yet, most ML artifacts in networking today remain underspecified, suffering from shortcut learning, spurious correlations, and out-of-distribution failures rooted in data deficiencies. This talk traces our journey toward addressing these challenges by closing the loop between model analysis and data generation. I will present a closed-loop ML pipeline—composed of Trustee for model analysis and NetUnicorn, NetReplica, and NetGent for programmable data generation—that iteratively fixes underspecification by generating "better" data. Building on this foundation, I will discuss our efforts toward developing network foundation models (NFMs) that leverage self-supervised learning on large-scale network telemetry to unify diverse tasks, and toward reasoning about the generalizability of these NFMs. Finally, I will highlight emerging opportunities for using these programmable substrates to reimagine network operations and network measurements—solving unexplored learning problems in networking and revisiting previously explored ones with a fresher perspective.

Doctoral Showcase

Managing Heterogeneous Topologies and Understanding Their Impact on Performance

11:00am - 11:15am CST Thursday, 20 November 2025 230

Research & ACM SRC Posters

Livestreamed

Recorded

DescriptionTo solve increasingly complex problems more efficiently, modern HPC systems feature highly heterogeneous components: CPUs, GPUs, and recently QPUs (quantum processing units), each with a unique, complex compute topology. The massive parallelism of GPUs, combined with emerging memory technologies on CPUs and GPUs, makes the memory topologies increasingly heterogeneous, complex, and dynamically configurable. Understanding these topological details, especially regarding available memory and its usage, is essential to operating the systems and applications efficiently.

This thesis presents a framework targeting several fundamental gaps in the currently available research and tooling: sys-sage, MT4G, GPUscout, and Mitos modeling. At the core, the sys-sage library offers a unified approach to maintaining static and dynamic topological information from different sources and APIs. Its universal architecture handles CPUs, GPUs, and QPUs alike. MT4G provides an otherwise unavailable, vendor-agnostic, and complete report on GPU memory topologies, integrable with sys-sage. GPUs' massive parallelism amplifies the potential performance penalties of improper cache and memory usage. Therefore, GPUscout identifies root causes of frequently occurring memory-related bottlenecks, helping users efficiently utilize the complex memory subsystem of GPUs. Finally, to address emerging memory technologies, such as CXL.mem, this thesis presents a novel data access modeling workflow as an extension of Mitos. The model predicts the performance impact of CXL.mem-based cross-node shared-buffer data exchange as an alternative to point-to-point MPI communication. Altogether, these tools capture topologies of HPC systems and provide missing insights into application data transfer behavior.

Tutorial

Managing HPC Software Complexity with Spack

8:30am - 5:00pm CST Monday, 17 November 2025 127

Livestreamed

Recorded

TUT

DescriptionModern scientific software stacks rely on thousands of packages, from low-level libraries in C, C++, and Fortran to higher-level tools in Python and R. Scientists must deploy these stacks across diverse environments, from personal laptops to supercomputers, while tailoring workflows to specific tasks. Development workflows often require frequent rebuilds, debugging, and small-scale testing for rapid iteration. In contrast, preparing applications for large-scale HPC production involves performance-critical libraries (e.g., MPI, BLAS, LAPACK) and machine-specific optimizations to maximize efficiency. Managing these varied requirements is challenging. Configuring software, resolving dependencies, and ensuring compatibility can hinder both development and deployment. Spack is an open-source package manager that simplifies building, installing, and customizing HPC software stacks. It offers a flexible dependency model, Python-based syntax for package recipes, and a repository of over 8,500 packages maintained by more than 1,500 contributors. Spack is widely adopted by researchers, developers, cloud platforms, and HPC centers worldwide. This tutorial introduces Spack’s core capabilities, including installing and authoring packages, configuring environments, and deploying optimized software on HPC systems. Attendees will gain foundational skills for automating routine tasks and acquire advanced knowledge to address complex use cases with Spack.

Paper

MANS: Efficient and Portable ANS Encoding for Multi-Byte Integer Data on CPUs and GPUs

2:37pm - 3:00pm CST Wednesday, 19 November 2025 275

Data Analytics, Visualization & Storage

Livestreamed

Recorded

DescriptionLossless compression is a classic technique for reducing data storage and transmission requirements. Asymmetric numeral systems (ANS) is a high-throughput, high-ratio lossless compression algorithm, but it lacks effective support for multi-byte data and cross-platform compatibility.

To address this issue, we propose an adaptive data mapping (ADM) scheme, which maps multi-byte integer data into single-byte space based on the data's characteristics, improving the compression ratio of ANS while maintaining low encoding redundancy. We also optimize the ADM algorithm and the ANS encoder for GPU and CPU architectures, respectively, and combine them to create an efficient and portable ANS encoding method for multi-byte integer data, called MANS.

Experimental results show that MANS improves compression ratios by an average of 1.24$\times$, achieves 870.27MB/s throughput on CPUs, and delivers up to 288.45$\times$ and 135.86$\times$ speedups on an NVIDIA A100 and an AMD MI210 GPU compared to the CPU version—demonstrating its efficiency and portability.

Research and ACM SRC Posters

Massively Parallel Bayesian Inference Framework for GPU Supercomputers: Application to Estimation of Coseismic Fault Slip

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionWe present a massively parallel Bayesian inference framework for GPU supercomputers, demonstrated in coseismic fault slip estimation. Bayesian inference, a robust method for inverse analysis, often relies on Monte Carlo sampling with over 100,000 forward simulations, making large-scale applications computationally intensive. A previous state-of-the-art implementation for the CPU-based supercomputer Fugaku was unsuitable for GPUs due to numerous small, imbalanced computations. We redesigned the algorithm to enforce uniform, dense computation and employed Multi-Process Service (MPS) to maximize GPU utilization. On a single node of the GPU-based supercomputer Miyabi with an NVIDIA GH200 Grace Hopper Superchip, the method achieved 13.40 TFLOPS (20% of Tensor Cores FP64 peak) and scaled to 128 nodes with 92.3% efficiency. Compared with the original CPU implementation on Fugaku, it achieved a 42.1-fold speedup per node and reduced energy-to-solution to 18.8%. The methodology provides a general guide for porting Bayesian inference and similar applications to GPU-based environments.

Research and ACM SRC Posters

Massively Parallel GPU Rasterizer for Next-Generation Computational Lithography

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionIn electronic design automation (EDA), traditional rasterization algorithms suffer from poor speedup and accuracy when managing large and complex semiconductor designs, limiting efficiency in optical proximity correction (OPC) processes. To overcome these challenges, we developed a GPU-based rasterization algorithm that employs floating-point precision and tile-based, warp-cooperative strategies. This approach significantly boosts performance, achieving up to 290x speedup for Manhattan shapes and 45x for curvilinear shapes over conventional CPU methods, while maintaining errors below 1% against CPU results. Our solution enhances both computational efficiency and geometric accuracy in nanometer-scale tasks. During the poster session, we will present our methodology, showcase performance results, and illustrate how advanced GPU optimization effectively addresses the limitations of traditional rasterization workflows in EDA.

Tutorial

Mastering AI Workflows Using ACCESS Pegasus

8:30am - 12:00pm CST Monday, 17 November 2025 126

Livestreamed

Recorded

TUT

DescriptionThis tutorial is designed for both users and facilitators who want to deepen their understanding of modeling AI pipelines in a portable, reproducible way using scientific workflows and application containers. Scientific workflows are essential for managing complex computations: they define the dependencies between steps in data analysis and simulation pipelines, automate execution, and capture provenance information critical for verifying results and ensuring reproducibility. Workflows also promote sharing and reuse. Participants will learn to use Pegasus, a leading scientific workflow management system now integrated into the ACCESS Support offerings (https://support.access-ci.org/pegasus). ACCESS Pegasus provides a fully hosted environment built on Open OnDemand and Jupyter, enabling users to develop and run workflows directly from a web browser. Workflow execution is powered by HTCondor Annex, allowing jobs to run across multiple ACCESS resources, including PSC Bridges-2, SDSC Expanse, Purdue Anvil, NCSA Delta, and IU Jetstream2. Through hands-on exercises in a hosted Jupyter Notebook, participants will work through an example LLM-RAG (large language model retrieval-augmented generation) workflow that leverages GPUs across ACCESS resources. Along the way, the tutorial will address key challenges and best practices across the entire workflow life cycle.

Tutorial

Mastering OpenMP Tasking: From Start to Free-Agent Tasks

8:30am - 12:00pm CST Sunday, 16 November 2025 125

Livestreamed

Recorded

TUT

DescriptionOpenMP is the leading, portable, and widely supported directive-based programming model. Already in 2008, OpenMP introduced tasking to support the creation of composable parallel software blocks and the parallelization of irregular algorithms. Developers usually find OpenMP easy to learn. However, mastering the tasking concept requires a change in the way developers reason about the structure of their code and how they expose its parallelism. Our tutorial has been designed for the SC audience to learn about the tasking concept in detail and to understand code patterns as solutions to many common problems. Throughout all topics, we showcase the additions brought with OpenMP 5.x and OpenMP 6.0 and explain how to adopt codes. For this tutorial, we assume attendees understand basic parallelization concepts and know the fundamentals of OpenMP. First, we introduce the OpenMP tasking language features in detail and then focus on performance aspects, such as introducing cutoff mechanisms, exploiting task dependencies, and preserving locality. The new free-agent tasks introduced with OpenMP 6.0 are covered in detail. All topics are accompanied by extensive case studies. If accepted as a full-day tutorial, we will include hands-on sessions.

Flash Session

Material and Vendor Selection for Reliable Thermal Cooling Hose Assemblies

4:15pm - 4:30pm CST Tuesday, 18 November 2025 Booth 2638 - Flash Session

Not Livestreamed

Not Recorded

XO/EX

DescriptionJoin Jeff Berger, an expert with over 35 years in product design, applications, and product management specializing in rubber hose and fittings, as he shares critical best practices for thermal cooling hose and tubing assemblies in data centers. This is perfect for engineers, IT administrators, and decision-makers who want to learn from Parker Hannifin’s extensive industry leadership and innovation in fluid conveyance solutions, guiding them through the complexities of material selection, component integration, and vendor evaluation.

This concise yet impactful session will explain how selecting the right hose materials and fittings can significantly enhance thermal management efficiency, reduce downtime, and extend equipment lifespan. Attendees will gain actionable insights on balancing performance, durability, and cost-effectiveness in hose assembly design tailored for demanding data center conditions.

Leveraging Parker Hannifin’s proven expertise and global presence, Jeff will provide real-world examples and technical considerations that empower your team to optimize cooling infrastructure reliability. Whether you’re a CIO, engineer, or policy maker focused on sustainability and operational excellence, this session offers valuable knowledge to support critical infrastructure decisions in today’s data-driven world.

Paper

Matrix Is All You Need: Rearchitecting Quantum Chemistry to Scale on AI Accelerators

4:14pm - 4:36pm CST Thursday, 20 November 2025 260-267

Algorithms

Livestreamed

Recorded

DescriptionScientific computing remains misaligned with the execution paradigm of modern AI accelerators, which favor structured, low-precision matrix operations. Quantum chemistry exemplifies this gap, with irregular computations, fragmented utilization, and limited support for high-complexity systems.

We present Mako, a matrix-centric system that rearchitects quantum chemistry to scale on AI accelerators. Mako comprises three components: KernelMako reformulates ERI evaluation into composable MatMul pipelines using CUTLASS; QuantMako introduces physics-informed quantization to exploit low-precision potential; and CompilerMako automates kernel fusion and architecture-tuned specialization.

Mako achieves up to ~20× speedup on high-angular-momentum basis sets. It sustains over 90% parallel efficiency on a single node and 70% across 64 GPUs, completing the accurate simulation of ubiquitin (1,231 atoms, def2-TZVP) from days to just 58 minutes. Mako demonstrates how scientific workloads can be restructured to inherit the scalability of deep learning—repurposing AI accelerators and their ecosystems to scale quantum chemistry beyond traditional limits.

Paper

MaverIQ: Fingerprint-Guided Extrapolation and Fragmentation-Aware Layering for Intent-Based LLM Serving

10:30am - 10:52am CST Thursday, 20 November 2025 261-262-265-266

HPC for Machine Learning

System Software and Cloud Computing

Livestreamed

Recorded

DescriptionLarge language models (LLMs) are becoming ubiquitous across industries, where applications demand diverse user intents. To meet those intents, developers must manually explore combinations of parallelism and compression techniques that affect resource usage, latency, cost, and accuracy. Prior works automate this process but incur high profiling costs, inefficient GPU use, or ignore diverse user-intents. We build MaverIQ, an automated intent-based LLM inference serving system that translates user-expressed intents into LLM deployment configurations and deploys the chosen configurations to improve operational cost for the provider. To reduce profiling costs, MaverIQ introduces and observes LLM fingerprint—a compact proxy of the LLM—under a few configurations, and uses novel analytical models to extrapolate the observed fingerprint data to the full LLM. To cut provider costs, we exploit our key observation that uneven LLM layer distribution minimally affects inference latency. MaverIQ cuts profiling costs by 7-15× and provider costs by 3.8-8.3× while best fulfilling user-intents.

Flash Session

Maximizing Cooling Efficiency in High-Density Applications (with AxiEco 200)

11:00am - 11:15am CST Thursday, 20 November 2025 Booth 2638 - Flash Session

Not Livestreamed

Not Recorded

XO/EX

DescriptionIn today’s technology landscape, high-density electronic systems demand efficient, reliable cooling solutions that don’t compromise on space or energy consumption. The ebm-papst AxiEco 200 fan is engineered to meet these challenges head-on, delivering exceptional airflow performance and energy efficiency in a compact design. This session will explore how the AxiEco 200 optimizes cooling for high-density environments (like a rear door heat exchanger), reducing operational costs while enhancing system reliability. Attendees will gain insights into the fan’s innovative features, real-world application benefits, and how it supports sustainable, high-performance cooling in demanding scenarios.

Birds of a Feather

Meeting HPC Community Needs: How ACM SIGHPC, IEEE TCPP/TCHPC, and SIAG-SC/CSE Join Efforts to Engage Communities and Deliver Services

12:15pm - 1:15pm CST Thursday, 20 November 2025 124

Community Meetings

Livestreamed

Recorded

XO/EX

DescriptionCome and learn from the leaders of the professional societies focused on HPC from ACM, IEEE, and SIAM! Your SIGHPC, TCPP, TCHPC, SIAG-SC, and SIAG-CSE representatives invite SC25 participants to join this cross-society BoF to learn about the opportunities these societies provide. Each organization recognizes outstanding achievements in HPC with society awards, offers travel grants to students and early-career professionals, supports initiatives focused on education and outreach, and promotes diversity, equity, and inclusion. These representatives are also seeking feedback from the community to help improve their initiatives and to learn from each other.

Workshop

MEMO’25: International Workshop on Memory System, Management and Optimization

2:00pm - 2:05pm CST Sunday, 16 November 2025 274

Livestreamed

Recorded

DescriptionThe increasing disparity between computing speed and memory speed, commonly referred to as the memory wall, remains a critical and enduring challenge in the high performance computing and analytics community. This workshop aims to bring together computer science and computational science researchers, from industry, government labs and academia, concerned with the challenges of efficiently using existing and emerging memory systems. The term "performance" for memory systems is general, and includes latency, bandwidth, power consumption and reliability from the aspect of hardware memory technologies to how it is manifested in the application performance.

Research and ACM SRC Posters

Memory-Efficient CFD Based on MPS: Effective One-Billion-Cell Resolution on a Single Node

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionWe investigate matrix product states (MPS), a tensor-network compression method, as a memory-efficient representation of flow variables. A three-dimensional incompressible Navier-Stokes solver is implemented entirely in MPS form and is applied to canonical flow problems. Results show substantial memory savings and the ability to perform a $1024^3$ simulation on a single GPU. Performance analysis revealed new bottlenecks, particularly bond-dimension growth during nonlinear operations,
suggesting novel optimization strategies are needed to fully realize MPS-based CFD at extreme scales.

Community Engagement and Support

Mental Health and AI: Monitoring Anxiety and Stress Through Wearables

5:30pm - 6:00pm CST Tuesday, 18 November 2025 Booth 3537 - SCinet Theater

Not Livestreamed

Not Recorded

XO/EX

DescriptionThis presentation showcases work from the Viva Bem Hub on using wearables for health and well-being, with a focus on mental health. We explore how AI can leverage physiological data to monitor and support an individual's mental state.

Paper

MetoHash: A Memory-Efficient and Traffic-Optimized Hashing Index on Hybrid PMem-DRAM Memories

10:52am - 11:15am CST Wednesday, 19 November 2025 260-267

Applications

Architectures & Networks

BSP

Livestreamed

Recorded

DescriptionPersistent memory (PMem) brings new design considerations in realizing high-performance and scalable hashing indexes. We uncover that existing hashing indexes for PMem still suffer from traffic amplification and memory inefficiency. We present MetoHash, a memory-efficient and traffic-optimized hashing index on hybrid PMem-DRAM memories. MetoHash proposes a three-layer index structure spanning across CPU caches, DRAM, and PMem for data management. It aggregates the incoming key-value items in CPU caches for fast inserts, which are then arranged in DRAM and flushed to PMem, to eliminate traffic amplification. MetoHash also uses fingerprinting to reduce unnecessary probings over PMem and removes duplicate items during bucket relocations. We implement MetoHash on PMem with persistent and volatile CPU caches, and show that compared to state-of-the-art hashing indexes for PMem, MetoHash improves the throughput by 86.1%–257.6% under various workloads.

SCinet

MetrANOVA: An Open Source Software Stack for Sharing Network Telemetry

2:20pm - 2:40pm CST Tuesday, 18 November 2025 Booth 3537 - SCinet Theater

Not Livestreamed

Not Recorded

Workshop

Microcredentials for Open Hardware and HPC Workforce Development: The Openchip Approach with RISC-V Ecosystem

3:30pm - 3:45pm CST Sunday, 16 November 2025 261

Livestreamed

Recorded

DescriptionThe semiconductor and high-performance computing (HPC) sectors face a parallel and urgent challenge: a widening skills gap amid rapid technological evolution, particularly with the rise of open hardware platforms such as RISC-V. As global initiatives like the EU Chips Act and national sovereignty strategies emphasize workforce development, traditional academic programs remain too rigid and slow to adapt. This paper presents a microcredential-based strategy for agile, modular, and industry-validated training focused on open hardware and full-stack HPC system design. Through the lens of Openchip’s approach, we explore how co-designed curricula—especially around vector-based RISC-V architectures—can modernize education to reflect emerging HPC paradigms. We also examine curricular gaps, the potential alignment with the TCPP curriculum initiative, and propose a roadmap for embedding microcredentials into scalable, open, and sovereign HPC education ecosystems.

Paper

Microscopic-Level Mouse Whole Cortex Simulation Composed of 9 Million Biophysical Neurons and 26 Billion Synapses on the Supercomputer Fugaku

3:52pm - 4:15pm CST Thursday, 20 November 2025 263-264

Applications

Livestreamed

Recorded

DescriptionUnderstanding the brain remains a major scientific challenge due to its complex structure and function. Unlike artificial neural networks, the biological brain features diverse biophysical properties essential to its function. Building an accurate digital replica using extensive anatomical and physiological data, which are available in standardized public databases, has emerged as a promising approach. In this study, a lightweight biophysical neuron simulator was developed and optimized to the supercomputer Fugaku. Good strong scaling was demonstrated in a benchmark model up to 152,064 compute nodes with 7.13 petaflops performance. In a more realistic scenario, a whole cerebral cortex of a mouse, consisting of 9 million biophysical neurons and 26 billion synapses, was simulated on the full-scale Fugaku with 145,728 nodes. These suggest that the present high performance computing technology is ready to support the construction of a digital replica of the whole mammalian brain.

Paper

Million-Atom Ab Initio Electron Dynamics: Discontinuous Galerkin Real-Time Time-Dependent Density Functional Theory

1:52pm - 2:14pm CST Thursday, 20 November 2025 263-264

Applications

GBC

Livestreamed

Recorded

DescriptionOver the past decades, first-principles real-time time-dependent density functional theory (RT-TDDFT) simulations have been limited to systems with only thousands of atoms. We propose a novel method based on the discontinuous Galerkin adaptive local basis, significantly reducing global communication in RT-TDDFT. We further introduce a tensor compression technique that leverages basis locality to avoid repeated evaluation of multi-center integrals in hybrid functionals, greatly reducing computational cost. To overcome the projection bottleneck in our basis sets, we design a fused GEMM-Reduce operation that achieves several times higher floating-point efficiency than standard BLAS combination. Our implementation reaches 34.8\% of theoretical peak performance on 524,288 CGs of the New Sunway supercomputer and simulates electronic dynamics of systems with over one million atoms for both local-semi-local and hybrid functionals. This work improves computational scale by two orders of magnitude, opening new possibilities for exploring ultrafast dynamics in large-scale materials and nanophotonic devices.

Paper

Minimizing Power Waste in Heterogeneous Computing via Adaptive Uncore Scaling

2:14pm - 2:36pm CST Tuesday, 18 November 2025 260-267

Architectures & Networks

System Software and Cloud Computing

Livestreamed

Recorded

DescriptionHigh performance computing (HPC) systems are essential for scientific discovery and engineering innovation. However, their growing power demands pose significant challenges, particularly as systems scale to the exascale level. Prior uncore frequency tuning studies primarily focused on conventional HPC workloads on homogeneous systems. As HPC advances toward heterogeneous computing, integrating diverse GPU workloads on heterogeneous systems, it is crucial to revisit and enhance uncore scaling. Our investigation reveals that uncore frequency decreases only when CPU power approaches its TDP (thermal design power)—an uncommon scenario in GPU-dominant applications—resulting in power waste. To address this, we present MAGUS, a user-transparent uncore scaling runtime for heterogeneous computing. Effective uncore tuning is complex, requiring dynamic detection of application execution phases that affect uncore utilization. Moreover, an efficient runtime should introduce minimal overhead. MAGUS employs key techniques such as memory throughput monitoring and prediction, and handling frequent phase transitions to tackle these challenges.

Paper

MISA-AKMC: Achieve Kinetic Monte Carlo Simulation of 20 Quadrillion Atoms on GPU Clusters

11:14am - 11:36am CST Thursday, 20 November 2025 263-264

Applications

Livestreamed

Recorded

DescriptionThe Atomic Kinetic Monte Carlo (AKMC) method provides insights into the macroscopic behavior of materials through atomistic-level simulations and finds broad applications in materials science innovation. Improving simulation scale and performance remains a consistent focus in the development of parallel AKMC software. We port the AKMC software to GPU clusters. To alleviate the memory pressure in large-scale complex system simulations, we redesign the data layout and propose the lattice data compression and vacancy data decompression algorithms. Additionally, we propose a multi-level pipeline scheme combined with an on-demand communication forwarding and merging strategy to reduce data transfer and communication overhead. Compared to state-of-the-art KMC software, MISA-AKMC\footnotemark achieves a 10.41-fold improvement in computational throughput and a 52.07-fold expansion in simulation scale. We implement the first true micrometer-scale AKMC simulation involving 20 quadrillion atoms on GPU clusters. MISA-AKMC achieves 96.03% parallel efficiency in weak scaling and 85.29% in strong scaling on 16,000 GPUs.

Birds of a Feather

Mission Accomplished: Arm-Based Systems Are Here and They Are 'Boring.' What Now? What Next?

12:15pm - 1:15pm CST Thursday, 20 November 2025 130

State of the Practice

Livestreamed

Recorded

XO/EX

DescriptionOver a decade ago, Arm was known for co-design, innovative architectures, and disruptive change in a x86-dominated software landscape. Today, Arm-based supercomputers operate globally, supporting advanced research across diverse fields. Has Arm become really “boring”? While accelerated computing is now essential, CPU-only systems remain important. The future of Arm in HPC raises questions: What new breakthroughs can Arm technologies enable? Is Arm still exciting for technologists and developers? This BoF gathers leaders who have helped advance the Arm HPC ecosystem to address what else is yet to be done to achieve the next maturity level.

Research and ACM SRC Posters

Mitigating I/O Bottlenecks in LiDAR Pipelines by Directly Merging Neural Decompression and Semantic Segmentation

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionThe increasing volume of high-resolution LiDAR data poses a significant I/O bottleneck in large-scale analysis and high-performance computing pipelines due to costly intermediary data storage and retrieval. We introduce a novel, end-to-end framework that addresses this issue by proposing the first unified RENO-based neural autoencoder with a Point Transformer v3 (PTV3) segmentation backbone. This integrated architecture directly feeds the high rank feature tensors of the RENO decoder into the segmentation backbone, completely bypassing the need for costly intermediary file storage and I/O operations. Evaluated on the German Outdoor and Offroad (GOOSE) dataset, this approach enables direct semantic analysis on compressed data. Our results demonstrate that this method significantly reduces storage overhead, saving 29.9 GB per 13,076 point clouds and 2.7 GB per minute of LiDAR operation, all while maintaining the accuracy of semantic segmentation. This unified framework represents a major step towards efficient, real-time processing of large-scale point cloud datasets.

Research and ACM SRC Posters

Mixed Compute Environments with OpenCHAMI

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionThere is a growing need for workloads that don’t follow a traditional HPC workflow. Many of these workloads are developed with Kubernetes as the workload manager rather than an HPC-focused one such as Slurm. Mixing different workloads presents a challenge for a few reasons: The demand for either type of resource may fluctuate, so static assignments of Kubernetes or Slurm as the WLM may result in idle resources; the desire for one WLM or another may increase, so extra resources will need to be assigned and moved.

To address this demand, we utilized OpenCHAMI, an open-source system management platform for deploying, managing, and scaling HPC clusters. With OpenCHAMI, we created “spread”: a command line tool that configures nodes’ workload environments across the cluster. We support fast node booting using kexec and a dynamic base of workload environments to swap between, including Slurm and Kubernetes.

Birds of a Feather

Mixed Feelings About Mixed Precisions

12:15pm - 1:15pm CST Tuesday, 18 November 2025 263-264

Algorithms

Livestreamed

Recorded

XO/EX

DescriptionWhat if we have been oversolving in computational science and engineering for decades? Are low-precision arithmetic formats only for AI workloads? How can HPC applications exploit mixed-precision hardware features? This BoF invites the HPC community at large interested in applying mixed precisions into their workflows and discussing the impact on time-to-solution, memory footprint, storage, data motion, and energy consumption. Experts from scientific applications/software libraries/hardware architectures will briefly provide the context on this trendy topic, share their own perspectives, and mostly engage with the audience via a set of questions, while gathering feedback to define a roadmap moving forward.

Workshop

Mixed-Precision For Energy Efficient Computations

2:20pm - 2:25pm CST Monday, 17 November 2025 276

Partially Livestreamed

Partially Recorded

DescriptionAs simulations become more realistic, the pursuit of higher accuracy results in extended computation times and substantial energy consumption. This study explores mixed-precision computing as a promising strategy to address these challenges, leveraging computer arithmetic tools to optimize performance. To do so, we used Reactor Simulator and LULESH benchmarks as case studies to evaluate the potential of mixed-precision strategies to reduce both time-to-solution and energy-to-solution. For Reactor Simulator, we achieved more then 30 % reduction in both metrics without compromising accuracy. Similarly, results for LULESH demonstrated improvements of up to 31.5 % in time-to-solution and 25.6 % savings in energy-to-solution.

Workshop

Mixed-Precision Performance Portability of FFT-Based GPU-Accelerated Algorithms for Block-Triangular Toeplitz Matrices

2:30pm - 3:00pm CST Monday, 17 November 2025 231

Performance Evaluation, Scalability, & Portability

Livestreamed

Recorded

DescriptionThe hardware diversity in leadership-class computing facilities, alongside the immense performance boosts from today's GPUs when computing in lower precision, incentivizes scientific HPC workflows to adopt mixed-precision algorithms and performance portability models. We present an on-the-fly framework using hipify for performance portability and apply it to FFTMatvec - an HPC application that computes matrix-vector products with block-triangular Toeplitz matrices. Our approach enables FFTMatvec, initially a CUDA-only application, to run seamlessly on AMD GPUs with excellent performance. Performance optimizations for AMD GPUs are integrated into the open-source rocBLAS library, keeping the application code unchanged. We then present a dynamic mixed-precision framework for FFTMatvec; a Pareto front analysis determines the optimal mixed-precision configuration for a desired error tolerance. Results are shown for AMD Instinct MI250X, MI300X, and the newly launched MI355X GPUs. The performance-portable, mixed-precision FFTMatvec is scaled to 4,096 GPUs on the OLCF Frontier supercomputer.

Paper

MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-Training To Break the GPU Memory Wall

3:30pm - 3:52pm CST Wednesday, 19 November 2025 261-262-265-266

HPC for Machine Learning

Livestreamed

Recorded

DescriptionLarge language models (LLMs) have been rapidly adopted across all domains, supporting divergent use cases with remarkable accuracy. However, training these massive models requires scaling across multiple GPUs. Given the expensive and limited GPU resources, advanced redundancy elimination and parallelization techniques are employed to maximize training throughput. Furthermore, to run LLMs larger than the aggregated memory of multiple GPUs, host memory or disk offloading techniques are leveraged. Despite advanced asynchronous multi-tier read/write strategies, such offloading strategies result in significant I/O overheads in the critical path of training. To this end, we propose MLP-Offload, a novel multi-level, multi-path offloading engine specifically designed for optimizing LLM training on resource-constrained setups by mitigating I/O bottlenecks. We design and implement MLP-Offload to offload the optimizer states across multiple tiers in a cache-efficient and concurrency-controlled fashion to mitigate I/O bottlenecks. Evaluations on models up to 280B parameters show that MLP-Offload achieves 2.5x faster iterations.

Paper

mLR: Scalable Laminography Reconstruction Based on Memoization

10:30am - 10:52am CST Tuesday, 18 November 2025 263-264

Performance Measurement, Modeling, & Tools

Livestreamed

Recorded

DescriptionADMM-FFT is an iterative method with high reconstruction accuracy for laminography but suffers from excessive computation time and large memory consumption. We introduce mLR, which employs memoization to replace the time-consuming Fast Fourier Transform (FFT) operations based on the unique observation that similar FFT operations appear in iterations of ADMM-FFT. We introduce a series of techniques to make the application of memoization to ADMM-FFT performance-beneficial and scalable. We also introduce variable offloading to save CPU memory and scale ADMM-FFT across GPUs within and across nodes. Using mLR, we are able to scale ADMM-FFT on an input problem of $2K \times 2K \times 2K$, which is the largest input problem laminography reconstruction has ever worked on with the ADMM-FFT solution on limited memory; mLR brings 52.8\% performance improvement on average (up to 65.4\%), compared to the original ADMM-FFT.

Workshop

Modeling and Optimizing Real-Time Telescope Interaction for Multi-wavelength Observation of Gamma-ray Bursts

10:55am - 11:20am CST Friday, 21 November 2025 260-267

Livestreamed

Recorded

DescriptionMulti-wavelength observation of gamma-ray bursts (GRBs) requires real-time interaction among multiple telescopes. A gamma-ray telescope detects and localizes a GRB in the sky and must then communicate with an optical telescope to direct the latter toward the GRB as quickly as possible. We previously developed software for ADAPT, a suborbital gamma-ray telescope, to localize GRBs in real time, on a timescale shorter than that of the GRB itself. This work therefore studies progressive localization, in which ADAPT computes a series of increasingly accurate location estimates during a GRB to enable a partner instrument to more rapidly find it. We describe a modeling and optimization framework to decide when ADAPT should compute estimated GRB locations to minimize the time for the partner to find the GRB. Our framework can design progressive strategies that allow a partner telescope to find a GRB up to 42% faster than strategies using a single alert.

Workshop

Modeling the Carbon Footprint of HPC: The Top 500 and EasyC

5:00pm - 5:15pm CST Sunday, 16 November 2025 264

Livestreamed

Recorded

DescriptionClimate change is a critical concern for HPC systems, but GHG protocol carbon-emission accounting methodologies are difficult for a single system, and effectively infeasible for a collections of systems.
As a result, there is no HPC-wide carbon reporting, and even the largest HPC sites do not do it.
We assess the carbon footprint of HPC, focusing on the Top 500 systems. The key challenge is modeling carbon footprint with limited data availability.
With the disclosed Top500.org data, and using EasyC tool, we were able to model operational carbon of 391 HPC systems and embodied carbon of 283 systems. This coverage can be further enhanced by public information, then interpolation is used to produce the first carbon footprint estimates of the Top 500 HPC systems (1.4 million MT CO2e operational carbon and 1.9 million MT CO2e embodied carbon). We also project how the Top 500's carbon footprint will increase through 2030.

Workshop

Modelling Load Imbalance In Shared Memory Multicore Systems

9:01am - 9:30am CST Monday, 17 November 2025 267

Livestreamed

Recorded

DescriptionMemory bandwidth has become the primary limiting factor of performance in many modern HPC applications, and it poses a limit to scalability because the achievable memory
bandwidth only grows linearly with a small number of CPU cores. When the number of cores concurrently using the memory system exceeds a threshold, the aggregate memory bandwidth quickly saturates.

To estimate the time usage of a computation dominated by memory
traffic, the mainstream strategy is to divide the expected total memory
traffic volume by the maximum memory bandwidth. However, this
implicitly assumes homogeneous memory traffic which is often not the case, leading to inaccurate time estimates by
the mainstream strategy.

In this paper, we present a new performance model that specifically
targets inhomogeneity in per-core memory traffic.
The new requires only three hardware parameters. Using several cases of uneven per-core memory traffic, we demonstrate its advantage
over the mainstream strategy.

Workshop

Moderated Discussion

11:20am - 12:00pm CST Friday, 21 November 2025 274

Livestreamed

Recorded

DescriptionThis moderated discussion will engage all symposium speakers and participants in a comprehensive conversation about the future directions of AI workflows. The session will address both technical and community-driven priorities, fostering new connections and ideas among workshop attendees.

Workshop

Moderated Panel Discussion

11:30am - 12:25pm CST Monday, 17 November 2025 240

Livestreamed

Recorded

DescriptionOur invited speakers address this year's charge question,
then our audience & panelists will dig deeper in a moderated discussion.

Tutorial

Modern High Performance I/O: Leveraging Object Stores

8:30am - 5:00pm CST Sunday, 16 November 2025 120

Livestreamed

Recorded

TUT

DescriptionAs more diverse applications move to high performance computing (HPC), the I/O workload has also become more varied. The community has spent three decades improving large-scale parallel file systems, but object stores provide new approaches and novel interfaces. At the same time, traditional parallel I/O approaches and abstractions have evolved under the covers to make use of both novel object stores and classic parallel file systems. In this full-day tutorial we will explore several object storage technologies and discuss both old and new approaches to getting maximum performance from them. With a mix of lectures and hands-on exercises, attendees will learn about object storage design and usage. We will also tackle how the classic I/O software stack has evolved to make use of object stores, as well as provide attendees with the knowledge to know when file systems and object stores are the most appropriate storage approach for their applications. By the end of the day, attendees will learn a bit more about what these new storage systems are doing, as well as how to use libraries and tools to hide the details.

Workshop

Modernizing HPC Configuration Management

9:01am - 9:31am CST Sunday, 16 November 2025 276

Livestreamed

Recorded

DescriptionHigh-performance computing (HPC) environments require configuration management systems to support diverse infrastructure and operational needs. At the National Center for Supercomputing Applications (NCSA), we initiated a multi-year transition from Puppet to Ansible to modernize our configuration management across our active HPC clusters. This paper presents the motivations behind the migration, including limitations encountered with Puppet and the advantages of Ansible’s agentless architecture and human-readable YAML-based configuration model.We detail our transition methodology, emphasizing cross-team collaboration, configuration parity, and low operational impact to production systems. Comparative insights highlight key differences in compliance enforcement, inventory visibility, automation workflows, secrets management, and custom module development. Additionally, we share implementation insights regarding community resource gaps, provisioning integration, access constraints, and organizational buy-in.Our experience underscores the importance of deliberate planning and collaborative toolsets in infrastructure modernization.

Workshop

ModuLair: Streamlining Python Virtual Environment Management for HPC

4:30pm - 4:50pm CST Sunday, 16 November 2025 276

Livestreamed

Recorded

DescriptionManaging Python environments on high-performance computing (HPC) systems presents unique challenges due to complex toolchains, file system constraints, and diverse user needs. We present ModuLair, a modular, metadata-driven Python virtual environment framework designed to simplify environment creation, activation, and management in HPC contexts. ModuLair supports both EasyBuild and non-EasyBuild module systems, automatically detecting explicit specification of toolchains to ensure reproducibility and compatibility across workflows. The framework integrates seamlessly with command-line and graphical interfaces, including the improved User Dashboard, Job Composer, and JupyterLab, enabling visual, intuitive environment management for both novice and experienced users.

We validate ModuLair through usage metrics collected over five months across three HPC clusters, demonstrating sustained adoption by active Python users and integration into ongoing research workflows in both CLI and GUI contexts. These results show that ModuLair reduces setup complexity, lowers the barrier to entry, and promotes best practices in environment configuration and job submission.

Workshop

Modular Architecture for High-Performance and Low Overhead Data Transfers

5:05pm - 5:20pm CST Sunday, 16 November 2025 266

Livestreamed

Recorded

DescriptionHigh-performance applications necessitate rapid and dependable transfer of massive datasets across geographically dispersed locations. Traditional file transfer tools often suffer from resource underutilization and instability due to fixed configurations or monolithic optimization methods. We propose AutoMDT, a novel Modular Data Transfer Architecture, to address these issues by employing deep reinforcement learning based agent to simultaneously optimize concurrency levels for read, network, and write operations. This solution incorporates a lightweight network–system simulator, enabling offline training of a Proximal Policy Optimization (PPO) agent in approximately 45 minutes on average, thereby overcoming the impracticality of lengthy online training in production networks. AutoMDT’s modular design decouples I/O and network tasks. This allows the agent to capture complex buffer dynamics precisely and to adapt quickly to changing system and network conditions. Evaluations on production-grade testbeds show that AutoMDT achieves up to 8X faster convergence and 68\% reduction in transfer completion times compared to state-of-the-art solutions.

Workshop

MoE-Inference-Bench: Performance Evaluation of Mixture of Expert Large Language and Vision Models

2:00pm - 2:30pm CST Monday, 17 November 2025 267

Livestreamed

Recorded

DescriptionMixture of Experts (MoE) models have enabled the scaling of Large Language Models (LLMs) and Vision Language Models (VLMs) by achieving massive parameter counts while maintaining computational efficiency. However, MoEs introduce several inference-time challenges, including load imbalance across experts and the additional routing computational overhead. To address these challenges and fully harness the benefits of MoE, a systematic evaluation of hardware acceleration techniques is essential. We present MoE-Inference-Bench, a comprehensive study to evaluate MoE performance across diverse scenarios. We analyze the impact of batch size, sequence length, and critical MoE hyperparameters such as FFN dimensions and number of experts on throughput. We evaluate several optimization techniques on Nvidia H100 GPUs, including pruning, Fused MoE operations, speculative decoding, quantization, and various parallelization strategies. Our evaluation includes MoEs from the Mixtral, DeepSeek, OLMoE and Qwen families. The results reveal performance differences across configurations and provide insights for the efficient deployment of MoEs.

Workshop

Mojo: MLIR-based Performance-Portable HPC Science Kernels on GPUs for the Python Ecosystem

10:30am - 11:00am CST Monday, 17 November 2025 266

Livestreamed

Recorded

DescriptionWe explore the performance and portability of the novel Mojo language for scientific computing workloads on GPUs. As the first language based on the LLVM's Multi-Level Intermediate Representation (MLIR) compiler infrastructure, Mojo aims to close performance and productivity gaps by combining Python's interoperability and syntax with CUDA-like compile-time programming. We target four scientific workloads: (i) a Seven-point stencil (memory-bound), (ii) BabelStream (memory-bound), (iii) miniBUDE (compute-bound), and (iv) Hartree--Fock (compute-bound with atomic operations), and compared their performance against vendor baselines on NVIDIA H100 and AMD MI300A GPUs. We show that Mojo's performance is competitive with CUDA and HIP for memory-bound kernels, whereas gaps exist on AMD GPUs for atomic operations and for fast-math compute-bound kernels on both AMD and NVIDIA GPUs. Although the learning curve and programming requirements are still fairly low-level, Mojo can close significant gaps in the fragmented Python ecosystem in the convergence of scientific computing and AI.

Best Poster Presentations (Research, ACM SRC Grad/Undergrad)

Mojo: Python-Like MLIR-Based GPU Portable Science Kernels

2:15pm - 2:30pm CST Wednesday, 19 November 2025 230

Research & ACM SRC Posters

DescriptionThis work investigates Mojo, a new MLIR-based language that combines Python-like syntax with portable, low-level GPU programming capabilities. We compare the performance of the Mojo portable GPU kernels against vendor-specific C++ NVIDIA CUDA and AMD HIP implementations on four representative scientific workloads: (1) BabelStream (memory-bound); (2) seven-point stencil (memory-bound); (3) miniBUDE (compute-bound); and (4) Hartree-Fock (compute-bound with atomic operations), evaluated on NVIDIA H100 and AMD MI300A GPUs. Results show that Mojo can match CUDA and HIP performance for memory-bound kernels, though gaps remain for atomic operations and certain compute-bound cases. This poster will present a general overview of the language, our benchmarking methodology, comparative results, the use of vendor profiling tools, and observations on Mojo’s potential to close the gap between high performance and developer productivity in scientific GPU programming. Our contribution is the first systematic evaluation of Mojo for HPC workloads, highlighting both its promise and current limitations.

Workshop

Molten Chloride Small Modular Reactor Performance Characteristics for Data Center Operation

4:30pm - 4:45pm CST Sunday, 16 November 2025 264

Livestreamed

Recorded

DescriptionSmall modular reactors (SMRs) require a smaller physical footprint than conventional large nuclear reactors while still providing high reliability in power generation and they are frequently discussed in the context of providing power for data centers. Molten chloride reactors represent a new type of SMR where a mixture of molten chloride and uranium serves as both reactor fuel and coolant. The approach is key to molten chloride fast reactor technologies being marketed commercially in SMRs for end users like data centers. This work leverages the open-source Multiphysics Object Oriented Simulation Environment (MOOSE) framework to simulate the behavior of a molten salt SMR with the control rods at 60% open exploring steady state behavior of an SMR while operating below maximum capacity to power a data center with a mismatched power rating.

Paper

Moment: Co-Optimizing Physical Communication Topology and Data Placement for Multi-GPU Out-of-Core GNN Training

11:37am - 12:00pm CST Tuesday, 18 November 2025 260-267

BSP

HPC for Machine Learning

System Software and Cloud Computing

Partially Livestreamed

Partially Recorded

DescriptionGraph neural networks (GNNs) are widely employed in applications like recommendation systems, social network analysis, and fraud detection, but training large-scale GNNs is challenging due to memory limitations. Existing systems face a trade-off between throughput and monetary cost: distributed systems require expensive memory scaling, while single-machine out-of-core systems are limited by GPU/PCIe throughput. To this end, we propose Moment, a physical communication topology and data placement co-optimizer to enable high-throughput and low-cost GNN training in a single multi-GPU machine. Moment addresses communication contention and GPU load imbalance by modeling the physical topology as capacity-constrained directed graphs and formulating communication scheduling as a max-flow problem. It also introduces a data distribution-aware knapsack algorithm for optimized data placement. Experimental results show that Moment outperforms out-of-core systems by up to 6.51× and distributed systems by up to 3.02×, with only 50% monetary cost.

Early Career

Morning Break

10:00am - 10:30am CST Monday, 17 November 2025 262

Not Livestreamed

Not Recorded

TUT

XO/EX

Workshop

Morning Break

10:00am - 10:30am CST Sunday, 16 November 2025 274

Livestreamed

Recorded

Workshop

Morning Break - AI4S: 6th Workshop on Artificial Intelligence and Machine Learning for Scientific Applications

10:00am - 10:30am CST Monday, 17 November 2025 274

Livestreamed

Recorded

Workshop

Morning Break - Alternatives To MPI+X (PAW-ATM 2025)

10:00am - 10:30am CST Sunday, 16 November 2025 230

Livestreamed

Recorded

Workshop

Morning Break - Best Practices for HPC Training and Education

10:00am - 10:30am CST Monday, 17 November 2025 232

Livestreamed

Recorded

Workshop

Morning Break - CANOPIE-HPC

10:00am - 10:30am CST Monday, 17 November 2025 275

Livestreamed

Recorded

Workshop

Morning Break - Computational Approaches for Cancer Workshop (CAFCW25)

10:00am - 10:30am CST Monday, 17 November 2025 241

Livestreamed

Recorded

Workshop

Morning Break - EduHPC-25

10:00am - 10:30am CST Sunday, 16 November 2025 261

Livestreamed

Recorded

Workshop

Morning Break - ExaMPI25

10:00am - 10:30am CST Sunday, 16 November 2025 260

Livestreamed

Recorded

Workshop

Morning Break - ExHetAI

10:00am - 10:30am CST Sunday, 16 November 2025 264

Partially Livestreamed

Partially Recorded

Workshop

Morning Break - Fifth International Symposium on Quantitative Co-Design of Supercomputers

10:00am - 10:30am CST Monday, 17 November 2025 240

Livestreamed

Recorded

Workshop

Morning Break - Frontiers in Generative AI for HPC Science and Engineering: Foundations, Challenges, and Opportunities

10:00am - 10:30am CST Sunday, 16 November 2025 241

Livestreamed

Recorded

Workshop

Morning Break - HPC Systems Professionals Workshop (HPCSYSPROS25)

10:01am - 10:31am CST Sunday, 16 November 2025 276

Livestreamed

Recorded

Workshop

Morning Break - IA^3 2025

10:00am - 10:30am CST Sunday, 16 November 2025 232

Livestreamed

Recorded

Workshop

Morning Break - International Workshop on Innovating the Network for Data Intensive Science (INDIS)

10:00am - 10:30am CST Sunday, 16 November 2025 266

Livestreamed

Recorded

Workshop

Morning Break - International Workshop on RISC-V for HPC (RISCVHPC)

10:00am - 10:30am CST Monday, 17 November 2025 242

Livestreamed

Recorded

Workshop

Morning Break - ISAV25

9:40am - 10:10am CST Sunday, 16 November 2025 242

Partially Livestreamed

Partially Recorded

Workshop

Morning Break - LLVM-HPC2025

10:00am - 10:30am CST Monday, 17 November 2025 260

Livestreamed

Recorded

Workshop

Morning Break - PDSW'25: The 10th International Parallel Data Systems Workshop

10:00am - 10:30am CST Monday, 17 November 2025 230

Data Analytics

High Performance I/O, Storage, Archive, & File Systems

Storage

Livestreamed

Recorded

Workshop

Morning Break - PMBS25

10:00am - 10:30am CST Monday, 17 November 2025 267

Livestreamed

Recorded

Workshop

Morning Break - Research Software Engineers in HPC (RSE-HPC-2025)

10:00am - 10:30am CST Sunday, 16 November 2025 267

Livestreamed

Recorded

Workshop

Morning Break - Scalable Algorithms for Large-Scale Heterogeneous Systems (ScalAH'25)

10:00am - 10:30am CST Sunday, 16 November 2025 231

Livestreamed

Recorded

Workshop

Morning Break - Sixth Combined Workshop on Interactive and Urgent High-Performance Computing

10:00am - 10:30am CST Friday, 21 November 2025 260-267

Livestreamed

Recorded

Workshop

Morning Break - Symposium on Artificial Intelligence and Extreme-Scale Workflows

10:00am - 10:30am CST Friday, 21 November 2025 274

Livestreamed

Recorded

Workshop

Morning Break - WHPC: Building Community, Building Careers

10:00am - 10:30am CST Monday, 17 November 2025 276

Partially Livestreamed

Partially Recorded

Workshop

Morning Break - Workflows in Support of Large-Scale Science (WORKS25)

10:00am - 10:30am CST Monday, 17 November 2025 264

Livestreamed

Recorded

Workshop

Morning Break - Workshop for Software Frameworks and Workload Management on Quantum-HPC Ecosystems

10:00am - 10:30am CST Friday, 21 November 2025 275

Livestreamed

Recorded

Workshop

Morning Break - Workshop on Accelerator Programming and Directives (WACCPD 2025)

10:00am - 10:30am CST Monday, 17 November 2025 266

Livestreamed

Recorded

Workshop

Morning Break - Workshop on Data Analysis and Reduction for Big Scientific Data

10:00am - 10:30am CST Monday, 17 November 2025 265

Livestreamed

Recorded

Workshop

Morning Break - Workshop on Emerging Parallel Distributed Runtime Systems and Middleware

10:00am - 10:30am CST Friday, 21 November 2025 261-262-265-266

Livestreamed

Recorded

Workshop

Morning Break - Workshop on Heterogeneous High-Performance Reconfigurable Computing (H2RC 2025)

10:00am - 10:30am CST Friday, 21 November 2025 263-264

Livestreamed

Recorded

Workshop

Morning Break - Workshop on HPC Testing and Evaluation of Systems, Tools, and Software (HPCTESTS 2025)

10:00am - 10:30am CST Friday, 21 November 2025 276

Livestreamed

Recorded

Workshop

Morning Break - Workshop on Large Scale Quantum-Classical Computing

10:00am - 10:30am CST Sunday, 16 November 2025 275

Livestreamed

Recorded

Workshop

Morning Break - Workshop on Performance, Portability, and Productivity in HPC

10:00am - 10:30am CST Monday, 17 November 2025 231

Performance Evaluation, Scalability, & Portability

Livestreamed

Recorded

Workshop

Morning Break - Workshop on Software Correctness for HPC Applications (Correctness '25)

10:00am - 10:30am CST Monday, 17 November 2025 261

Livestreamed

Recorded

Workshop

Morning Break - XLOOP 2025

10:00am - 10:30am CST Sunday, 16 November 2025 240

Livestreamed

Recorded

Birds of a Feather

MPI 5.0 Now Available!

12:15pm - 1:15pm CST Tuesday, 18 November 2025 261-262-265-266

Standards

Livestreamed

Recorded

XO/EX

DescriptionThe Message Passing Interface (MPI) API is the most dominant programming approach for HPC environments. Its specification is driven by the MPI Forum, an open forum consisting of MPI developers, vendors, and users. This year, the MPI Forum published the latest version of the standard, MPI 5.0. We will take a look at the new features and will discuss what they mean for the users of MPI. We will also discuss ongoing work toward the next version of the MPI standard, with lightning talks from the working groups, and get feedback from the community.

Workshop

MPI Collectives with Programmable Smart Switches

12:00pm - 12:30pm CST Sunday, 16 November 2025 260

Livestreamed

Recorded

DescriptionProgrammable smart network devices are heavily used by cloud providers, but typically not for HPC. However, they provide opportunities for off-loading computations, in particular for collective operations, which are important for data intensive workloads in classic HPC and ML training. In this paper, we present a prototype called mpitofino to enable offloading MPI collectives (in particular reductions) onto smart switches over an Ethernet fabric. We target Intel’s programmable Ethernet switches equipped with a Tofino ASIC, and we use the P4 programming language to process collective packets on the chip’s low-latency data path. We demonstrate how the flexibility of P4 enables us to use RoCEv2 as protocol, utilizing RDMA hardware support on the nodes’ NICs. Furthermore, we implement mpitofino as a collective provider in Open MPI and discuss its desirable scaling characteristics. Finally, we demonstrate that mpitofino can achieve data throughput close to the 100GBit/s line rate.

Workshop

MPI Communication Performance on AMD MI300A: Microbenchmarks and Applications

10:30am - 11:00am CST Friday, 21 November 2025 261-262-265-266

Livestreamed

Recorded

DescriptionAMD’s MI300A integrates CPU and GPU chiplets around a shared HBM3 pool, removing the traditional host-device boundary and changing assumptions in GPU-aware MPI. Despite early deployments, there is little guidance on how mainstream MPI libraries behave on this architecture. This evaluation paper presents a comparative study of MVAPICH-Plus, Open MPI, MPICH, and Cray MPICH on MI300A APU nodes. We measure point-to-point performance on CPU and GPU buffers, reporting intra-node and inter-node latency, unidirectional bandwidth, and bidirectional bandwidth across various message sizes. We then examine collectives, covering reduction-based and data-movement-based operations, and analyze scaling behavior. Finally, we connect microbenchmark trends to application results using OpenFOAM and Distributed training of a large language model (LLM) with Pytorch. The study distills practical guidance and highlights opportunities for MI300A-aware optimizations in MPI.

Research and ACM SRC Posters

MPI-SGX: Enabling Confidential Computing for MPI Parallel Applications with Intel SGX Technology

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionBig data and deep learning workloads often require handling sensitive data, but security mechanisms in current supercomputers mainly protect against external threats, leaving risks of insider leakage. As a result, supercomputers remain unsuitable for confidential applications. To address this challenge, we propose the first SGX-based parallel computing system with a secure MPI library, MPI-SGX. MPI-SGX enables MPI processes across multiple SGX enclaves to communicate safely through encryption, without requiring code modifications. By combining MPI-SGX with SGX enclaves, our system supports confidential execution of MPI-based parallel applications. Experimental results show that our approach incurs a 6.6x increase in communication latency and a 49% reduction in bandwidth compared to the baseline, but successfully achieves confidentiality. In the poster session, we will present the design of the SGX-based system and MPI-SGX, report detailed experimental findings, and discuss directions for improving performance and expanding the scope of secure HPC.

Birds of a Feather

MPICH: A High Performance Open-Source MPI Implementation

5:15pm - 6:45pm CST Wednesday, 19 November 2025 125

System Architecture

Livestreamed

Recorded

XO/EX

DescriptionMPICH is a widely used, open-source implementation of the MPI message passing standard. It has been ported to many platforms and used by several vendors and research groups as the basis for their own MPI implementations. This BoF session will provide a forum for core MPICH developers to share new features and release plans with the community. Developers of MPI implementations derived from MPICH will share their own status updates and discuss experiences and issues in using and porting MPICH. Key users will be given an opportunity to present MPICH usage stories. Questions from the audience are welcome.

Workshop

MPPI - Type safe C++ Datatypes for MPI

10:30am - 11:00am CST Sunday, 16 November 2025 260

Livestreamed

Recorded

DescriptionMPI provides a flexible C-API to communicate data of various types between a set of distributed processes over high-speed interconnects in HPC systems. Data buffers are described using MPI-Datatypes, which specify the type and layout of the data to be transmitted. To construct these datatypes, users must manually describe the memory layout of buffer elements via the MPI-API. However, modern applications are typically written in object-oriented C++, which offers significant advantages over C, including type safety and metaprogramming capabilities. In this work, we introduce a new C++-API and datatype engine that leverage C++ language features such as concepts, ranges, and the upcoming reflection to extract the necessary datatype information for the user at compile-time. This approach simplifies the user’s work, enhances code safety by eliminating manual datatype construction and offers previously unavailable possibilities. Our measurements demonstrate that this interface introduces no performance overhead and, in some cases, even improves performance.

Workshop

MT4G: A Tool for Reliable Auto-Discovery of NVIDIA and AMD GPU Compute and Memory Topologies

2:30pm - 3:00pm CST Monday, 17 November 2025 241

Livestreamed

Recorded

DescriptionUnderstanding GPU topology is essential for performance-related tasks in HPC or AI. Yet, unlike for CPUs with tools like hwloc, GPU information is hard to come by, incomplete, and vendor-specific.

In this work, we address this gap and present MT4G, an open-source and vendor-agnostic tool that automatically discovers GPU compute and memory topologies and configurations, including cache sizes, bandwidths, and physical layouts.
MT4G combines existing APIs with a suite of over 50 microbenchmarks, applying statistical methods, such as the Kolmogorov-Smirnov test, to automatically and reliably identify otherwise programmatically unavailable topological attributes.

We showcase MT4G's universality on ten different GPUs and demonstrate its impact through integration into three workflows: GPU performance modeling, GPUscout bottleneck analysis, and dynamic resource partitioning.
These scenarios highlight MT4G's role in understanding system performance and characteristics across NVIDIA and AMD GPUs, providing an automated, portable solution for modern HPC and AI systems.

Research and ACM SRC Posters

Multi-GPU Implementation and Roofline Analysis of a Numerical Global Ocean Model

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionNumerical ocean models are essential tools for climate prediction and marine resource studies, requiring high resolution and realistic physical processes. We developed the global ocean model COCO and implemented it on GPUs using an OpenACC directive-based approach, while maintaining compatibility with CPUs. Performance was evaluated on the Miyabi supercomputer, which includes GPU-based (NVIDIA GH200) and CPU-based (Intel Xeon MAX 9480) systems. Realistic ocean experiments with a 0.17° global grid showed that most components achieved faster execution on GPUs, with the tracer calculation accelerated by a factor of 2.9. Roofline analysis revealed that most loops were memory-bound, and GPU speedup was constrained by memory bandwidth rather than compute capability. Future improvements will require increasing arithmetic intensity and applying kernel-level optimizations, while ensuring compatibility between CPU- and GPU-based codes.

Workshop

Multi-rail RoCE, Now with more BGP!

11:30am - 11:45am CST Sunday, 16 November 2025 276

Livestreamed

Recorded

DescriptionHeterogeneous compute nodes containing multiple accelerators and Ethernet network injections have become common in recent years. Despite this, additional network injections beyond the first are often only utilized by application middleware such as MPI or NCCL supporting an RDMA API. We explain why traditional Etherchannel can't support this usecase. We further propose an alternative network configuration which allows these hardware resources to be utilized both by RDMA application middleware such as MPI as well as other applications which utilize the OS provided sockets API rather than a kernel bypass API. This allows user applications using less HPC focused (but potentially more portable) APIs as well as parallel filesystems and other tools to also benefit from the additional networking hardware available in this type of compute node.

SCinet Network Research Exhibition

Multi-Resource Cyberinfrastructure Services for Science Domain Workflows via SENSE

11:00am - 11:20am CST Wednesday, 19 November 2025 Booth 3537 - SCinet Theater

Not Livestreamed

Not Recorded

Descriptionnri106
nri### (from Joe Mambretti/StarLight) nri### (from Harvey Newman/Caltech)

ACM Gordon Bell Finalist

Awards and Award Talks

Multiscale Light-Matter Dynamics in Quantum Materials: From Electrons to Topological Superlattices

11:37am - 12:00pm CST Tuesday, 18 November 2025 261-262-265-266

Applications

Livestreamed

Recorded

DescriptionLight-matter dynamics in topological quantum materials enables ultralow-power, ultrafast devices. A challenge is simulating multiple field and particle equations for light, electrons, and atoms over vast spatiotemporal scales on Exaflop/s computers with increased heterogeneity and low-precision focus. We present a paradigm shift that solves the multiscale/multiphysics/heterogeneity challenge harnessing hardware heterogeneity and low-precision arithmetic. Divide-conquer-recombine algorithms divide the problem into not only spatial but also physical subproblems of small dynamic ranges and minimal mutual information, which are mapped onto best-characteristics-matching hardware units, while metamodel-space algebra minimizes communication and precision requirements. Using 60,000 GPUs of Aurora, DC-MESH (divide-and-conquer Maxwell-Ehrenfest surface hopping) and XS-NNQMD (excited-state neural-network quantum molecular dynamics) modules of MLMD (multiscale light-matter dynamics) software were 152- and 3,780-times faster than the state-of-the-art for 15.4 million-electron and 1.23 trillion-atom PbTiO3 material, achieving 1.87 EFLOP/s for the former. This enabled the first study of light-induced switching of topological superlattices for future ferroelectric "topotronics."

Paper

MXBLAS: Accelerating 8-Bit Deep Learning with a Unified Micro-Scaled GEMM Library

10:52am - 11:15am CST Thursday, 20 November 2025 260-267

Algorithms

HPC for Machine Learning

Programming Frameworks

Livestreamed

Recorded

DescriptionMicro-scaling general matrix multiplication (MX-GEMM) uses 8-bit MX-format inputs to accelerate deep learning workloads. While the MX-format supports diverse scaling patterns and granularities, current MX-GEMM implementations are often model-specific. This leads to three main issues: tight coupling between models and kernels, inefficient promotion operations, and neglected quantization overhead.

This paper introduces MXBLAS, a high-performance MX-GEMM library that supports the full range of MX-format variations. MXBLAS overcomes prior limitations with three key innovations: (1) a template-based design enabling flexible promotion patterns within a unified framework; (2) adaptive runtime kernel generation using template matching, guided search pruning, and auto-tuning to find optimal configurations; and (3) a compute-store co-optimization that fuses quantization into the kernel’s epilogue, reducing overhead. Experiments show MXBLAS outperforms existing MX-GEMM libraries by 33% on average, and is the first to fully harness the performance potential of generalized 8-bit computing across all MX-formats.

Workshop

National Data Platform's Education Hub

4:42pm - 4:48pm CST Sunday, 16 November 2025 261

Livestreamed

Recorded

DescriptionAs demand for AI literacy and data science education grows, there is a critical need for infrastructure that bridges the gap between research data, computational resources, and educational experiences. To address this gap, we developed a first-of-its-kind Education Hub within the National Data Platform. This hub enables seamless connections between collaborative research workspaces, classroom environments, and data challenge settings. Early use cases demonstrate the effectiveness of the platform in supporting complex and resource-intensive educational activities. Ongoing efforts aim to enhance the user experience and expand
adoption by educators and learners alike.

Invited Talk

National Supercomputing Mission: Building India’s Self-Reliant HPC Ecosystem

3:30pm - 4:15pm CST Wednesday, 19 November 2025 America's Ballroom Tu-Th

National Strategies

Livestreamed

Recorded

DescriptionThe National Supercomputing Mission (NSM) is a Government of India initiative with an aim to promote cutting-edge research in science and technology. The main objective of the Mission is to create HPC infrastructure at various academic and research institutions in the country, develop applications for national needs, develop indigenous HPC technologies for self-reliance, and develop human resources to spearhead HPC activities in the nation. The Centre for Development of Advanced Computing (C-DAC) and the Indian Institute of Science (IISc), Bangalore, are the implementation agencies of the Mission.

To date, 22 large and midsize supercomputing systems with a total compute power of 37+ PF have been created under this program. Ten more systems with a total compute power of 60+ PF are being built with indigenous technologies in the next six months, bringing the cumulative compute power capacity to 100 PF under the Mission. To continue the momentum, an NSM 2.0 proposal with exascale compute power is being worked out with a major focus on self-reliance in supercomputing.

Birds of a Feather

Navigating Complexity: Achieving Performance Portability in the Evolving Landscape of Accelerated HPC Systems

12:15pm - 1:15pm CST Wednesday, 19 November 2025 130

State of the Practice

Livestreamed

Recorded

XO/EX

DescriptionWith the increasing demand for AI in HPC, there has been a rapid rise in accelerated architectures, portable programming models, and frameworks. The already-daunting task of programming for accelerated systems has become even more complex. This BoF, organized by IXPUG, will focus on portable programming across a wide range of heterogeneous architectures—including Intel, NVIDIA, AMD, and Arm—supporting diverse simulation, data analytics, and AI workloads. The session will explore key challenges, state-of-the-art solutions, and emerging best practices for programming across these systems, identifying common principles and methodologies that support development and long-term maintenance across sites, architectures, and scientific applications.

Art of HPC

Navigating the Sky

8:00am - 6:00pm CST Sunday, 16 November 2025 Art of HPC - Plaza Lobby

Art of HPC

Not Livestreamed

Not Recorded

TUT

XO/EX

DescriptionA short meditation in two parts about how ideas and knowledge shape and order what we find in the sky.

Manu-o-Kū

The first part is based on a narration by Polynesian navigator Nainoa Thompson who describes how stars, clouds, waves, and living beings form an interconnected system of orientation that can be read, felt, heard, and smelled. This celestial knowledge is not a product of the human mind alone but shared with animals such as the seabird Manu-o-Kū, which indicates the proximity of land. Thompson’s Hawaiian voyaging canoe played a central role in the revival of traditional Polynesian non-instrumental navigation techniques in the 1970s. The close entanglement of celestial knowledge and cultural ideas is also reflected in the visuals generated by an artificial neural network that has been trained on millions of images representing contemporary visual culture.

SIMBAD

The second part traces how scientific knowledge is shaped by instruments and human culture. SIMBAD, alluding to another mythical seafarer, is the name of an astronomical database maintained by the Université de Strasbourg. It maps every celestial object described in scientific literature to its corresponding place in the sky. Looking at the composite image of all astronomical references, one is struck by distinct geometrical patterns – rectangles, circles, and other complex shapes appear in the map of all known stars and galaxies, revealing the imprints of instruments, publication formats, and changing cultural interests. Sounds and visuals are generated from 28 million bibliographic references extracted from the database.

Panel

Navigating the Software Storm: Writing Software in the Age of Extreme Heterogeneity

3:30pm - 5:00pm CST Wednesday, 19 November 2025 240-241-242

AI, Machine Learning, & Deep Learning

HPC Software & Runtime Systems

Parallel Programming Methods, Models, Languages, & Environments

Livestreamed

Recorded

DescriptionThe future of high performance computing faces a software reckoning. As architectural diversity explodes—from AI accelerators and chiplet-based designs to quantum processors—our programming models, compilers, and system software are under strain. New programming systems seemingly materialize from thin air while stalwarts like Fortran struggle to keep current. Will AI help us manage the increasing complexity? Can open standards and toolchains like LLVM provide stability? How should the HPC community adapt in an era increasingly dominated by “productivity-focused” languages like Python, Rust, and Julia? This panel assembles thought leaders across architectures, languages, AI, and programming models to debate whether today's approaches are sustainable—or whether radical new software infrastructures are needed to keep science moving forward. Technology is only one factor in the solution; business models, training, and stewardship are equally, if not more, important than a specific technology.

Workshop

Network Replay and Consistency Across Testbeds

3:05pm - 3:20pm CST Sunday, 16 November 2025 266

Livestreamed

Recorded

DescriptionShared network testbeds are critical for systems and networking research. However, their shared hardware can introduce variability—like increased jitter or loss—that may impact experiment fidelity or reproducibility.

We present Choir, the first 100 Gbps replay tool designed to run on commodity hardware and shared infrastructures. Choir enables precise replay and measurement to observe how closely a testbed reproduces expected behavior. We also introduce a metric for quantifying consistency, designed to support comparison across time, configurations, and environments.

We evaluate our approach on FABRIC and a local, bare-metal testbed. We show that FABRIC, even with dedicated resources and low background utilization, has greater variability in inter-packet arrival times and latency compared to the local testbed. With high utilization on shared hardware, this variability increases by an order of magnitude. Our findings demonstrate how tools like Choir can help researchers better understand and mitigate the effects of shared infrastructure.

Flash Session

Networking: The Cornerstone of the Future of AI/HPC Data Centers

10:15am - 10:30am CST Wednesday, 19 November 2025 Booth 2638 - Flash Session

Not Livestreamed

Not Recorded

XO/EX

Workshop

Next Generation Linear Algebra: One Singular Value Decomposition to Catch Them All across Data Size, Precision, and Hardware

2:10pm - 2:15pm CST Monday, 17 November 2025 276

Partially Livestreamed

Partially Recorded

DescriptionThe Singular Value Decomposition (SVD) is a foundational building block in many applications, including low rank adaptation (LoRA) for large language models (LLMs). Historically, separate SVD implementations have been designed for each data precision, for each hardware vendor, and for each hardware type (personal computer and HPC). This divergence leads to increased development time, the need to redevelop entire libraries when new architectures or data types emerge, and significant complexity for the end user. In this abstract, we discuss a work in progress to develop an alternative: a unified SVD, enabled by abstraction layers. We demonstrate that state-of-the-art performance across the board can be reached using abstraction frameworks, and investigate the performance engineering process and the characteristics that enable adaptable performance.

Exhibitor Forum

Next Vector Project Based on Proven NEC Vector and RISC-V Architecture

10:30am - 11:00am CST Thursday, 20 November 2025 130

Hardware and Architecture

Livestreamed

Recorded

XO/EX

DescriptionThe Next Vector Project by NEC addresses the growing challenges in high performance computing (HPC), such as energy efficiency, scalability, and accessibility. Building on the proven SX-Aurora TSUBASA vector architecture, the project integrates the open-standard RISC-V instruction set to foster innovation and collaboration. NEC (a Japanese IT company) partners with Openchip & Software Technologies in Spain to co-develop this next-generation processor system. The initiative emphasizes not only advanced hardware but also a robust software ecosystem, including compiler development and user-friendly programming tools. The goal is to make powerful vector computing accessible to a broader range of users, from HPC experts to domain scientists. The presentation outlines current HPC challenges, showcases the benefits of vector computing, and details the technical and collaborative aspects of the Next Vector Project, including its roadmap and specifications.

Workshop

Next Year Roadmap + Closing Remarks

4:45pm - 5:15pm CST Sunday, 16 November 2025 275

Livestreamed

Recorded

DescriptionJoin us for an in-depth look at the IBM Quantum roadmap, where we will share our vision for the future of quantum computing and outline the key developments and milestones that will drive progress towards achieving fault tolerant quantum computing by 2029. This session will provide a comprehensive overview of our technical roadmap, highlighting upcoming advancements in quantum hardware, software, and ecosystem development. By attending this session, attendees will gain a deeper understanding of IBM's quantum strategy and the exciting developments that are on the horizon, and learn how to prepare for the emerging era of quantum advantage.

SCinet Network Research Exhibition

Next-Generation Systems for Data-Intensive Sciences

10:40am - 11:00am CST Wednesday, 19 November 2025 Booth 3537 - SCinet Theater

Not Livestreamed

Not Recorded

Invited Talk

Nightingale AI: Building Medical Foundation Models at Nation-State Scale

2:15pm - 3:00pm CST Tuesday, 18 November 2025 America's Ballroom Tu-Th

Life Sciences

Societal Impact

Livestreamed

Recorded

DescriptionNightingale AI is Europe’s flagship effort to build sovereign, open medical foundation models using secure national health data. Unlike language-only models, medical AI must learn across multimodal data—imaging, biosignals, genomics, and clinical text—demanding innovations in architecture, scaling, and interpretability. Leveraging exascale compute and federated secure data environments, Nightingale AI pioneers an “AI factory” approach that fuses national-scale datasets with immediate healthcare impact. This talk will share an overview of our work today since launch in March 2025, and how we have partnered from day one with Isambard-AI at an unprecedented scale of compute for academic/medical research teams.

Workshop

NIST SP 800-* in HPC: Standards That Matter

10:38am - 10:45am CST Sunday, 16 November 2025 276

Livestreamed

Recorded

DescriptionAs HPC systems increasingly support sensitive and federally regulated research, frameworks like the NIST Special Publication (SP) 800 series are becoming essential for compliance and data protection. This lightning talk offers a fast-paced overview of the NIST SP 800-* family, highlighting key standards like 800-53, 800-66, 800-171, 800-223, and 800-234, and what they mean for the HPC community. Attendees will gain a high-level understanding of how these standards influence research requirements, cybersecurity expectations, and institutional responsibilities. Whether you're supporting CUI, HIPAA, preparing for CMMC, or looking to stay ahead of compliance trends, this session will help you understand the standards that matter most.

Paper

NNQS-SCI: Tackling Trillion-Dimensional Hilbert Space with Adaptive Neural Network Quantum States

10:52am - 11:14am CST Thursday, 20 November 2025 263-264

Applications

Livestreamed

Recorded

DescriptionNeural network quantum states (NNQS) offer a powerful variational Monte Carlo (VMC) approach for quantum many-body problems, balancing polynomial scaling with high expressive power. However, scaling NNQS to large chemical systems faces challenges in preserving accuracy with exact energy and managing vast configurations efficiently. In this work, we introduce NNQS-SCI, a high-performance selected configuration interaction (SCI) based NNQS method designed to overcome these limitations. NNQS-SCI employs highly parallelized Slater-Condon rules for fast local energy evaluations, avoiding accuracy loss, while its adaptive SCI engine dynamically manages billions of configurations without space explosion or arbitrary cutoffs that plague other NNQS-CI approaches. Optimized for extreme scalability via multi-level parallelism and memory compression, NNQS-SCI successfully simulates systems up to 152 spin orbitals, tackling Hilbert space dimensions exceeding 10$^{14}$ and demonstrating significant advances in scale and efficiency. NNQS-SCI thus provides a robust and scalable path towards high-accuracy quantum chemistry on high performance computing platforms.

Research and ACM SRC Posters

Novel Graph Alignment Algorithms for Identifying Non-Determinism in Large-Scale Simulations

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionThe increasing complexity of HPC simulations poses several challenges to their reproducibility and reliability. One critical issue is the non-determinism (ND) induced by asynchronous MPI communication. Locating the sources of ND in large codes is difficult. This problem can be addressed by comparing event graphs (graphs mapping MPI communication) across multiple runs of the application by using tools like ANACIN-X[2] (to trace the event graphs) and network alignment (to locate areas of ND). We expand ANACIN-X's point-to-point tracing capabilities by adding collective communication tracing, and propose a novel network alignment algorithm to effectively compare event graphs.

Invited Talk

Nuclear Energy and Large-Scale Computing

2:15pm - 3:00pm CST Wednesday, 19 November 2025 America's Ballroom Tu-Th

Energy Efficiency

Power Use Monitoring & Optimization

Livestreamed

Recorded

DescriptionA transformation is underway in the world's energy sector. Tremendous growth in AI, data centers, and electrification is accelerating the need for new and expanded energy sources. Nuclear energy, with unparalleled bipartisan support, is seeing record-breaking private investment in both fission and fusion technologies and global commitments to expanding nuclear energy. Nuclear energy can be a key provider of energy for large-scale computing, and this talk will provide an overview of current activities and challenges in expansion to meet this rising need.

Nuclear energy is also a user of large-scale computing; accelerated deployment timelines are placing a growing importance on high-fidelity simulation to enable parameter exploration, design optimization, and safety analysis for regulatory compliance. The second part of this talk will describe the adoption of high performance computing for nuclear engineering applications, in particular focusing on the NEAMS (Nuclear Energy Advanced Modeling and Simulation) program, codes developed during the ExaSMR project under the Exascale Computing Project (ECP), and research activities at the University of Illinois.

Research and ACM SRC Posters

Numerical Investigation of Radiation Hydrodynamic Instabilities at Scale with FleCSI-HARD

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionThis poster describes the open-source massively parallel and portable radiation hydrodynamics code FleCSI-HARD (Hydrodynamic And Radiative Diffusion) used to study radiation hydrodynamics instabilities. FleCSI-HARD is based on FleCSI (Flexible Computational Science Infrastructure) runtime which enables task-based distributed and portable code to be written in single source modern C++. We show good strong and weak scaling on CPUs and GPUs.

Paper

Numerical Performance of the Implicitly Restarted Arnoldi Method in OFP8, Bfloat16, Posit, and Takum Arithmetics

3:52pm - 4:15pm CST Tuesday, 18 November 2025 274

Algorithms

Applications

Architectures & Networks

Livestreamed

Recorded

DescriptionThe computation of select eigenvalues and eigenvectors of large, sparse matrices is fundamental to a wide range of applications. Accordingly, evaluating the numerical performance of emerging alternatives to the IEEE 754 floating-point standard, such as OFP8 (E4M3 and E5M2), bfloat16, and the tapered-precision posit and takum formats, is of significant interest. Among the most widely used methods for this task is the implicitly restarted Arnoldi method, as implemented in ARPACK.

This paper presents a comprehensive and untailored evaluation based on two real-world datasets: the SuiteSparse Matrix Collection, which includes matrices of varying sizes and condition numbers, and the Network Repository, a large collection of graphs from practical applications. The results demonstrate that the tapered-precision posit and takum formats provide improved numerical performance, with takum arithmetic avoiding several weaknesses observed in posits. While bfloat16 performs consistently better than float16, the OFP8 types are generally unsuitable for general-purpose computations.

Workshop

Numerical Properties and Scalability of s-Step Preconditioned Conjugate Gradient Methods

11:10am - 11:30am CST Sunday, 16 November 2025 231

Livestreamed

Recorded

Descriptions-step Preconditioned Conjugate Gradient (PCG) variants for iteratively solving large sparse linear systems reduce the number of global synchronization points of standard PCG by a factor of O(s). Despite improving scalability on large-scale parallel computers, they have worse numerical properties than standard PCG. Choosing a suitable basis type for the s-step basis matrices is known to potentially improve numerical stability strongly. The s-step method proposed first in the literature was designed to only use the monomial basis. We generalize this method to support arbitrary basis types, denoting our new method as sPCG.

Moreover, we theoretically and experimentally compare all s-step PCG methods. To the best of our knowledge, this is the first comprehensive comparison in the literature. Our theoretical analysis, strong scaling experiments with a synthetic test problem, and runtime experiments with real-world problems confirm that our novel sPCG algorithm achieves higher speedup over standard PCG than existing s-step algorithms.

Workshop

NVIDIA Earth-2, a digital twin of our planet for tackling climate change

9:40am - 10:00am CST Sunday, 16 November 2025 274

Livestreamed

Recorded

DescriptionClimate change is altering the state of our atmosphere, leading to more extreme weather events. But studying our changing climate is enormously complicated due to the many interrelated systems and processes and the sheer size of a planetary scale problem. Digital twin technology provides us with a mechanism to tackle both the scale and complexity of climate change. In this talk, we introduce Earth-2, a revolutionary digital twin framework developed by NVIDIA that allows us to safely look at and experiment with future climate events at both global and regional scales.

Workshop

NVSHMEM4Py

4:30pm - 4:45pm CST Monday, 17 November 2025 242

Livestreamed

Recorded

DescriptionThis presentation introduces NVSHMEM4Py, which provides Python-first host and device APIs that integrate naturally with the Python ecosystem. The library supports array-oriented memory management, collectives, and one-sided communication as on-stream, host-initiated operations, enabling overlap with compute. Additionally, device-side APIs allow fused communication and computation within user-defined kernels. Benchmarks show NVSHMEM4Py achieves native C-level performance while dramatically improving usability, empowering Python developers to build scalable multi-GPU applications without deep C/C++ expertise.

Doctoral Showcase

OAAgent: A Multimodal LLM Agent Clinical Assistant for Precision Osteoarthritis Care

2:30pm - 2:45pm CST Thursday, 20 November 2025 230

Research & ACM SRC Posters

Livestreamed

Recorded

DescriptionOsteoarthritis (OA) is a chronic condition which affects over 300 million people globally and is a leading cause of disability, yet predictive models often remain monomodal, static, and opaque to clinicians. This dissertation develops OAAgent, a multimodal large language model (LLM) clinical assistant that integrates medical images (X-ray, MRI), longitudinal clinical variables, and physician notes for personalized, interpretable OA care and prediction of progression. OAAgent employs a fusion transformer for multimodal integration, a temporal retrieval system C-TRAG with explicit cross-visit semantics to retrieve clinically similar cases (similar past trajectories), and reinforcement learning for dynamic decision-making through a Chain-of-Thought reasoning layer and Extract-and-Abstract clinical note summarization to ensure transparent, patient-specific recommendations. The dissertation addresses five critical gaps at the intersection of OA AI research:

1. Joint integration of heterogeneous modalities
2. Longitudinal temporal reasoning
3. Clinically interpretable decision support
4. Personalized treatment recommendations
5. Inclusion of underutilized narrative notes

OAAgent’s architecture is designed for extensibility through the Model Context Protocol (MCP), enabling it to interoperate with other domain-specific models, multimodal pipelines, and external reasoning agents. This creates a bridge between the LLM core and diverse analytical components, enhancing adaptability to new modalities and clinical contexts.

Developed in collaboration with the Cleveland Clinic and validated on the OAI dataset and FNIH cohort and MIMIC datasets, OAAgent demonstrates improved accuracy, temporal calibration, and interpretability. Anchored in a Trustworthy AI framework, this work advances agentic multimodal AI for healthcare, offering a scalable, ethical, and interoperable pathway toward equitable, explainable clinical decision support across chronic diseases.

Paper

ODOS-MPI: HPC-Friendly SmartNIC Offloading of Computation/Communication Kernels

10:30am - 10:52am CST Wednesday, 19 November 2025 275

Programming Frameworks

Livestreamed

Recorded

DescriptionThe increasing complexity and scale of high performance computing (HPC) workloads demand innovative approaches to optimize both computation and communication. While OpenMP has been widely adopted for intra-node parallelism and MPI for inter-node communication, emerging SmartNICs introduce new opportunities for offloading communication-intensive tasks.

In this work, we extend OpenMP to support MPI kernel offloading to SmartNICs. Our implementation integrates Open MPI communication offloading into the LLVM compiler while utilizing DOCA SDK for efficient interaction with NVIDIA BlueField DPUs. Leveraging OpenMP eliminates the need for direct low-level programming, lowering the entry barrier for domain scientists.

We demonstrate our framework’s versatility by implementing a SmartNIC-enabled version of the MPI OSU micro-benchmarks and improving the execution time of an atmospheric weather simulation by over 18%, thanks to concurrent computation and communication.

Workshop

OmniFed: A Modular Framework for Configurable Federated Learning from Edge to HPC

11:40am - 12:00pm CST Sunday, 16 November 2025 264

Partially Livestreamed

Partially Recorded

DescriptionFederated Learning (FL) is critical for edge and High Performance Computing (HPC) where data is not centralized and privacy is crucial. We present OmniFed, a modular framework designed around decoupling and clear separation of concerns for configuration, orchestration, communication, and training logic. Its architecture supports configuration-driven prototyping and code-level override-what-you-need customization. We also support different topologies, mixed communication protocols within a single deployment, and popular training algorithms. It also offers optional privacy mechanisms including Differential Privacy (DP), Homomorphic Encryption (HE), and Secure Aggregation (SA), as well as compression strategies. These capabilities are exposed through well-defined extension points, allowing users to customize topology and orchestration, learning logic, and privacy/compression plugins, all while preserving the integrity of the core system. We evaluate multiple models and algorithms to measure various performance metrics. By unifying topology configuration, mixed-protocol communication, and pluggable modules in one stack, OmniFed streamlines FL experimentation and deployment across heterogeneous environments.

Workshop

On the Compressibility of Floating-Point Data in Posit and IEEE-754 Representation

2:20pm - 2:40pm CST Monday, 17 November 2025 265

Livestreamed

Recorded

DescriptionThe IEEE 754 floating-point standard is the most used representation for real numbers in modern computer systems, despite issues in accuracy for certain applications. The posit format, which has several advantages, has been proposed as a direct drop-in replacement for IEEE floats. Many works compare the use of posits to floats in a wide range of scientific computing domains. However, there has not been any work looking into the compressibility of posit data. In this paper, we compare the compression ratios of different algorithms when the input is encoded in IEEE format and in posit format. We evaluate 5 lossless general-purpose compressors as well as several new compression algorithms synthesized by our LC framework, on 14 single-precision inputs from the SDRBench suite, encoded in float and posit format. Our results show that that 4 of the 6 compressors yield an average of 2.59\% reduction in compression ratio on posit data.

Workshop

On the Integration of Lightweight Tasks with MPI using the C++26 std::execution `Senders' API

2:30pm - 3:00pm CST Sunday, 16 November 2025 260

Livestreamed

Recorded

DescriptionIntegrating asynchronous MPI messaging with tasking runtimes requires careful handling of request polling and dispatching of associated completions to participating threads. The new C++26 Senders (std::execution) library offers a flexible collection of interfaces and templates for schedulers, algorithms and adaptors to work with asynchronous functions—and it makes explicit the mechanism to transfer execution from one context to another—essential for high performance. We have implemented the major features of the Senders API in the pika tasking runtime and used them to wrap asynchronous MPI calls such that messaging operations become nodes in the execution graph with the same calling semantics as other operations. The API allows us to easily experiment with different methods of message scheduling and dispatching completions. We present insights from our implementation on how application performance is affected by design choices surrounding the placement, scheduling and execution of polling and completion tasks using Senders.

Workshop

On the Performance and Scalability of Cloud Supercomputers: Insights from Eagle and Reindeer

5:00pm - 5:30pm CST Monday, 17 November 2025 267

Livestreamed

Recorded

DescriptionLaunch of Eagle, Azure’s hyper-scale supercomputer and the Number 3 on TOP500 list in November 2023, marked a new era where cloud providers are at the forefront of supercomputing. Despite its rapid expansion, public knowledge on the performance and scalability of cloud-based supercomputing is limited, with numerous misconceptions regarding performance implications due to virtualization layer of cloud-based systems. To address these gaps, we present a comparative analysis of two cloud-based supercomputers: Azure Eagle, a hyper-scale system ranked Number 3 on TOP500 in November 2023, and Azure Reindeer, a small-scale system ranked Number 32 on TOP500 in November 2024.

Using a comprehensive performance analysis, we highlight differences in computational efficiency and scaling characteristics of these systems in comparison to their bare-metal on-premises counterparts. We furthermore quantify the overhead from Azure's virtualization layer, demonstrating its performance implication for real-world HPC workloads to be less than 4%, with typical values ranging from 2–3%.

Birds of a Feather

Ongoing/Commissioning Liquid Cooling Systems: Case Studies and a Systematic Guide

5:15pm - 6:45pm CST Tuesday, 18 November 2025 125

State of the Practice

Livestreamed

Recorded

XO/EX

DescriptionEver-increasing compute system heat density and scale is driving liquid cooling solutions in cutting-edge supercomputers and large-scale AI training/inference systems. Ensuring peak performance and reliability hinges on robust commissioning and ongoing commissioning of data centers. This session, tailored for operational managers, facility engineers, liquid cooling vendors, architects, and engineers, looks into this critical process. Gain firsthand insights from real-world case studies, featuring cooling system commissioning at Oak Ridge National Laboratory's Frontier and Lawrence Livermore National Laboratory's El Capitan, alongside Sandia National Laboratories' advanced OCx methodologies. Learn about a community initiative to document a guideline for liquid cooling commissioning.

Workshop

Open Composer: A Web-Based Application for Generating and Managing Batch Jobs on HPC Clusters

2:20pm - 2:40pm CST Sunday, 16 November 2025 276

Livestreamed

Recorded

DescriptionSubmitting batch jobs on HPC clusters usually requires familiarity with Linux commands and job schedulers, posing a significant learning barrier for beginners. To address this issue, we have developed Open Composer, a web-based application that helps users generate and manage batch jobs on HPC clusters. Open Composer automatically generates shell scripts using web forms defined for each application while also providing a real-time preview of the generated shell scripts and allowing direct editing. This feature helps reduce both the learning curve and the risk of syntax errors while maintaining flexibility in script writing. Open Composer provides a unified interface for job submission and status monitoring, and supports reusable job parameters, dynamic form widgets, and preprocessing steps. By enhancing usability and accessibility, Open Composer aims to make HPC resources more approachable for both novice and experienced users.

Exhibitor Forum

Open Flash Platform: An Initiative for Open, Highly Efficient, Exascale AI Storage

2:30pm - 3:00pm CST Thursday, 20 November 2025 130

Exascale

Livestreamed

Recorded

XO/EX

DescriptionIT infrastructures face AI and analytics demands, driving the need for storage that leverages existing networks, cuts server counts, and frees CAPEX for AI.

Modeled on the Open Compute Project, the Open Flash Platform (OFP) initiative liberates high-capacity flash through an open architecture built on standard pNFS in every Linux distribution. Each OFP unit contains a DPU-based Linux instance and network port, so it connects directly as a peer—no additional servers.

By removing surplus hardware and proprietary software, OFP lets enterprises use dense flash efficiently, halving TCO and increasing storage density 10×. Early configurations deliver up to 48 PB in 2U and scale to 1 EB per rack, yielding a 10× reduction in rack space, power, and OPEX and a 33% longer service life.

This session explains the vision and engineering that make OFP possible, showing how open, standards-based architecture can simplify and reduce the costs of high-capacity storage.

Birds of a Feather

Open MPI State of the Union 2025

12:15pm - 1:15pm CST Wednesday, 19 November 2025 260-267

System Software

Livestreamed

Recorded

XO/EX

DescriptionOpen MPI continues to drive the state of the art in HPC. This year, we've added new features, fixed bugs, improved performance, and collaborated with many across the HPC community. We'll discuss what Open MPI has accomplished over the past year and present a roadmap for the next year.

One of Open MPI's strengths lies in its diversity: we represent many different viewpoints across the HPC ecosystem. To that end, many developers from the community will be present to discuss and answer your questions both during and after the BoF.

Workshop

Open OnDemand: Connecting Computing Power with Powerful Minds

8:40am - 9:35am CST Friday, 21 November 2025 260-267

Livestreamed

Recorded

DescriptionOpen OnDemand (openondemand.org) is an innovative, open-source, web-based portal that removes the complexities of research computing system environments from the end-client, and in so doing, reduces “time to science” for researchers by facilitating their access to research computing resources. Through Open OnDemand, research computing clients can upload and download files, create, edit, submit and monitor jobs, create and share apps, run graphical user interface-based applications and connect to a terminal, all via a web browser, with no client software to install and configure. Open OnDemand greatly simplifies access to research computing resources, freeing domain scientists from having to worry about the operating environment and instead focus on their research. It enables computer center staff to support a wide range of clients by simplifying the user interface and experience. The overall impact is that clients can use remote computing resources faster and more efficiently. This presentation will provide an overview of Open OnDemand and detail some of the success stories that have been generated from the global community of over 2,100 research computing centers that utilize it.

Workshop

Open Panel - Digital Twins for HPC and the Role of AI

12:15pm - 12:40pm CST Sunday, 16 November 2025 274

Livestreamed

Recorded

DescriptionThe primary topics for this panel include the role of AI in Digital Twins among the panelists and the audience. An additional goal of the panel is to provide a way for workshop attendees to address their questions arising from all the workshop presentations given earlier and raise their own topics and areas of interest in a thought provoking panel style discussion.

Workshop

Open QHPC Software Ecosystem

12:00pm - 12:30pm CST Sunday, 16 November 2025 275

Livestreamed

Recorded

DescriptionThe Open QHPC Software Ecosystem (openQSE) initiative is a community-driven effort aimed at defining a common specification for the emerging Quantum–High Performance Computing (QHPC) software stack. Its goal is to enable interoperability across diverse hardware and software platforms, allowing vendors, national laboratories, and academic institutions to develop components that seamlessly integrate within a unified ecosystem. This talk will provide an overview of the initiative’s objectives and progress, highlighting the inaugural workshop held at Oak Ridge National Laboratory on July 25, 2025. The workshop convened key stakeholders from industry, academia, and national laboratories to discuss shared challenges in quantum–HPC integration and to establish the foundational direction for the openQSE effort.

Birds of a Feather

Open-Source Hardware Tools: Advancing Architecture Research and Chip Prototyping

12:15pm - 1:15pm CST Wednesday, 19 November 2025 230

Architectures & Networks

Livestreamed

Recorded

XO/EX

DescriptionAs demand for architectural innovation accelerates in the post-Moore era of HPC and scientific edge computing, the importance of accessible and scalable design methodologies has never been greater. This BoF will focus on the critical role of open-source hardware tools in advancing research and accelerating chip prototyping. Modern chip development increasingly relies on large-scale computing for simulation, formal verification, power–performance–area analysis, and physical layout. These workflows not only require substantial computational resources but also create opportunities to integrate HPC and AI into design processes in transformative ways. Open-source tools provide a unique opportunity to lower barriers to entry, enable reproducibility, and foster innovation across diverse communities.

The session invites contributions from both hardware and software communities, spanning circuit-level abstractions, design-space exploration, and scalable toolchains. Special invited speakers include Mark Ren (NVIDIA, U.S.), Lilia Zaourar (CEA, Europe), and Toru Niina (RIKEN, Japan), representing global perspectives from industry, national laboratories, and academia. Their talks will highlight international efforts that demonstrate the power of open-source methodologies to accelerate chip design and strengthen collaboration across regions.

To maximize interaction, this BoF will include impromptu pitches from attendees and an open Q&A discussion. This is not just a listening session—it is an opportunity for participants to co-create future directions, exchange ideas, and form new collaborations. We encourage all attendees to be actively involved in shaping the outcomes.

Expected results include actionable insights, new collaborations across traditionally separate communities, and concrete follow-up efforts to advance open, reproducible, and scalable hardware design.

Birds of a Feather

OpenACC Users Forum

5:15pm - 6:45pm CST Tuesday, 18 November 2025 230

Programming Frameworks

Livestreamed

Recorded

XO/EX

DescriptionThe Open Accelerated Computing Consortium (OpenACC) supports researchers and developers to advance science by nurturing parallel computing skills and offering a directive-based, high-level parallel programming model for CPUs, GPUs, and more. Additionally, OpenACC organizes 25 global hackathons annually and facilitates the acceleration of 200+ applications on platforms such as Frontier, Perlmutter, JUWELS, LUMI, Alps, and Miyabi. This community BoF serves as a forum to openly discuss the status and future of OpenACC. Opening presentation will be led by OpenACC officers, compiler implementers, and invited users, such as the NASA OVERFLOW CFD project, followed by an open audience fish bowl discussion.

Birds of a Feather

OpenHPC and the Future of Open-Source HPC Provisioning Featuring Warewulf, Confluent, and OpenCHAMI

12:15pm - 1:15pm CST Thursday, 20 November 2025 230

Practitioners in HPC

Livestreamed

Recorded

XO/EX

DescriptionThe landscape of HPC cluster provisioning is rapidly evolving, with innovative open-source solutions emerging to meet modern computational demands. This BoF showcases leading open-source provisioning platforms through lightning talks from the Warewulf, Confluent, and OpenCHAMI communities, preceded by an update on the OpenHPC project. The session will highlight recent advances in container-based provisioning, cloud-native HPC management, and security-focused deployment strategies. Community members will engage in interactive discussions about best practices, interoperability challenges, and future collaboration opportunities. This forum aims to strengthen the open-source provisioning ecosystem and foster cross-project innovation for next-generation HPC infrastructure.

Workshop

Opening Invited Talk: Photonics-Enabled Systems for the Disaggregated Era of Supercomputing and AI

9:05am - 10:00am CST Sunday, 16 November 2025 265

Livestreamed

Recorded

DescriptionAs AI and HPC workloads push the limits of performance, power, and scalability, the industry faces a fundamental architectural shift. The traditional boundaries between compute, memory, and networking are dissolving, giving rise to disaggregated systems—composed from modular chiplets and accelerators that must communicate with near-monolithic efficiency. In this landscape, photonics is emerging not just as an I/O technology, but as the foundation for next-generation system architecture.
This keynote will explore how advances in integrated photonics—including silicon photonic interposers, co-packaged optics, and optical circuit fabrics—are transforming the design of large-scale AI and supercomputing platforms. By delivering orders-of-magnitude improvements in bandwidth density, latency uniformity, and energy per bit, photonics interconnects enable the flexible composition of heterogeneous compute elements across dies, packages, and system racks.
The talk will outline the evolution from electrical domain limits to photonic-domain scalability, highlighting how photonic fabrics can unify on-package, board-level, and cluster-scale communication. It will examine the interplay between photonics, packaging, and network topology, and discuss emerging opportunities in optically reconfigurable architectures for AI model training and HPC workflows. Ultimately, it will argue that photonics is not just a bandwidth solution—but the enabler of a new class of composable, memory-centric, and energy-efficient supercomputing systems.

Workshop

9:01am - 9:10am CST Sunday, 16 November 2025 264

Partially Livestreamed

Partially Recorded

Workshop

9:01am - 9:05am CST Sunday, 16 November 2025 260

Livestreamed

Recorded

Workshop

8:31am - 8:35am CST Friday, 21 November 2025 263-264

Livestreamed

Recorded

Workshop

Opening Remarks from Workshop Chair

2:00pm - 2:03pm CST Sunday, 16 November 2025 242

Livestreamed

Recorded

Workshop

9:01am - 9:10am CST Monday, 17 November 2025 240

Livestreamed

Recorded

Workshop

OpenSHMEM MLIR: A Dialect for Compile-Time Optimization of One-Sided Communications

3:30pm - 4:00pm CST Monday, 17 November 2025 260

Livestreamed

Recorded

DescriptionCommunication increasingly limits performance in high-performance computing (HPC), yet mainstream compilers focus on computation because communication intent is lost early in compilation. OpenSHMEM offers a one-sided Partitioned Global Address Space (PGAS) model with symmetric memory and explicit synchronization, but lowering to opaque runtime calls hides these semantics from analysis.

We present an OpenSHMEM dialect for Multi-Level Intermediate Representation (MLIR) that preserves one-sided communication, symmetric memory, and team/context structure as first-class intermediate representation (IR) constructs. Retaining these semantics prior to lowering enables precise, correctness-preserving optimizations that are difficult to recover from LLVM IR. The dialect integrates with existing MLIR/LLVM passes while directly representing communication and synchronization intent.

We demonstrate four transformations: recording the number of processing elements, fusing compatible atomics, converting blocking operations to non-blocking forms when safe, and aggregating small messages. These examples show how explicit OpenSHMEM semantics enable communication-aware optimization and lay the groundwork for richer cross-layer analyses.

Birds of a Feather

OpenSHMEM: Version 1.7, Heterogeneous Devices, and Emerging Frontiers

5:15pm - 6:45pm CST Tuesday, 18 November 2025 130

Programming Frameworks

Livestreamed

Recorded

XO/EX

DescriptionOpenSHMEM is a PGAS API for single-sided asynchronous scalable communications in HPC applications. OpenSHMEM is a community-driven standard for this API across multiple architectures/implementations. This BoF brings together the OpenSHMEM community to present the latest accomplishments since the release of the 1.6 specification, and discuss future directions for the OpenSHMEM community as we develop version 1.7 and beyond. The BoF will consist of talks from end-users, implementers, and middleware and tool developers to discuss their experiences and plans for using OpenSHMEM. We will then open the floor for discussion of the specification and our mid-to-long-term goals.

Birds of a Feather

Operational Data Analytics: Mind the Gap

12:15pm - 1:15pm CST Thursday, 20 November 2025 231-232

State of the Practice

Livestreamed

Recorded

XO/EX

DescriptionOperational data analytics (ODA) provides unique opportunities to analyze, understand, and optimize operations of HPC systems. However, those opportunities are often missed because access to data is restricted, siloed, or misaligned with the people best positioned to act on it. System administrators, researchers, and HPC users could combine their skills to optimize efficiency of HPC systems if data and expertise were shared between all parties.

How can we bridge these gaps and make more operational data available to more stakeholders in an easily digestible way while still maintaining operational safety, privacy, and legal requirements? What data is relevant to whom?

Invited Talk

Opportunities in Egocentric Video Understanding

8:40am - 9:20am CST Thursday, 20 November 2025 America's Ballroom Tu-Th

Applications

Livestreamed

Recorded

DescriptionForecasting the rise of wearable devices equipped with audio-visual feeds, this talk will present opportunities for research in egocentric video understanding. The talk argues for new ways to foresee egocentric videos as partial observations of a dynamic 3D world, where objects are out of sight but not out of mind. I’ll review new data collection and annotation HD-EPIC (https://hd-epic.github.io/) that merges video understanding with 3D modeling, showcasing current failures of VLMs in understanding the perspective outside the camera’s field of view—a task trivial for humans.

All project details are at .

Research and ACM SRC Posters

Optimizing and Extending Periodogram Computations for Astronomy

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionThis work extends nifty-ls, a high-performance Lomb–Scargle periodogram implementation, with multiple Fourier terms harmonic fitting with OpenMP-parallelized/GPU-parallelized methods.

We leverage a fast and accurate spreading kernel (the "exponential of semicircle") from the Flatiron Institute Nonuniform Fast Fourier Transform (FINUFFT) to replace the extrapolation method of Press and Rybicki when computing trigonometric sums.

To efficiently process large batches of short or variably-sized time series, we introduce a heterobatch model that uses "kernel fusion." This approach wraps the entire workflow—preprocessing, FINUFFT execution, and postprocessing—into a single, per-series C++ pipeline via nanobind. By operating entirely in C++, the model avoids Python’s Global Interpreter Lock (GIL) and integrates seamlessly with OpenMP. This tight coupling allows OpenMP's dynamic scheduling to treat each time series as an independent unit of work, effectively balancing the computational load.

Research and ACM SRC Posters

Optimizing Collectives with Large Payloads on GPU-Based Supercomputers

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionWe evaluate the current state of collective communication on GPU-based supercomputers for large language model (LLM) training at scale. Existing libraries such as RCCL and Cray-MPICH exhibit critical limitations on systems such as Frontier—Cray-MPICH underutilizes network and compute resources, while RCCL suffers from severe scalability issues. To address these challenges, we introduce PCCL, a communication library with highly optimized implementations of all-gather and reduce-scatter operations tailored for distributed deep learning workloads. PCCL is designed to maximally utilize all available network and compute resources and to scale efficiently to thousands of GPUs. It achieves substantial performance improvements, delivering 6-33x speedups over RCCL and 28-70x over Cray-MPICH for all-gather on 2,048 GCDs of Frontier. These gains translate directly to end-to-end performance: in large-scale GPT-3-style training, PCCL provides up to 60% and 40% speedups over RCCL for 7B and 13B parameter models, respectively.

Paper

Optimizing Data Acquisitions in Multi-Robot Systems

11:37am - 12:00pm CST Wednesday, 19 November 2025 260-267

Applications

Architectures & Networks

Livestreamed

Recorded

DescriptionWe present ROSfs, a novel user-level file system designed to address critical data query inefficiencies in multi-robot systems (MRS). ROSfs introduces an innovative file organization model where robot data is structured as labeled sub-files, coupled with a time-indexed architecture that enables efficient querying of actively modified data. This design enables real-time cross-robot data acquisition and collaboration capabilities previously unattainable in MRS deployments. Our implementation integrates seamlessly with the Robot Operating System (ROS) and has been extensively evaluated using both physical UAV/UGV platforms and data servers. Experimental results demonstrate that ROSfs achieves a 7x reduction in online data query latency under wireless network conditions compared to conventional ROS storage methods, while simultaneously improving data freshness (Age of Information) by up to 271x. These advancements position ROSfs as a transformative solution for high-performance robotic data management in distributed systems.

Workshop

Optimizing Microgrid Composition for Sustainable Data Centers

3:45pm - 4:00pm CST Sunday, 16 November 2025 264

Livestreamed

Recorded

DescriptionAs computing energy demand continues to grow and electrical grid infrastructure struggles to keep pace, an increasing number of data centers are being planned with colocated microgrids that integrate on-site renewable generation and energy storage. However, while existing research has examined the tradeoffs between operational and embodied carbon emissions in the context of renewable energy certificates, there is a lack of tools to assess how the sizing and composition of microgrid components affects long-term sustainability and power reliability.

In this paper, we present a novel optimization framework that extends the computing and energy system co-simulator Vessim with detailed renewable energy generation models. Our framework simulates the interaction between computing workloads, on-site renewable production, and energy storage, capturing both operational and embodied emissions. We use a multi-horizon black-box optimization to explore efficient microgrid compositions and enable operators to make more informed decisions when planning energy systems for data centers.

Workshop

Optimizing Network Resilience Using Domain-Specific Hardware Accelerator for Dynamic Programming

12:40pm - 12:50pm CST Sunday, 16 November 2025 266

short paper

Livestreamed

Recorded

DescriptionDynamic programming accelerators (DPAs) are devices designed with an instruction set optimized for dynamic programming (DP) operations. DP is fundamental to solving complex networking problems, particularly those involving fault tolerance and routing under dynamic conditions. This paper explores the use of DPA to accelerate network resilience by implementing an optimal routing algorithm that efficiently identifies alternative paths in response to link failures. The system’s performance is evaluated by comparing DPA implementation against conventional GPU-based and CPU-based solutions. Results show that DPA provides significant performance improvements, enabling faster recovery and improved robustness in dynamic network environments.

Paper

Optimizing Quantum Circuit Mapping to Reduce Inter-Module Communications in Distributed Architectures

4:37pm - 5:00pm CST Tuesday, 18 November 2025 260-267

Post-Moore Computing

Quantum Computing

Livestreamed

Recorded

DescriptionModular quantum architectures have emerged as a promising solution for scalable quantum computing systems. Executing circuits in such distributed systems necessitates non-local operations between modules, incurring significant communication overhead. In this work, an optimized quantum circuit mapping technique called DQTetris is proposed to reduce inter-module communications. DQTetris employs a hierarchical framework that first seeks a global communication-free qubit mapping assignment under module capacity constraints. If infeasible, it searches for subcircuits with local communication-free qubit assignments via layer-wise gate pruning. Executing adjacent subcircuits with different qubit assignments incurs inter-module data teleportation. DQTetris minimizes these overheads by reducing qubit reassignment events through optimal circuit segmentation, qubit assignment selection, and adaptive gate teleportation. Experiments show that compared with existing methods, DQTetris can achieve average reductions in communication costs ranging from 28% to 75% across various benchmarks.

Research and ACM SRC Posters

Optimizing Task-Driven Offloading in LLVM

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionWe investigate an inefficiency in the LLVM OpenMP runtime related to accelerator offloading. The current implementation manages asynchronous GPU tasks by polling async handles, which introduces CPU overhead. We propose replacing this polling model with an event-driven approach that detaches target tasks by default. In our design, each asynchronous task is associated with an event that is fulfilled once the GPU kernel completes, allowing the task to yield execution. This eliminates repeated polling and reduces scheduling overhead. We implemented this mechanism using existing features in the LLVM OpenMP runtime, relying on a host callback function provided by CUDA. Experiments on NVIDIA H100 GPUs show runtime improvements of up to 75% for independent tasks once matrix sizes exceed 128×128, with benefits appearing at even smaller sizes when task dependencies are present. For large kernels, the effect diminishes as execution time dominates.

Research and ACM SRC Posters

Optimizing the GPU All-Reduce Using Multiple Processes Per GPU

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionLarge inter-GPU all-reduce operations, prevalent throughout deep learning, are bottlenecked by communication costs. Emerging heterogeneous architectures are comprised of complex nodes, often containing four GPUs and dozens to hundreds of CPU cores per node. Parallel applications are typically accelerated on the available GPUs, using only a single CPU core per GPU while the remaining cores sit idle. This poster presents novel optimizations to large GPU-aware all-reduce operations, extending lane-aware reductions to the GPUs, and notably using multiple CPU cores per GPU to accelerate these operations. These multi-CPU-accelerated GPU-aware lane all-reduces using an intermediate host buffer yield speedup of up to 2.45x for large MPI all-reduces across the NVIDIA A100 GPUs of NCSA's Delta supercomputer. Finally, the approach is extended to GPUDirect RDMA communication, yielding speedup of 1.17x for large all-reduces.

ACM Gordon Bell Climate Modeling Finalist

ACM Gordon Bell Finalist

Awards and Award Talks

ORBIT-2: Scaling Exascale Vision Foundation Models for Weather and Climate Downscaling

2:37pm - 3:00pm CST Tuesday, 18 November 2025 261-262-265-266

Applications

GBC

Livestreamed

Recorded

DescriptionSparse observations and coarse-resolution climate models limit regional decision-making, underscoring the need for robust downscaling. However, existing AI methods struggle with generalization across variables and geographies and are constrained by the quadratic complexity of Vision Transformer (ViT) self-attention. We introduce ORBIT-2, a scalable foundation model for global, high-resolution climate downscaling. ORBIT-2 incorporates two key innovations: (1) Residual Slim ViT (Reslim), a lightweight architecture with residual learning and Bayesian regularization for efficient, robust prediction; and (2) TILES, a tile-wise sequence-scaling algorithm that reduces self-attention complexity from quadratic to linear, enabling long-sequence processing and massive parallelism. ORBIT-2 scales to 10 billion parameters across 32,768 GPUs, achieving up to 1.8 ExaFLOPS sustained throughput and 92%–98% strong scaling efficiency. It supports downscaling to 0.9 km global resolution and processes sequences up to 4.2 billion tokens. At 7 km resolution, ORBIT-2 achieves high accuracy with $R^2$ scores in the range of 0.98–0.99 against observational data.

Tutorial

Orchestrating Complex HPC and AI/ML Workflows on Kubernetes Using Flux and AWS

1:30pm - 5:00pm CST Sunday, 16 November 2025 123

Livestreamed

Recorded

TUT

DescriptionScientific computing workflows are growing increasingly complex, combining diverse computational patterns, heterogeneous resources, and sophisticated dependencies that challenge traditional orchestration tools. Meanwhile, cloud and AI architectures are driving Kubernetes adoption for these workloads. Deploying workflow components that provide the performance and features required for HPC simulations and applications remains challenging in this environment. This tutorial demonstrates a portability layer to solve this problem—integration of the Flux Framework with Kubernetes to efficiently manage complex scientific workflows on Amazon Web Services (AWS). Participants will learn how Flux’s hierarchical resource management and graph-based scheduling capabilities extend Kubernetes to support diverse workflows. The tutorial progresses from foundational infrastructure concepts to advanced Flux capabilities, culminating in deploying MuMMI (Multiscale Machine-learned Modeling Infrastructure)—a scientific workflow exemplifying emerging complexity through combined large-scale simulations and machine learning. Through lectures and hands-on labs using Amazon EKS, attendees will experience how this architecture supports demanding workflows while maintaining portability across on-premises, cloud, and hybrid environments. Using practical examples, participants will gain applicable skills for orchestrating complex workflows in various computing environments. In the end, attendees will learn how to build efficient, scalable, and flexible environments for complex scientific workflows using Kubernetes, Flux, and cloud infrastructure.

Workshop

Orchestrating Quantum-HPC Workflows with Distributed Quantum Circuit Cutting

10:30am - 10:45am CST Friday, 21 November 2025 275

Livestreamed

Recorded

DescriptionMost quantum computers today are constrained by hardware limitations, particularly the number of available qubits, causing significant challenges for executing large-scale quantum algorithms. Circuit cutting has emerged as a key technique to overcome these limitations by decomposing large quantum circuits into smaller subcircuits that can be executed independently and later reconstructed. Qdislib is a distributed and flexible library for quantum circuit cutting, designed to seamlessly integrate with hybrid quantum-classical HPC systems. Qdislib employs a graph-based representation of quantum circuits to enable efficient partitioning, manipulation and execution, supporting both wire and gate cutting techniques. The library is compatible with multiple quantum programming languages, including Qiskit and Qibo, and leverages distributed computing to execute subcircuits across CPUs, GPUs, and quantum processing units in a fully parallelized manner. The paper describes Qdislib and demonstrates how it enables the distributed execution of quantum circuits across heterogeneous resources, showcasing its potential for scalable quantum-classical workflows.

Best Poster Presentations (Research, ACM SRC Grad/Undergrad)

Orchid: Towards Heterogeneous Batched Eigenvalue Solvers

1:30pm - 1:45pm CST Wednesday, 19 November 2025 230

Research & ACM SRC Posters

DescriptionThere is a growing need for the efficient solution of many small eigenvalue problems (up to N = 1500) that arise in emerging scientific applications. These small-to-medium sized problems present unique computational challenges, particularly when thousands or millions of such problems must be solved repeatedly. This work presents Orchid, a novel distributed, heterogeneous, batched eigenvalue solver based on the IRIS runtime. Orchid can utilize all compute platforms in both heterogeneous nodes and clusters by harnessing the capabilities of the IRIS architecture. Orchid leverages heterogeneous architectures across multiple nodes by partitioning the application task DAG intelligently and orchestrates multiple instances of the IRIS runtime via MPI. We evaluate our proposal against two heterogeneous hardware configurations and Frontier, demonstrating Orchid’s performance utilizing both intra-node and inter-node heterogeneity.

Workshop

Overcoming Dynamic I/O Boundaries: a Double-Sided Streaming Methodology with dispel4py and CAPIO

3:30pm - 3:48pm CST Monday, 17 November 2025 264

Livestreamed

Recorded

DescriptionThis work introduces a novel double-sided streaming methodology that combines control-plane and data-plane streaming. Our
goal is to implement the long-advocated separation of concerns in workflow orchestration without introducing artificial boundaries in their execution. Our approach is exemplified by the integration of control-plane streaming provided by dispel4py and the transparent data-plane streaming provided by CAPIO. Our integration eliminates file synchronization barriers without requiring modifications to existing workflow logic. To support this, we extend CAPIO with a new commit rule that allows streaming over dynamically generated file sets, enabling hybrid workflows that blend in-memory dataflows with file-based communication. We validate our approach using a real-world seismic cross-correlation workflow, achieving performance improvements between 23% and 40%. Unlike previous solutions, our method supports streaming across the entire workflow, including phase boundaries where file I/O would typically enforce strict execution ordering. Therefore, our approach can be straightforwardly extended to other multi-stage streaming applications.

Workshop

Overhead Quantification of the Lightweight Distributed Metric Service for High-Performance Computers

2:35pm - 2:40pm CST Monday, 17 November 2025 276

Partially Livestreamed

Partially Recorded

DescriptionThe Lightweight Distributed Metric Service (LDMS) is a monitoring framework that collects high-fidelity, high-volume node-level data on large distributed computer systems. LDMS is built to introduce negligible overhead in application workloads which has been verified in several scale tests since its inception in 2014. However, new communication strategies, sensor samplers, and fundamental data structures within the core LDMS code have been introduced that could increase the overhead. In this study, we quantify the current overhead that LDMS introduces and verify that it is insignificant. This was done through a variety of benchmarks and applications where we captured timing and performance statistics while LDMS ran with different configurations.

Workshop

Panel Discussion

4:45pm - 5:25pm CST Monday, 17 November 2025 266

Livestreamed

Recorded

Workshop

Panel Discussion

11:20am - 11:59am CST Friday, 21 November 2025 276

Livestreamed

Recorded

Workshop

Panel Discussion - Challenges in Q-HPC Software Frameworks & Workload Management

11:35am - 12:00pm CST Friday, 21 November 2025 275

Livestreamed

Recorded

Workshop

Panel Discussion: Managing a Python Environment for Everyone

4:45pm - 5:30pm CST Monday, 17 November 2025 242

Livestreamed

Recorded

Workshop

Panel Discussion: Navigating the Future of Scientific Workflows

2:15pm - 3:00pm CST Monday, 17 November 2025 264

Livestreamed

Recorded

Workshop

Panel Discussion: The Role of Alternatives to MPI+X Technologies in AI/ML

4:30pm - 5:30pm CST Sunday, 16 November 2025 230

Livestreamed

Recorded

Workshop

Panel Session #1: AI, Cancer, the Future

9:05am - 9:45am CST Monday, 17 November 2025 241

Livestreamed

Recorded

Workshop

Panel: Energy-efficient Memory Technology for maximizing bandwidth and reducing latency

2:05pm - 3:00pm CST Sunday, 16 November 2025 274

Livestreamed

Recorded

Workshop

Panel: NIST 800-234 : High-Performance Computing (HPC) Security Overlay

4:55pm - 5:25pm CST Sunday, 16 November 2025 242

Livestreamed

Recorded

DescriptionThis NIST Special Publication introduces an HPC security overlay built upon the moderate baseline defined in SP 800-53B. The overlay tailors 60 security controls with supplemental guidance and/or discussions to enhance their applicability in HPC contexts. This overlay aims to provide practical, performance-conscious security guidance that can be readily adopted.

Workshop

Panel: Storage Architectures and I/O Optimizations for AI Applications

4:40pm - 5:30pm CST Monday, 17 November 2025 230

Data Analytics

High Performance I/O, Storage, Archive, & File Systems

Storage

Livestreamed

Recorded

DescriptionDeep learning workloads are driving the HPC landscape, from foundation models and surrogate models to emerging agentic workflows. The I/O characteristics of these workloads challenge the assumptions underlying traditional HPC storage systems and libraries: whereas classical modeling/simulation workflows favor large, sequential writes and predictable access patterns, AI workflows demand random access I/O for dataset shuffling and burst bandwidth for model checkpoints, frequently simultaneously. The coupling of simulations with AI models further complicates storage requirements. In this panel, we will examine the challenges of managing data movement and storage for AI workloads, the requirements of I/O systems and how existing ones must evolve, and the open challenges for the field.

Workshop

Panel: The future of MPI in a world driven by AI

4:00pm - 5:30pm CST Sunday, 16 November 2025 260

Livestreamed

Recorded

Workshop

Paper Session 1 Q&A

9:50am - 10:00am CST Sunday, 16 November 2025 261

Livestreamed

Recorded

Workshop

Paper Session 2 Q&A

11:15am - 11:30am CST Sunday, 16 November 2025 261

Livestreamed

Recorded

Workshop

Paper Session 3 Q&A

4:15pm - 4:30pm CST Sunday, 16 November 2025 261

Livestreamed

Recorded

Workshop

PAPI Support for Specialized AI Architectures

4:10pm - 4:15pm CST Monday, 17 November 2025 230

Data Analytics

High Performance I/O, Storage, Archive, & File Systems

Storage

Livestreamed

Recorded

DescriptionThis work focuses on developing unique performance monitoring capabilities in PAPI to extend support for specialized AI chips to help researchers identify hardware-specific bottlenecks and improve AI architectures. However, the lack of traditional hardware counters on these specialized devices poses a unique challenge and makes it necessary to develop alternative approaches for performance monitoring. PAPI’s Software Defined Events (SDEs) is one such promising approach since it provides a portable way to capture and expose software-level metrics from within applications. Our proof-of-concept implementation on traditional processors uses the vendor-agnostic HPL-MxP benchmark instrumented with PAPI SDE to register custom events, such as sde_io_read_bytes and sde_float16, to track memory and network I/O, and the floating-point precision usage throughout the workload. Results showed close agreement between SDE and hardware counts, with a mean ± standard deviation difference of 0.310% ± 5.315%, providing confidence in applying the SDE approach to systems without accessible hardware counters.

Tutorial

Parallel Computing 101

8:30am - 5:00pm CST Sunday, 16 November 2025 130

Livestreamed

Recorded

TUT

DescriptionThis tutorial provides a comprehensive overview of parallel computing, emphasizing those aspects most relevant to the user. It is suitable for new users, students, managers, and anyone seeking an overview of parallel computing. It discusses software and hardware/software interaction, with an emphasis on standards, portability, and systems that are widely available. The tutorial surveys basic parallel computing concepts, using examples from multiple engineering, scientific, and machine learning problems. These examples illustrate using MPI on distributed memory systems; OpenMP on shared memory systems; MPI+OpenMP on hybrid systems; and CUDA and compiler directives on GPUs and accelerators. The tutorial discusses numerous parallelization and load balancing approaches; performance improvement tools; and an overview of recent developments such as machine learning based on accelerators and parallel versions of Python. The tutorial helps attendees make intelligent decisions by covering the primary options that are available, explaining how the different components work together and what they are most suitable for. Extensive pointers to web-based resources are provided to facilitate follow-up studies.

Workshop

Parallel Data Object Creation: Scalable Metadata Management in Parallel I/O Library

11:30am - 12:00pm CST Monday, 17 November 2025 230

Data Analytics

High Performance I/O, Storage, Archive, & File Systems

Storage

Livestreamed

Recorded

DescriptionHigh-level I/O libraries, such as PnetCDF and HDF5, are commonly used by large-scale scientific applications to perform I/O tasks in parallel. These I/O libraries store the metadata of data objects in files along with their raw data. To ensure metadata consistency during parallel data object creation, they require applications to call the metadata APIs collectively using consistent metadata. Such a requirement can result in an expensive consistency check, as its cost increases with the metadata volume and the number of processes. To address this limitation, we propose a new file header format, which uses partitioned metadata blocks to enable independent data object creation and reduce the objects required for consistency check. Our performance evaluation shows that this new design achieves a scalable performance, cutting data object creation times by up to 196x when running on 4096 MPI processes to create 5,684,800 data objects in parallel.

Research and ACM SRC Posters

Parallel Local Motif Counting on Large-Scale Dynamic Graphs

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionGraph motifs—small subgraphs such as triangles and cliques—are key tools for comparing and aligning networks in domains ranging from biology to social sciences. While recent advances enable motif counting in billion-edge networks, existing methods focus mainly on global frequencies. Building on ParaDyMS, we introduce a method to compute local edge-level motif frequencies, capturing the motifs incident to each edge. Experiments on real-world networks show that our approach achieves competitive performance against state-of-the-art static algorithms and demonstrate its scalability on shared memory systems and GPUs.

Paper

Parallel Rank-Adaptive Higher-Order Orthogonal Iteration

1:30pm - 1:52pm CST Thursday, 20 November 2025 260-267

Algorithms

Livestreamed

Recorded

DescriptionHigher-order orthogonal iteration (HOOI) is an iterative algorithm that computes a Tucker decomposition of fixed ranks of an input tensor. In this work we modify HOOI to determine ranks adaptively subject to a fixed approximation error, apply optimizations to reduce the cost of each HOOI iteration, and parallelize the method in order to scale to large, dense datasets. We show that HOOI is competitive with the sequentially truncated higher-order singular value decomposition (ST-HOSVD) algorithm, particularly in cases of high compression ratios. Our proposed rank-adaptive HOOI can achieve comparable approximation error to ST-HOSVD in less time, sometimes achieving a better compression ratio. We demonstrate that our parallelization scales well over thousands of cores and show, using three scientific simulation datasets, that HOOI outperforms ST-HOSVD in high-compression regimes. For example, for a 3D fluid-flow simulation dataset, HOOI computes a Tucker decomposition 82x faster and achieves a compression ratio 50% better than ST-HOSVD's.

Research and ACM SRC Posters

ParaViz3D: MPI Trace Visualization with 3D Video

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionConventional MPI trace visualizations become unwieldy as the number of processes increases. Identifying global patterns is challenging, and both the underlying machine topology and parallel program structure are often obscured by the rigid two-dimensional rank-time graphs. To address these limitations, we propose a three-dimensional video visualization of MPI programs that adapts its spatial layout to both the machine topology and the parallel decomposition of the numerical problem, while mapping process time evolution to actual video playback time.

Using Blender, we develop a framework that automatically translates MPI trace data into 3D visual scenes. We explore various display choices, including color-time gradients and transparency, to enhance interpretability. Our approach provides a new, intuitive exploration of both large-scale and local behaviors and patterns. We showcase the utility of this method with idle wave and process desynchronization phenomena.

Workshop

PathLlama: A Language Model for Automated Cancer Surveillance

11:10am - 11:30am CST Monday, 17 November 2025 241

Livestreamed

Recorded

DescriptionTransforming unstructured information into structured common
data models (CDM) is a critical step for enabling cancer
surveillance and advancing precision medicine. CDMs standardize
the structure and content of oncologic data extracted
from electronic health records. Unfortunately, traditional Extract
Transform Load processes for electronic health data capture
are generally rule-based, error-prone, and produce static
datasets unsuitable for near real-time information retrieval.

The Modeling Outcomes using Surveillance Data and Scalable
AI for Cancer (MOSSAIC) project developed and deployed
a hierarchical self-attention (HiSAN) model capable
of autocoding approximately 30% of National Cancer Institute
Surveillance, Epidemiology, and End Results (SEER) registry
cancer pathology reports [1], [2]. While a significant step
forward, this falls short of the broader goal of automatically
coding all pathology reports. Fully automating CDM conversion
would facilitate clinical trial matching, decision support
dashboards, real-time case ascertainment, and population
health surveillance.

The distribution of cancer phenotypes in real-world data is
highly imbalanced. While HiSAN performs well on classes
well-represented during training, its accuracy and confidence
degrade substantially for less common categories. Large language
models (LLMs) offer a promising solution for underrepresented
oncological entities, owing to their ability to
leverage context and pretraining. Rather than relying solely
on general-purpose models, domain adaptation or continual
pretraining of LLMs may further improve performance by
helping models learn the specialized vocabulary, abbreviations,
and context typical of clinical text. In this study, we finetune
LLMs for SEER pathology report classification, with and
without additional domain-adaptive pretraining, and compare
the results to the HiSAN baseline [2].

Based on Llama 3 8B, PathLlama was developed by finetuning
for cancer pathology report classification, with and without
domain adaptation. The domain adaptation task was next token
prediction and the pretraining dataset was composed of a
large corpus of approximately 10M cancer pathology reports
and abstracts from SEER and about 500k clinical notes and
radiology reports from MIMIC [3]. The PathLlama models
were finetuned to classify site (70 categories), subsite (330),
laterality (7), histology (677), and behavior (4). The finetuning
dataset was 4052951 reports from six SEER registries:
Kentucky, Louisiana, New Jersey, New Mexico, Seattle/Puget
Sound, and Utah. The finetuning dataset was randomly split
into 80%/10%/10% for training, test, and validation, ensuring
all reports associated with a single case belong to the same
split.

Finetuning results are shown in Table I. We observe that
the micro F1 scores, dominated by majority classes due the
imbalance in the dataset, improve only slightly from the
HiSAN to either of the PathLlama models. The most notable
improvements in micro F1 come from the domain-adapted
PathLlama for subsite and laterality. In contrast, more significant
improvements occur for macro F1, particularly for subsite,
laterality, and histology. For these three tasks, the domain-adapted
PathLlama model also substantially outperforms the
PathLlama base model. From these macro F1 results, we find
that the contextual and pretraining advantages of Llama itself
are indeed sufficient to markedly improve classification performance
on underrepresented classes. However, domain adaptation
offers additional benefit, further enhancing performance
that justifies the increased computational cost associated with
extended pretraining.

Workshop

PathPCNet: Pathway Principal Component-Based Interpretable Framework for Drug Sensitivity Prediction

11:55am - 12:10pm CST Monday, 17 November 2025 241

Livestreamed

Recorded

DescriptionBackground: Precision medicine aims to identify significant biomarkers and effective drugs based on individual genomic profiles, enabling personalized treatment strategies. Drug efficacy is commonly assessed via drug response, typically measured by the concentration required to inhibit a biological activity (e.g., IC50). In contrast, drug sensitivity reflects the strength of a tumor's response to a drug, where a lower effective dose indicates higher sensitivity. With the increased availability of large-scale multi-omics datasets, machine learning (ML) and deep learning approaches have emerged as powerful tools for studying drug response--holding great promise for accelerating biomarker discovery and enabling the development of more effective therapeutics.

Methods: We present `PathPCNet`, a novel interpretable deep learning framework that integrates multi-omics data (copy number variation, mutation, and RNA sequencing) with biological pathways, drug molecular structures, and Principal Component Analysis (PCA) to predict drug response. We project high-dimensional, noisy gene-level features to pathway-level principal components, and evaluate six machine learning models using the first one to five principal components. Our models are trained to predict the IC50 values for 182 drugs across 409 cell lines representing 29 cancer types from the GDSC (Genomics of Drug Sensitivity in Cancer) dataset. Finally, we fine-tune the deep learning model and apply SHAP to interpret feature contributions. SHAP scores are back-projected from the principal components to original genes using PCA loadings, enabling identification of the most significant genes.

Results: Our model achieves a Pearson correlation coefficient of 0.941 and an R-squared value of 0.885, outperforming existing pathway-based approaches for drug response prediction. Using SHAP-based model interpretation, we quantify the contributions of different omics and drug features, and identify critical pathways and gene-drug interactions involved in resistance mechanisms. These results highlight the potential of integrative deep learning models not only for accurate prediction, but also for uncovering biologically meaningful insights that can inform drug discovery and precision oncology. Furthermore, our framework enables the identification of key pathways, genes, and atomic-level drug attributes associated with drug sensitivity across diverse cancer types.

Discussion: Our intuitive feature extraction approach, based on pathway-level principal components, effectively reduces dimensionality while preserving data variance and enhancing biological interpretability. Tumor response is a complex biological phenomenon that extends beyond single gene–drug interactions. Therefore, integrating multi-omics profiles and molecular drug features within the context of biological pathways is essential for understanding drug response. This integrative approach has strong potential to support targeted therapy design, biomarker discovery, and the advancement of precision medicine and drug development.

Workshop

PDSW'25: The 10th International Parallel Data Systems Workshop

9:00am - 9:05am CST Monday, 17 November 2025 230

Data Analytics

High Performance I/O, Storage, Archive, & File Systems

Storage

Livestreamed

Recorded

DescriptionEfficient storage, movement, and management of data are crucial to application performance and scientific productivity in both traditional simulation-oriented HPC environments and cloud AI/ML/big data analysis environments. This issue is further exacerbated by the growing volume of experimental and observational data, the widening gap in performance between computational hardware and storage hardware, and the emergence of new data-driven algorithms in machine learning. The goal of this workshop is to facilitate in-depth discussions of research and development that address the most critical challenges in large-scale data storage and data processing.

PDSW will continue to build on the successful tradition established by its predecessor workshops: the Petascale Data Storage Workshop (PDSW, 2006-2015) and the Data Intensive Scalable Computing Systems workshop (DISCS, 2012-2015). These workshops were successfully combined in 2016, and the resulting joint workshop has attracted up to 45 full paper submissions and 195 attendees per year from 2016 to 2024.

Workshop

Peachy Parallel Assignments (EduHPC 2025)

11:30am - 11:35am CST Sunday, 16 November 2025 261

Livestreamed

Recorded

DescriptionPeachy Parallel Assignments are high-quality assignments that require students to practice concepts in parallel and distributed computing. They are selected competitively and published in the Edu* workshops to provide instructors with inspiration and easy-to-adopt assignments. The assignments must have been successfully tested with real students, easy to adopt by other instructors in a variety of contexts, and ``cool and inspirational'' for students completing them.

This article presents three Peachy Parallel Assignments selected for presentation at EduHPC 2025.
The first is simulation of the growth of ``fairy rings'', a biologically-motivated variation of the Game of Life. The second assignment asks students to simulate flooding over uneven terrain and in the presence of active rainfall. The third assignment has them implement the softmax function in parallel, motivated by applications in deep learning.

Workshop

Peachy Q&A

11:50am - 12:00pm CST Sunday, 16 November 2025 261

Livestreamed

Recorded

Workshop

PEAK: Cost-Adaptive Profiling in a Heartbeat

5:00pm - 5:30pm CST Monday, 17 November 2025 241

Livestreamed

Recorded

DescriptionInstrumentation-based profiling is essential for uncovering fine-grained optimization opportunities in High-Performance Computing (HPC) and cloud applications, yet static instrumentation methods often impose fixed profiling overheads that cannot adapt to the dynamic workloads from applications at runtime.
We further develop PEAK, a Dynamic Binary Instrumentation (DBI)-based profiler with two complementary modes for overhead control: a static mode, which enforces an upper limit on the absolute instrumentation overhead, and a dynamic mode based on the heartbeat mechanism, which controls the relative overhead in real time to maintain a user-defined ratio.
Evaluations of workloads ranging from compute-intensive kernels to lightweight functions show that the heartbeat mechanism effectively bounds overhead while improving profile accuracy compared to static methods, delivering predictable and adaptive profiling performance for long-running, dynamic workloads.

Workshop

PerfAnalyzer: Revealing Performance Trends using Version Oriented Visual Analysis of Scientific Software

12:00pm - 12:15pm CST Monday, 17 November 2025 267

Livestreamed

Recorded

DescriptionUnderstanding the behavior of scientific software is essential in maintaining the integrity and transparency of computational research. Tracking the changes in computational parameters (input-output parameters, configuration parameters for hardware and software) across different versions of software executions helps in understanding by providing a context to correlate the outcomes with meaningful changes and to interpret the results reliably. Diverse computing environments and software platforms add complexity in tracking the evolution of computational parameters across multiple runs. We present PerfAnalyzer, an interactive dashboard to simplify the collection, management, and visual analysis of computational parameters across Git commits. We demonstrate the usefulness of the dashboard in identifying performance issues through a case study on collecting and analyzing computational parameters of the CloverLeaf mini-application. The results of the case study show PerfAnalyzer's ability to highlight performance changes across versions and identify parameters related to the changes that are difficult to locate using isolated measurements.

Paper

PerfDojo: Automated ML Library Generation for Heterogeneous Architectures

10:30am - 10:52am CST Tuesday, 18 November 2025 275

HPC for Machine Learning

Programming Frameworks

Livestreamed

Recorded

DescriptionThe increasing complexity of machine learning models and the proliferation of diverse hardware architectures (CPUs, GPUs, accelerators) make achieving optimal performance a significant challenge. Heterogeneity in instruction sets, specialized kernel requirements for different data types and model features (e.g., sparsity, quantization), and architecture-specific optimizations complicate performance tuning. Manual optimization is resource-intensive, while existing automatic approaches often rely on complex hardware-specific heuristics and uninterpretable intermediate representations, hindering performance portability. We introduce PerfLLM, a novel automatic optimization methodology leveraging large language models (LLMs) and reinforcement learning (RL). Central to this is PerfDojo, an environment framing optimization as an RL game using a human-readable, mathematically-inspired code representation that guarantees semantic validity through transformations. This allows effective optimization without prior hardware knowledge, facilitating both human analysis and RL agent training. We demonstrate PerfLLM's ability to achieve significant performance gains across diverse CPU (x86, Arm, RISC-V) and GPU architectures.

Workshop

Performance Analysis of Compute Express Link (CXL) Memory Expansion with Data Interleaving

3:54pm - 4:18pm CST Monday, 17 November 2025 240

Livestreamed

Recorded

DescriptionThe performance gap between processors and memory has become a significant bottleneck nowadays, commonly referred to as the Memory Wall. Compute Express Link emerges as a promising solution to address these challenges by providing benefits to expand the memory space and bandwidth. In this work, we focus on the performance measurement and analysis of the memory interleaving strategy on CXL memory. Our experiments, conducted on both simulated and genuine CXL-enabled system, show that the naive interleaving configurations cannot always deliver the best memory bandwidth. In fact, it could be 26.97% less than the optimal configuration in the worst case. Besides, we observed distinct characteristics between emulated and genuine CXL system, presenting the limitation of evaluating performance by simulation for memory interleaving. Our work reveals the importance of interleaving configurations and provide the performance comparison with analyses for identifying the influencing factors and developing guideline of the CXL memory placement policy.

Workshop

Performance Characterization of CXL Memory and Its Use Cases

5:06pm - 5:30pm CST Monday, 17 November 2025 240

Livestreamed

Recorded

DescriptionCompute eXpress Link (CXL) is emerging as a promising memory interface technology. However, its performance characteristics remain largely unclear due to the limited availability of production hardware. In this work, we study how HPC applications and large language models (LLM) can benefit from the CXL memory, and study the interplay between memory tiering and page interleaving. We also propose a novel data object-level interleaving policy to match the interleaving policy with memory access patterns. Our findings reveal the challenges and opportunities of using CXL.

Tutorial

Performance Engineering for Sparse Linear Solvers

8:30am - 12:00pm CST Sunday, 16 November 2025 124

Livestreamed

Recorded

TUT

DescriptionThis tutorial covers code analysis, performance modeling, and optimization for sparse linear solvers on CPUs and GPUs. Performance engineering is often taught using simple loops as examples for performance models and how they can guide optimization; however, full, preconditioned linear solvers comprise multiple loops and an iteration scheme that is executed to convergence. Consequently, the concept of "optimal performance" must account for both hardware efficiency and solver convergence. After introducing basic notions of hardware organization and storage for dense and sparse data structures, we show how to apply the roofline model to such solvers in predictive and diagnostic ways and how it can be used to assess the hardware efficiency of a solver, covering important corner cases such as memory boundedness. Then we advance to preconditioned solvers, using the conjugate gradient method (CG) algorithm as a leading example. Bottlenecks of the solver are identified, followed by the introduction of optimization techniques like the use of preconditioners and cache blocking. The interplay among solver performance, convergence, and time to solution is given special attention. In hands-on exercises, attendees will be able to carry out experiments on a GPU cluster and study the influence of matrix data formats, preconditioners, and cache optimizations.

Research and ACM SRC Posters

Performance Engineering of Scientific Applications with MVAPICH and TAU Using Emerging Communication Primitives

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionWe propose a co-design approach that integrates two powerful tools—MVAPICH and TAU—to demonstrate the new possibilities for performance-guided control and optimization for two large-scale applications—AWP-ODC and heFFTe. AWP-ODC is a highly scalable parallel finite-difference application with point-to-point operations that enables 3D earthquake calculations, while heFFTe is a massively parallel application that provides scalable and efficient implementations of the widely used Fast Fourier Transform using several MPI primitives. Through a deep integration between MVAPICH and TAU, the two applications can identify their performance bottlenecks on various supercomputers with different architectures. AWP-ODC and heFFTe can also act as representative real-world benchmarks to MVAPICH and TAU. We show how the co-design approach enables AWP-ODC and heFFTe to deliver better performance on cutting-edge HPC architectures. This is achieved using 1) more optimized and fine-tuned collective operations, and 2) reduced network traffic through real-time data compression.

Workshop

Performance portable batched linear algebra kernels for transport sweeps using Kokkos

11:30am - 12:00pm CST Monday, 17 November 2025 231

Performance Evaluation, Scalability, & Portability

Livestreamed

Recorded

DescriptionThis paper describes the development of performance portable
batched linear algebra kernels for SN-DG neutron transport
sweeps using Kokkos. We establish a new sweep algorithm
for GPUs that relies on batched linear algebra kernels. We
implement an optimized batched gesv solver for small linear
systems that builds upon state-of-the-art algorithms. Our
implementation achieves high performance by minimizing
global memory traffic and maximizing the amount of compu-
tations done at compile-time. We assess the performance of
the batched gesv kernel on NVIDIA and AMD GPUs. We
show that our custom implementation outperforms state-of-
the-art linear algebra libraries on these architectures. The
performance of the new GPU sweep implementation is as-
sessed on the H100 and MI300A GPUs. We demonstrate that
our GPU implementation is able to achieve high performance
on both architectures, and is competitive with an optimized
multithreaded CPU implementation on a 128-core CPU.

Tutorial

Performance Tuning of HPC and ML/AI Applications with the Roofline Model on GPUs, APUs, and CPUs

1:30pm - 5:00pm CST Sunday, 16 November 2025 126

Livestreamed

Recorded

TUT

DescriptionThe Roofline performance model offers an insightful and intuitive method for extracting the key execution characteristics of HPC and ML/AI applications and comparing them against the performance bounds of modern CPUs and GPUs. Its ability to abstract the complexity of memory hierarchies and identify the most profitable optimization techniques has made Roofline-based analysis increasingly popular in the HPC and ML/AI communities. Although different flavors of the Roofline model have been developed to deal with various definitions of memory data movement, there remains a need for a systematic methodology when applying them to analyze applications running on multicore and accelerated systems. This tutorial aims to bridge this gap on both CPUs and GPUs by both exposing the fundamental aspects behind different Roofline modeling principles as well as providing several practical use case scenarios that highlight their efficacy for application optimization. This tutorial presents a unique combination of instruction in Roofline by its creator; hands-on instruction in using Roofline within Intel’s, NVIDIA’s, and AMD’s production performance tools; and discussions of real-world Roofline use cases at ALCF, NERSC, and OLCF computing centers. The tutorial presenters have a long history of collaborating on the Roofline model and have presented several Roofline-based tutorials.

Workshop

Performance-Portable Symbolic Factorization through Common Graph Operations

11:45am - 12:10pm CST Sunday, 16 November 2025 232

Livestreamed

Recorded

DescriptionIn lower-upper (LU) factorization in the form A=LU, symbolic factorization is a pre-processing stage performed to discover the sparsity structure of the factors. A is usually not equal to L+U due to fill-ins (the nonzeros that do not appear in A but element of L or U) introduced in factorization. Symbolic factorization can be performed with A's pattern by utilizing the corresponding graph. In this work, we assess the viability of utilizing GraphBLAS for symbolic factorization. GraphBLAS defines a standard way to express operations on graphs in the language of linear algebra. We express edge-based and path-based symbolic factorization using graph operations and investigate utilization of masks and elimination trees. Our goal is to obtain a performant symbolic factorization, which can be used in a portable manner on any hardware GraphBLAS standard is realized. We demonstrate our approach with various sparse matrices on multi-core and many-core architectures.

Paper

PGT-I: Scaling Spatiotemporal GNNs with Memory-Efficient Distributed Training

10:52am - 11:15am CST Tuesday, 18 November 2025 260-267

HPC for Machine Learning

System Software and Cloud Computing

Partially Livestreamed

Partially Recorded

DescriptionSpatiotemporal graph neural networks (ST-GNNs) are powerful tools for modeling spatial and temporal data dependencies. However, their applications have been limited primarily to small-scale datasets because of memory constraints. While distributed training offers a solution, current frameworks lack support for spatiotemporal models and overlook the properties of spatiotemporal data. Informed by a scaling study on a large-scale workload, we present PyTorch Geometric Temporal Index (PGT-I), an extension to PyTorch Geometric Temporal that integrates distributed data parallel training and two novel strategies: index-batching and distributed-index-batching. Our index techniques exploit spatiotemporal structure to construct snapshots dynamically at runtime, significantly reducing memory overhead, while distributed-index-batching extends this approach by enabling scalable processing across multiple GPUs. Our techniques enable the first-ever training of an ST-GNN on the entire PeMS dataset without graph partitioning, reducing peak memory usage by up to 89% and achieving up to a 11.78x speedup over standard DDP with 128 GPUs.

Invited Talk

Phantoms and Traces: A Field Guide to Non-Referential Data

11:15am - 12:00pm CST Wednesday, 19 November 2025 America's Ballroom Tu-Th

Art of HPC

Creativity

Livestreamed

Recorded

DescriptionData are usually understood as descriptions of the world, a reference to something out there. Yet, today it is increasingly clear that this representational perspective falls short in addressing many of the critical issues, practices, and debates surrounding data. This talk explores three non-representational perspectives on data, discussing their surprising and paradoxical consequences: autographic, data as material traces that inscribe environmental changes without symbolic mediation; synthetic, training data for AI models fabricated to resemble observations yet detached from the world they mimic; and toxic, infotrash and AI slop, the proliferating by-products of AI that sustain digital capitalism despite their apparent uselessness. By focusing on the agency of data, I argue that focusing on representation alone obscures the relational, material, and economic dimensions through which data now operate.

Paper

Phoenix: A Refactored I/O Stack for GPU Direct Storage Without Phony Buffers

1:52pm - 2:15pm CST Wednesday, 19 November 2025 275

Data Analytics, Visualization & Storage

Livestreamed

Recorded

DescriptionGPU Direct Storage (GDS) plays a vital role in GPU storage systems, utilizing P2P-DMA technology to establish a direct data transfer path between the GPU and storage devices. This direct path reduces storage access latency and CPU overhead, thus improving data transfer efficiency. Currently, however, GDS employs a phony buffer in host memory to interact with the Linux kernel, resulting in suboptimal performance, additional resource consumption, and deployment complexity.

In this paper, we propose Phoenix, a refactored GDS software stack without phony buffers. Phoenix employs the memory mapping service of ZONE_DEVICE to map GPU memory into the page table at system startup. The kernel module of Phoenix stores the returned address information, allocates user-space virtual memory, and establishes a mapping with the designated GPU memory. Extensive evaluation shows that, compared to the existing GDS software stack, Phoenix reduces software overhead along the critical I/O path and improves end-to-end performance.

Workshop

Physical System Study on Balancing Interactive and Batch Job Performance through Oversubscribing Scheduling

9:35am - 10:00am CST Friday, 21 November 2025 260-267

Livestreamed

Recorded

DescriptionThis paper evaluates oversubscribing in High-Performance Computing (HPC) systems as a solution to balance interactive and batch job performance. Using real workload traces and physical hardware experiments, we demonstrate that oversubscribing can reduce queue waiting times while maintaining overall system performance. Our results show this approach (1) decreases waiting times for interactive jobs, (2) has minimal impact on overall system throughput, and (3) effectively manages individual job turnaround times. Unlike traditional multiple queue approaches, oversubscribing provides these benefits with simpler configuration requirements. Additionally, through quantitative memory usage analysis, we provide insights into oversubscribing applicability for production capacity planning. Our research contributes empirical evidence of its effectiveness in real HPC environments, supported by comprehensive experimental data and practical implementation insights.

Research and ACM SRC Posters

PhySiViT: A Physics Simulation Vision Transformer

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionModern scientific computing generates massive simulation data across physics domains, yet researchers lack general-purpose tools for efficient analysis. While vision transformers like CLIP and DINO have revolutionized natural image analysis, no equivalent exists for physics simulation data. This project trains a custom vision transformer on “the Well” dataset, a 15 TB collection of diverse physics simulations. Using only 7 million images (compared to >100 million for CLIP/DINOv2), we trained our physics foundation model in 22 hours on a single Cerebras CS-3 server. Despite reduced training scale, our model demonstrates competitive classification performance while exceeding at physics-specific tasks: temporal forecasting (𝑅2 = 0.33 vs. DINOv2’s 0.23) and physics clustering (silhouette score = 0.232 vs. DINOv2’s 0.195). This work demonstrates that efficient, domain-focused foundation models can achieve better performance in specialized scientific domains.

Invited Talk

Pixels, Petabytes, and Predictions: The Supercomputing Demands of Modern Geospatial AI and How St. Louis Is the Center of Gravity

11:15am - 12:00pm CST Thursday, 20 November 2025 America's Ballroom Tu-Th

AI, Machine Learning, & Deep Learning

Artificial Intelligence & Machine Learning

Big Data

Livestreamed

Recorded

DescriptionThe proliferation of geospatial artificial intelligence (GeoAI) is generating unprecedented demands on high performance computing infrastructure. Training foundational Earth observation (EO) models on petabytes of satellite imagery, running inference across multi-modal global geospatial models, and generating global vector datasets like agricultural field boundaries and building footprints all require computational strategies at a massive scale.

These complex tasks, which are fundamental to both commercial industries and national security, push the boundaries of modern supercomputing, demanding novel approaches to data management, model parallelism, and distributed inference pipelines that can handle the sheer volume and velocity of geospatial data.

St. Louis has emerged as a critical epicenter for this technological convergence, fostering a unique ecosystem where GeoAI innovations are cross-pollinating between disparate domains. This presentation will highlight the region's synergistic environment, where cutting-edge techniques developed for precision agriculture directly inform vital applications in national security, and vice versa. We will explore specific case studies that demonstrate how this local cross-fertilization is accelerating the development of next-generation geospatial capabilities and cementing St. Louis's role as a global leader in solving planetary-scale challenges through supercomputing.

Paper

Plexus: Taming Billion-Edge Graphs with 3D Parallel Full-Graph GNN Training

10:30am - 10:52am CST Tuesday, 18 November 2025 260-267

HPC for Machine Learning

System Software and Cloud Computing

Partially Livestreamed

Partially Recorded

DescriptionGraph neural networks leverage the connectivity and structure of real-world graphs to learn intricate properties and relationships between nodes. Many real-world graphs exceed the memory capacity of a GPU due to their sheer size, and distributed full-graph training suffers from high communication overheads and load imbalance due to the irregular structure of graphs. We propose a three-dimensional parallel approach for full-graph training that tackles these issues and scales to billion-edge graphs. In addition, we introduce optimizations such as a double permutation scheme for load balancing, and a performance model to predict the optimal 3D configuration of our parallel implementation: Plexus. We evaluate Plexus on six different graph datasets and show scaling results on up to 2,048 GPUs of Perlmutter, and 1,024 GPUs of Frontier. Plexus achieves unprecedented speedups of 2.3-12.5X over prior state of the art, and a reduction in time-to-solution by 5.2-8.7X on Perlmutter and 7.0-54.2X on Frontier.

Workshop

PMBS25: The 16th International Workshop on Performance Modeling, Benchmarking, and Simulation of High-Performance Computer Systems

9:00am - 9:01am CST Monday, 17 November 2025 267

Livestreamed

Recorded

DescriptionThe PMBS25 workshop is concerned with the comparison of high-performance computing systems through performance modeling, benchmarking, or through the use of tools such as simulators. We are particularly interested in research that reports the ability to measure and make tradeoffs in software/hardware co-design to improve sustained application performance. We are also keen to capture the assessment of future systems.

The aim of this workshop is to bring together researchers, from industry and academia, concerned with the qualitative and quantitative evaluation and modeling of high-performance computing systems. Authors are invited to submit novel research in all areas of performance modeling, benchmarking, and simulation, and we welcome research that brings together current theory and practice. We recognize that the term "performance" has broadened to include power consumption and reliability, and that performance modeling is practiced through analytical methods and approaches based on software tools and simulators.

SCinet

Policy-Aware Distributed Vertical Federated Learning Infrastructure

10:40am - 11:00am CST Thursday, 20 November 2025 Booth 3537 - SCinet Theater

Not Livestreamed

Not Recorded

Tutorial

Portable GPU Acceleration of HPC Applications with Standard C++

8:30am - 12:00pm CST Monday, 17 November 2025 121

Livestreamed

Recorded

TUT

DescriptionIn this tutorial, you’ll discover the portable parallelism and concurrency features of the ISO C++23 standard and learn to accelerate HPC applications on modern, heterogeneous GPU-based systems from all three main vendors (AMD, Intel, NVIDIA), without any non-standard extensions. We’ll show you how to parallelize classic HPC patterns like multi-dimensional loops and reductions, and how to solve common problems like overlapping MPI communication with GPU computation. The material is supplemented with numerous hands-on exercises and illustrative HPC mini-applications. All exercises will be done on cloud GPU instances directly in your web browser—no setup required. The tutorial synthesizes practical techniques acquired from our professional experience to show how the C++23 standard programming model applies to real-world HPC workloads, and which thoughts went into implementing and designing the programming model itself. You'll also receive links to additional resources and a preview of upcoming C++ features.

Workshop

Porting a Fortran plasma simulation to Exascale on AMD GPUs using both OpenMP and Kokkos

12:00pm - 12:30pm CST Monday, 17 November 2025 266

Livestreamed

Recorded

DescriptionThis paper presents the 2-step work undertaken to port GYSELA, a petascale Fortran simulation code for turbulence in tokamak plasmas, to GPUs. The initial porting process using OpenMP offloading allowed for good performance in most of the code, with the exception of the collision operator, which became a major bottleneck. This performance critical operator was then rewritten in C++ using Kokkos. It is now known as KoLiOp, and is now one of the code modules with the largest speedups relative to the CPU baseline. We explain our strategy in both phases of development and provide an in-depth analysis of how we leveraged each framework for overall performance. The techniques detailed are applicable to other codes seeking to use a portability layer. Finally, we present a comparative benchmark run on the CPU (AMD Genoa) and GPU (MI250X) partitions of the Adastra machine as well as its upcoming MI300A APU nodes.

Workshop

Post-Variational Quantum Neural Networks on a Hybrid HPC-QC System

4:20pm - 4:40pm CST Sunday, 16 November 2025 231

Livestreamed

Recorded

DescriptionWe implement a post-variational quantum neural network on a real HPC-QC system and show the feasibility of fully training this class of algorithms on current Noisy Intermediate-Scale Quantum (NISQ) devices, which are limited by noise, low number of qubits, and scarcity. Post-variational methods are hybrid classical-quantum Machine Learning algorithms that remove the need for quantum circuits evaluations during training, thus making them more suited to the availability constraints of physical quantum devices. We investigate the scalability of the algorithm to a higher number of qubits, larger datasets, and more elaborate models, giving insight for more efficient implementation. Experiments for an image classification task on a cutting-edge HPC-QC system show that post-variational quantum neural networks are fully trainable in reasonable times on a superconducting device. The models trained also show performance at least comparable to a variational approach, with one configuration showing a significant improvement in classification accuracy.

Workshop

Poster Session Highlights

9:45am - 10:00am CST Monday, 17 November 2025 241

Livestreamed

Recorded

Birds of a Feather

PPP: Pain of Parallel Programming

12:15pm - 1:15pm CST Tuesday, 18 November 2025 124

Education & Workforce Development

Livestreamed

Recorded

XO/EX

DescriptionWriting good parallel programs is painful. Very painful.

This is because high performance compute systems have evolved from simple single-core machines, strung together with Ethernet, into multi-core, multi-accelerator, multi-level monsters. As a consequence, programming such systems means dealing with synchronization and communication overhead, load imbalance, and a multitude of programming models and languages.

In this BoF we bring together application people to share their pain in programming parallel systems, with people working on programming frameworks and models. We hope this can lead to insights and solutions to alleviate the pain.

Research and ACM SRC Posters

Practical Viability of Translating Legacy Fortran Code to C++ Using Large Language Models

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionIn this study, we discuss the practicality and limitations of using current large language models (LLMs) for automatically translating Fortran legacy codes to C++, so that legacy codes written in Fortran can be modernized to exploit the performance and features available only in C++. Moreover, we investigate the effectiveness of in-context learning (ICL: translation using custom prompts) and interactive translation (IT: re-translating the code when a compile error occurs) at automatic Fortran-to-C++ translation. In our evaluation, the rate of producing the same results as the original code, called the output match rate, is used as the primary evaluation metric. The evaluation results not only demonstrate that it is difficult even for the latest LLM to achieve 100% accurate translation at present, but also that ICL and IT are effective to improve the accuracy.

Workshop

Predicting Resources for AI Workloads in HPC: Methods, Challenges, and Opportunities

2:30pm - 2:35pm CST Monday, 17 November 2025 276

Partially Livestreamed

Partially Recorded

DescriptionArtificial Intelligence (AI) is rapidly transforming scientific discovery and industrial applications, but its growth has escalated demands on high-performance computing (HPC) resources. A central challenge is predicting resource requirements for deep neural network (DNN) workloads, where inefficient provisioning leads to underutilized GPUs, wasted CPUs, and higher costs. This work explores AI resource prediction in HPC using complementary approaches: Black-Box models leverage tabular features and regressors such as XGBoost for fast, workload-specific predictions. In contrast, White-Box models extract graph-based features from High-Level Optimized (HLO) graphs to generalize across architectures. Results show hybrid methods significantly improve accuracy, reducing fit-time estimation error from 75.48% to 10.55%. The estimators are being integrated with AI-driven job schedulers to improve workload allocation and utilization, paving the way for creating agents for Machine Learning Workflow (MLOps) systems across the computing continuum.

Workshop

Preserving CUDA Syntax for SYCL Portability: A Thin C++ Abstraction without Kernel Migration

2:00pm - 2:30pm CST Monday, 17 November 2025 231

Performance Evaluation, Scalability, & Portability

Livestreamed

Recorded

DescriptionPreparing large-scale scientific applications for diverse GPU architectures requires strategies that balance performance, portability, and long-term maintainability.
We introduce a unified kernel abstraction and evaluate it using CRK-HACC, a production N-body cosmology code, enabling single-source compilation through both CUDA and SYCL toolchains. Our approach introduces a thin C++ layer that preserves the original CUDA kernel syntax and launch style while providing SYCL compatibility through a mechanical ``functorization'' process. This method avoids the complexity of automated source translation, retains architecture-specific optimizations, and reduces maintenance effort by eliminating code duplication. We evaluate the implementation on two DOE leadership systems—Polaris (NVIDIA GPUs) and Aurora (Intel GPUs)—comparing kernel-level execution times across backends and architectures. Results show competitive performance for SYCL relative to native CUDA while preserving code clarity and portability.
This case study demonstrates a practical path toward sustaining performance in complex, physics-rich codes as HPC hardware continues to evolve.

Workshop

Pretraining LLMs at Scale: Tuning Strategies and Performance Portability.

2:30pm - 3:00pm CST Monday, 17 November 2025 267

Livestreamed

Recorded

DescriptionTraining large language models (LLMs) at scale presents challenges that demand careful co-design across software, hardware, and parallelization strategies. In this work, we introduce a communication-aware tuning methodology for optimizing LLM pretraining, and extend the performance portability metric to evaluate LLM-training efficiency across our systems. Our methodology, validated through LLM pretraining workloads at a leading global technology enterprise, delivered up to 1.6x speedup over default configurations. We further provide six key insights that challenge prevailing assumptions in LLM training performance, including the trade-offs between ZeRO stages, the default DeepSpeed communication collectives, and the critical role of batch size choices. Our findings highlight the need for platform-specific tuning and advocate for a shift toward end-to-end co-design to unlock performance efficiency in LLM training.

Tutorial

Principles and Practice of High-Performance Deep/Machine Learning Training and Inference

8:30am - 5:00pm CST Sunday, 16 November 2025 131

Livestreamed

Recorded

TUT

DescriptionRecent advances in machine learning and deep learning (ML/DL) have led to many exciting challenges and opportunities. Modern ML/DL frameworks including PyTorch, TensorFlow, and cuML enable high-performance training, inference, and deployment for various types of ML models and deep neural networks (DNNs). This tutorial provides an overview of recent trends in ML/DL and the role of cutting-edge hardware architectures and interconnects in moving the field forward. We will also present an overview of different DNN architectures, ML/DL frameworks, DL training and inference, and hyperparameter optimization, with special focus on parallelization strategies for large models such as GPT, LLaMA, DeepSeek, and ViT. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU/GPU architectures to efficiently support large-scale distributed training. We also highlight some of our co-design efforts to utilize MPI for large-scale DNN training on cutting-edge CPU/GPU/DPU architectures available on modern HPC clusters. Throughout the tutorial, we include several hands-on exercises to enable attendees to gain firsthand experience of running distributed ML/DL training and hyperparameter optimizations on a modern GPU cluster.

Research and ACM SRC Posters

Process-Based Predictors of Vulnerability Reintroduction

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionThere is growing interest in securing scientific software, which underpins research results and often transitions into commercial systems. While source code metrics provide useful indicators of vulnerabilities, software engineering process (SEP) metrics can uncover patterns that lead to their introduction. Few studies have explored whether SEP metrics can reveal risky development activities over time—insights that are essential for predicting vulnerabilities.

This work highlights the critical role of SEP metrics in understanding and mitigating vulnerability reintroduction. We move beyond file-level prediction and analyze security fixes at the commit level, focusing on sequences of changes where vulnerabilities evolve and re-emerge. Our approach emphasizes that reintroduction is rarely the result of one isolated action, but emerges from cumulative development activities and socio-technical conditions.

Research and ACM SRC Posters

Productive Scalable Distributed Task Scheduling Using an MPI-based Backend for Dagger

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionDistributed computing frameworks are vital for managing complex workloads in high-performance computing and scientific research. Julia's Dagger.jl supports task-based parallelism using TCP communication, suitable for cloud and local environments. However, TCP limits performance on modern HPC systems with low-latency, high-bandwidth interconnects. We introduce MPIAcceleration, an MPI-based backend replacing TCP with MPI-aware task placement and data movement. We benchmarked MPIAcceleration against TCP using parallel Cholesky decomposition on the Aurora exascale supercomputer at Argonne National Laboratory. Results show MPI successfully enables Dagger on HPC interconnects, significantly outperforming TCP on Aurora by overcoming the latency limitations inherent in standard TCP.

Workshop

Profiling Application-Specific Properties of Irregular Graph Algorithms on GPUs

10:55am - 11:20am CST Sunday, 16 November 2025 232

Livestreamed

Recorded

DescriptionWith the availability of sophisticated profiling tools for GPUs such as NVIDIA’s Nsight Compute and Nsight Systems, programmers tend to overlook the level of insight that can be gained from simple profiling techniques. For instance, the basic profiling approach of manually adding counters to source code is able to expose important application-specific behavior that general-purpose profilers cannot capture. Analyzing global or thread-local counts of certain events can help developers better reason about program behaviors that are crucial for detecting performance bottlenecks, validating key assumptions, and guiding effective optimizations. In this paper, we demonstrate on the example of 5 high-performance GPU graph-analytics codes how we used this profiling approach to uncover interesting application behaviors and to develop performance optimizations based on some of them.

Workshop

Programmatic Innovations in Cancer Team Data Science

11:35am - 11:55am CST Monday, 17 November 2025 241

Livestreamed

Recorded

DescriptionInnovation in cancer care and research come about in many ways. High performance computing (HPC) has expanded the frontier for innovation in cancer and promises accelerated impact on the massive challenge of cancer. Yet even with visionary inspiration, vast quantities of data, and a growing capacity in HPC, there are many key technical, scientific, organizational and cultural challenges that remain in realizing impactful patient outcomes.

The volume of real-world patient data at The University of Texas MD Anderson Cancer Center is tremendous with 1.6 million outpatient patient visits per year, 14 million pathology and laboratory procedures, and over 600,000 diagnostic imaging procedures each year. Harnessing and leveraging this information, however, requires more than HPC and computational algorithms. Ultimately, it requires building a data and data science ecosystem to enable interdisciplinary teams to effectively communicate around the data. To formulate actionable questions for cancer discovery and clinical data, teams must appreciate the importance of the context of the data and embrace the complexity of the data, cancer, and the surrounding healthcare system.

MD Anderson Cancer Center has taken a particularly innovative approach in creating the Institute for Data Science in Oncology (IDSO) as part of an overall institutional strategy to tackle these challenges and accelerate translational impact for data science. While not a traditional computational approach for cancer focusing on algorithms, methods, technologies, and implementations, the IDSO programmatic approach is addressing key challenges of creating an organizational ecosystem and culture that readily embraces, innovates, advances, and adopts computational and data science approaches to cancer.

Building on a decade of formative efforts and formally launched in 2023, the IDSO approach is anchored with three pillars of team data science, translational impact, and continuous learning and innovation, all with a direction for improving patient care. The IDSO serves as a hub with defined programs in education and culture, collaboration (both internal and external), and five co-led team data science focus areas emphasizing translational impact in domains of quantitative imaging, single cell spatial analytics, computational modeling for precision medicine, decision analytics for health, and safety, quality, and access.

Already, IDSO is having key impacts, opening avenues for innovation in computational approaches, data flows and growing demand for HPC in meeting cancer challenges. A collaboration with the University of Texas at Austin, the Texas Advanced Computing Center and MD Anderson co-led with IDSO has led to over 20 new collaborative projects involving HPC. The Tumor Measurement Initiative, which heavily uses HPC to train AI models using MD Anderson’s vast image resources has prepared hundreds of imaging datasets utilizing tens of thousands of images and developed an initial library of model algorithms. The IDSO affiliates program now includes more than 50 individuals from across the institution. And in just two short years, the fellowship training program has trained more than 38 personnel in data science.

The presentation will provide useful insights including lessons learned in forming, launching and establishing the IDSO, perspectives on challenges that require communities to solve, and thoughts on future directions.

Workshop

Programmer productivity and performance on AMD's AI Engines: Offloading Fortran intrinsics via MLIR a case-study

9:38am - 10:00am CST Friday, 21 November 2025 263-264

Livestreamed

Recorded

DescriptionA major challenge the HPC community faces is how to deliver increased performance demanded by scientific programmers, whilst addressing an increased emphasis on sustainable operations. Specialised architectures, such as FPGAs and AMD's AI Engines (AIEs), have demonstrated significant energy efficiency advantages, however a major challenge is that substantial expertise and investment of time is required to gain best performance from this hardware which is a major blocker.

Fortran in the lingua franca of scientific computing, and in this paper we explore the automatic offloading of Fortran intrinsics to the AIEs in AMD's Ryzen AI CPU as a case study, demonstrating how the MLIR compiler ecosystem can provide performance and programmer productivity. We describe an approach that lowers the MLIR linear algebra dialect to AMD's AIE dialects, and demonstrate that for suitable workloads the AIEs can provide significant performance advantages over the CPU without any code modifications required by the programmer.

Tutorial

Programming Novel AI Accelerators for Scientific Computing

1:30pm - 5:00pm CST Sunday, 16 November 2025 121

Livestreamed

Recorded

TUT

DescriptionScientific applications are increasingly adopting artificial intelligence (AI) techniques to advance science. There are specialized hardware accelerators designed and built to run AI applications efficiently. With a wide diversity in the hardware architectures and software stacks of these systems, it is challenging to understand the differences between these accelerators, their capabilities, programming approaches, and how they perform, particularly for scientific applications. In this tutorial, we will cover an overview of the AI accelerators landscape, focusing on SambaNova, Cerebras, Graphcore, Groq, and Intel Gaudi systems along with architectural features and details of their software stacks. Through hands-on exercises, attendees will gain practical experience in refactoring code and running models on these systems, focusing on use cases of pre-training and fine-tuning open-source large language models (LLMs) and deploying AI inference solutions relevant to scientific contexts. The tutorial will provide attendees with an understanding of the key capabilities of emerging AI accelerators and their performance implications for scientific applications.

Tutorial

Programming Your GPU with OpenMP: A 'Hands-On' Introduction

1:30pm - 5:00pm CST Monday, 17 November 2025 121

Livestreamed

Recorded

TUT

DescriptionIf you are an HPC programmer, you know OpenMP. Alongside MPI, OpenMP is the open, cross-vendor foundation of HPC. As hardware complexity has grown, OpenMP has grown as well, adding GPU support in OpenMP 4.0 (2013). With a decade of evolution since then, OpenMP GPU technology is a mature option for programming any GPU you are likely to find on the market. While there are many ways to program a GPU, the best way is through OpenMP. Why? Because the GPU does not exist in isolation. There are always one or more CPUs on a node. Programmers need portable code that fully exploits all available processors. In other words, programmers need a programming model, such as OpenMP, that fully embraces heterogeneity. In this tutorial, we explore GPU programming with OpenMP. We assume attendees already know the fundamentals of multithreading with OpenMP, so we will focus on the directives that define how to map loops onto GPUs and optimize data movement between the CPU and GPU. Students will use their own laptops (with Windows, Linux, or macOS) to connect to remote servers with GPUs and all the software needed for the tutorial.

Flash Session

Progress on Developing a Method of Test for Rating Single-Phase Liquid-to-Liquid Coolant Distribution Units

2:45pm - 3:00pm CST Wednesday, 19 November 2025 Booth 2638 - Flash Session

Not Livestreamed

Not Recorded

XO/EX

DescriptionCurrently there is no recognized test standard for determining the thermal capacity of a single-phase liquid-to-liquid coolant distribution unit (CDU). The absence of a nationally recognized method of test for rating adds to the complexity of making meaningful decisions about which vendors’ CDUs should be deployed and what their actual thermal capacity is. In the absence of an accepted standard, it is difficult for engineers and owners to make fair and accurate comparisons between different CDUs.

In this presentation, Dave Meadows will address the current state of development of a method of test for rating of liquid-to-liquid single-phase coolant distribution units. He will detail progress made within the ASHRAE 127 standard and discuss how this relates to AHRI standard 1360. Dave will discuss the proposed methodology and required test equipment needed to record the thermal capacity and hydraulic capabilities of a single-phase liquid-to-liquid CDU.

Exhibitor Forum

Proposal for OASIS: An interoperable and standards-based Computational Storage System to accelerate data analytics in HPC

10:30am - 11:00am CST Tuesday, 18 November 2025 130

Data Analytics

Livestreamed

Recorded

XO/EX

DescriptionIn high performance computing (HPC) environments, data analytics systems often face inefficiencies related to I/O, leading to increased costs. To address these challenges, we propose OASIS (Object-based Analytics Storage for Intelligent SQL Query Offloading), an interoperable and standards-based computational storage system that leverages Substrait and Arrow from the Apache ecosystem.

A key feature of the OASIS system is its ability to provide a consistent data view and a unified analytics methodology across the entire infrastructure, from compute nodes to data-aware computational storage devices (CSDs). This capability enables the creation of a vertically optimized and scalable analytics pipeline, facilitating the flexible distribution of computational loads and promoting optimal performance throughout the data analytics system.

In this talk, we will share performance results for HPC workload analysis within the OASIS-based data analytics system and discuss the applicability of integrating OASIS with existing data analytics frameworks.

SCinet Network Research Exhibition

Prototyping an AmSC/IRI Agent for Multi-Facility Portable Computing Jobs

11:40am - 12:00pm CST Wednesday, 19 November 2025 Booth 3537 - SCinet Theater

Not Livestreamed

Not Recorded

Workshop

Provisioning to Disk with Warewulf v4

9:46am - 10:01am CST Sunday, 16 November 2025 276

Livestreamed

Recorded

DescriptionWarewulf v4, the current generation of the popular cluster provisioning system, was significnatly simpler than its predecessor by supporting only a single-stage, provision-to-memory pattern. While this simplification has many benefits, users of the platform, particularly those coming from Warewulf 3, continue to request "stateful" provisioning support. The Warewulf development community, however, is protective of the simplicity of the current platform, and wants to introduce "diskful" provisioning features without sacrificing the simplicity and benefits of the stateless provisioning paradigm.
Here we present recent additions to Warewulf's ability to provision local storage and to provision a node image to that local storage. We also present a proposed roadmap for the future of disk provisioning in Warewulf as a prompt for further community feedback.

Workshop

Pushing the Limits of Cold Storage for Research Data With Elm: An Open-Source S3-to-Tape System

11:45am - 12:00pm CST Sunday, 16 November 2025 276

Livestreamed

Recorded

DescriptionThis presentation introduces Elm, Stanford Research Computing’s latest storage system designed to handle large-scale archiving of research data, up to hundreds of petabytes. With a strong focus on affordability and energy-efficiency, Elm combines several open-source technologies: MinIO for S3 compatibility and high-level data security through erasure coding, Lustre with built-in parallel hierarchical storage management (HSM), Phobos for modern tape management, and LTFS for easy access to tape data in a standardized format. Together, these elements create a seamless S3 experience for researchers, and offers them access to scalable cold storage for their archival needs. Elm opens new opportunities for data storage at Stanford and has the potential to be replicated at other research institutions.

Workshop

Q/A and Discussion

2:50pm - 3:00pm CST Monday, 17 November 2025 232

Livestreamed

Recorded

Workshop

QC-HPC Systems for the Quantum Advantage Era

11:30am - 12:00pm CST Sunday, 16 November 2025 275

Livestreamed

Recorded

DescriptionThis talk outlines system- and algorithm-level progress toward quantum-advantage-ready QC-HPC, where quantum and classical resources cooperate and are integrated into a single logical system. We will focus on Pasqal's neutral atom QPUs and their native ability for quantum simulation, as well as progress towards fault-tolerant, digital, QPUs. We also discuss the design of heterogeneous (multi-modal) QC-HPC systems.

Paper

QDockBank: A Dataset for Ligand Docking on Protein Fragments Predicted on Utility-Level Quantum Computers

3:52pm - 4:15pm CST Tuesday, 18 November 2025 260-267

Post-Moore Computing

Quantum Computing

Livestreamed

Recorded

DescriptionProtein structure prediction is a core challenge in computational biology, particularly for fragments within ligand-binding regions, where accurate modeling is still difficult. Quantum computing offers a novel first-principles modeling paradigm, but its application is currently limited by hardware constraints, high computational cost, and the lack of a standardized benchmarking dataset. In this work, we present QDockBank—the first large-scale protein fragment structure dataset generated entirely using utility-level quantum computers, specifically designed for protein–ligand docking tasks. QDockBank comprises 55 protein fragments extracted from ligand-binding pockets. The dataset was generated through tens of hours of execution on superconducting quantum processors, making it the first quantum-based protein structure dataset with a total computational cost exceeding $1 million. Experimental evaluations demonstrate that structures predicted by QDockBank outperform those predicted by AlphaFold2 and AlphaFold3 in terms of both RMSD and docking affinity scores. QDockBank serves as a new benchmark for evaluating quantum-based protein structure prediction.

Workshop

Qiskit's C API: Enabling quantum programming for HPC

2:00pm - 2:20pm CST Sunday, 16 November 2025 275

Livestreamed

Recorded

DescriptionQiskit is a popular open source SDK for quantum computing it enables users to build quantum circuits, compile them for a specific quantum computer, and provides interfaces for running circuits. Historically Qiskit only exposed a Python interface, but this has changed in recent releases and Qiskit now also has a C API. This talk will explain how Qiskit's C API was developed, how to use the new API, and show how it is building a broader ecosystem for quantum software. It will also demonstrate practical examples of using the C API to run circuits on quantum computers.

Workshop

qiskit-addon-sqd-hpc: A C++ template library for sample-based quantum diagonalization (SQD)

2:20pm - 2:40pm CST Sunday, 16 November 2025 275

Livestreamed

Recorded

DescriptionSample-based Quantum Diagonalization (SQD) is a hybrid quantum-classical algorithm which approximates the ground state of a many-body quantum system. This talk introduces a new Qiskit addon for SQD, implemented as a modern C++ template library. Closely related to the original Python-based SQD addon, this new library provides high-performance implementations of key algorithmic components: post-selection, subsampling, and configuration recovery. It is designed to integrate with the SBD eigensolver developed at RIKEN, enabling large-scale SQD calculations.This talk will demonstrate how these components, together with the Qiskit C API, enable the construction of fully compiled applications for hybrid quantum/classical workflows. As a case study, we present an open-source application that uses SQD to approximate the ground state energy of the Fe₄S₄ cluster—a molecule of interest in quantum chemistry. This example showcases Qiskit's readiness for high-performance computing environments and its growing support for compiled, scalable quantum-classical applications.

Paper

Qonductor: A Cloud Orchestrator for Quantum Computing

3:30pm - 3:52pm CST Tuesday, 18 November 2025 260-267

BSP

Post-Moore Computing

Quantum Computing

Livestreamed

Recorded

DescriptionWe describe Qonductor, a cloud orchestrator for hybrid quantum-classical applications that run on heterogeneous hybrid resources. Qonductor abstracts away the complexity of hybrid programming and resource management by exposing the Qonductor API, a high-level and hardware-agnostic API. The resource estimator strategically balances quantum and classical resources to mitigate resource contention and the effects of hardware noise. The hybrid scheduler automates job scheduling on hybrid resources and balances the tradeoff between users’ objectives of QoS and the cloud operator’s objective of resource efficiency.

We implement an open-source prototype and evaluate Qonductor using more than 7,000 real quantum runs on the IBM Quantum Cloud to simulate real cloud workloads. Qonductor achieves up to 54% lower job completion times (JCTs) while sacrificing 3% execution quality; balances the load across QPU, which increases quantum resource utilization by up to 66%; and scales with growing system sizes and loads.

Workshop

Quantifying AWS S3 I/O Performance Boundaries Using the Roofline Model

2:30pm - 3:00pm CST Monday, 17 November 2025 230

Data Analytics

High Performance I/O, Storage, Archive, & File Systems

Storage

Livestreamed

Recorded

DescriptionHigh-performance computing facilities increasingly form hybrid environments that integrate cloud services.
To avoid cumbersome network transfers when sharing data, a new class of storage gateways map a subset of facility storage to a cloud counterpart and automatically manage data mirroring.
However, the performance characteristics of accessing AWS's S3 from HPC systems using different methods and patterns remains poorly understood.
This paper presents a roofline-based analysis of three S3 integration approaches: NFS-mounted AWS Storage Gateway, data migration through Storage Gateway, and direct S3 API transfers. We extend I/O roofline modeling to characterize operational intensity and bandwidth ceilings across varying data sizes and access patterns.
Our experimental evaluation demonstrates significant performance differences between access methods, with POSIX I/O on NFS Storage Gateway achieving up to 6.4× higher bandwidth than other approaches for large transfers. The roofline analysis reveals distinct characteristics for each method, enabling informed selection of S3 integration strategies.

Exhibitor Forum

Quantum Acceleration: A Scalable Hybrid Framework for HPC

3:30pm - 4:00pm CST Wednesday, 19 November 2025 130

Quantum & Other Post Moore Computing Technologies

Livestreamed

Recorded

XO/EX

DescriptionHigh performance computing is at a critical inflection point. To move beyond Moore's Law, new computational paradigms have become a necessity. Quantum computing is no longer a future-facing technology; it is a powerful accelerator for today’s most complex challenges. This technical presentation introduces a hybrid quantum-classical framework designed to deliver scalable and reliable quantum systems for today's HPC workloads.

We will detail IonQ’s approach, centered on "qubit virtualization," which is enabled by a flexible architecture that supports any error correction code and features all-to-all connectivity. This framework, combined with advanced error mitigation techniques, unlocks the full potential of integrating quantum resources into existing HPC workflows.

Join us to explore how our accelerated roadmap and hybrid quantum architecture unlock and help solve new problem classes in critical areas like materials science and complex optimization.

Workshop

Quantum Computers in Chemistry

9:30am - 10:00am CST Sunday, 16 November 2025 275

Livestreamed

Recorded

DescriptionClassical computing hardware has been governed by Moore’s Law for decades, but in the coming years the ever increasing computational performance of classical hardware is expected to plateau. Because of this the computer hardware industry is looking for new routes to increase computational performance. Quantum computing holds great promise to extend computational performance, but the current hardware suffers from high levels of noise making these systems hard to use without error mitigations strategies. Recently, a method called subspace quantum diagonalization or SQD was introduced by IBM to solve electronic structure problems of relevance to chemistry. SQD is an example of quantum centric supercomputing (QCSC) where quantum hardware is used on specific aspects of a computational problem while the classical hardware solves the remaining aspects of the computational task. SQD uses the quantum hardware to access Slater determinants which describe how the electrons are arrayed in the molecular orbitals of a molecule, while the generated Slater determinants are then corrected on the classical hardware. The corrected Slater determinants then form the basis used in several formulations configuration interaction calculations. In my presentation I will review the theoretical details of the SQD method and highlight several applications of SQD to chemistry including the study of intermolecular interactions, calculations of large drug-like molecules and biomolecules, inclusion of solvation effects and the integration of the SQD method with statistical mechanical approaches.

Panel

Quantum Computing in the Era of Heterogeneity: What Is the Role of Quantum When We Have So Many Accelerators? Can Quantum Work Together with Other Accelerators? What Are the Applications?

1:30pm - 3:00pm CST Tuesday, 18 November 2025 231-232

Applications & Application Frameworks

SC Community Hot Topics

Livestreamed

Recorded

DescriptionQuantum computing is a growing accelerator technology with more computing centers investigating the use of such systems as part of their long-term HPC strategy. However, while the potential computational power of quantum computers for targeted problems is generally accepted in the HPC and QC communities, concrete application use cases are still a rarity. At the same time, there is a plethora of accelerators in this "Era of Heterogeneity." For this panel, we have invited five experts from different areas, backgrounds, and regions. All are investigating the feasibility and opportunities found in different acceleration approaches. We will discuss with them how they see the accelerator landscape developing and what the role of quantum is. With this panel, we aim to provide a realistic picture of where quantum should be in the field of acceleration and what applications are relevant, thereby offering the audience a well-founded picture based on first-hand experiences.

Exhibitor Forum

Quantum Computing: Tackling Hard Problems with Energy-Efficient Computation

4:30pm - 5:00pm CST Wednesday, 19 November 2025 130

Quantum & Other Post Moore Computing Technologies

Livestreamed

Recorded

XO/EX

DescriptionD-Wave Quantum Inc. develops quantum computers to tackle hard problems with energy-efficient computation. The promise of quantum computing is to extend computation beyond the capabilities of what’s possible with classical computing architectures, the basis of Moore’s Law. This is being realized today, with recent publications on computational supremacy in magnetic material simulation [King].

D-Wave’s annealing quantum computers are installed in the Jülich Supercomputing Centre, the University of Southern California Information Sciences Institute, and Davidson Technologies Inc., as well as D-Wave’s R&D center in Canada. D-Wave’s Leap™ quantum cloud service delivers greater than 99.9% uptime and its systems provide subsecond response times.

D-Wave’s Exhibitor Forum talk will discuss the differences between annealing quantum systems and other quantum computing modalities while highlighting application areas for this technology, like magnetic material simulation, AI/ML, blockchain, and optimization.

Exhibitor Forum

Quantum Geospatial Processing: Converting Classical Algorithms to Quantum Processing Pipelines

1:30pm - 2:00pm CST Wednesday, 19 November 2025 130

Quantum & Other Post Moore Computing Technologies

Livestreamed

Recorded

XO/EX

DescriptionQuantum computing provides an opportunity to greatly accelerate geospatial processing speeds. This session introduces a quantum machine learning (QML) methodology for converting classical algorithms into quantum-ready workflows. In our session we use the scale-invariant feature transform (SIFT) algorithm to demonstrate how QML can recast SIFT’s classical calculations into "quantum-kernels." We explain how design-time analytics such as quantum game theory and empirical parameter estimation yield better initial designs. We then describe how hybrid simulations within our Quantum Circuit Factory use variational and neuroevolutionary algorithms to optimize the quantum-kernels into implemented circuits. Finally, we demonstrate how our factory orchestrates these circuits to mirror classical SIFT, providing quantitative comparisons of classical and quantum processing pipelines. Attendees will gain insights into the engineering involved with converting classical algorithms to quantum, the tools and techniques that support this process, and initial methodologies for comparing classical and quantum dataflows.

Workshop

Quantum Knapsack: Optimizing Resource Allocation with QAOA

2:05pm - 2:10pm CST Monday, 17 November 2025 276

Partially Livestreamed

Partially Recorded

DescriptionCombinatorial Optimization (CO) is one of the most important areas in the field of optimization, with practical applications found in every industry, including both the private and public sectors. In recent years, it was discovered that a mathematical formulation known as QUBO (Quadratic Unconstrained Binary Optimization) problem can embrace an exceptional variety of important CO problems found in the industry. In this work, we explore the Quantum Approximate Optimization Algorithm (QAOA) as a quantum computing approach to solve the Knapsack Problem efficiently by formulating the problem as a QUBO model. Here we implemented QAOA on a quantum emulator, leveraging quantum superposition and entanglement to explore multiple solutions simultaneously. The method offers a promising route for solving large, complex scheduling and resource allocation problems that are tough for classical algorithms, with potential applications in logistics, HPC resource scheduling, and energy optimization.

Art of HPC

Quantum Universality

8:00am - 6:00pm CST Sunday, 16 November 2025 Art of HPC - Plaza Lobby

Art of HPC

Not Livestreamed

Not Recorded

TUT

XO/EX

DescriptionThe artwork explores the phenomena of quantum universality by illustrating the variance of coefficients of the characteristic polynomial for sequences of quantized circulant networks approaching the semiclassical limit, which is the limit of large quantum networks.

Quantum physics describes the behavior of the world at the scale of nanotechnology, where particles behave like waves. Quantum networks of waves in a spider’s web of wires are used to model quantum physics in a complex geometry. A phenomenon observed with quantum physics in a complex environment is universality, where many different quantum systems display the same statistical properties. Universality is possible where quantum waves have large energies or, equivalently, in large networks. In this work, it can be seen as rings of progressively smaller dots approach a constant hue.

We computed the variance of the coefficients of the characteristic polynomial for sequences of quantized circulant networks of increasing size. The radius scales inversely with the size of the network and the hue corresponds to the value of the coefficients.

The characteristic polynomial encodes the spectrum of allowed energy values of the network, which is analogous to the spectrum of musical tones and overtones of a violin. Circulant networks consist of points arranged on a circle where wires connect each point to a set of its closest neighbors, so the network has a symmetry under rotations.

Workshop

Quantum-Centric Supercomputing @ RPI

3:30pm - 4:00pm CST Sunday, 16 November 2025 275

Livestreamed

Recorded

DescriptionA quantum-centric supercomputer (QCSC) is a system that integrates quantum computing with traditional high-performance computing (HPC) resources. This next-generation, hybrid approach targets solving complex, real-world problems that are currently beyond the capabilities of classical supercomputers alone. Under this execution paradigm, the strengths of both computing modalities are integrated to enable a hybrid quantum-HPC program. This presentation will describe the RPI and IBM quantum-centric supercomputing (QCSC) architecture, which integrates the AiMOS supercomputer with RPI's 127-qubit Eagle-class quantum computer. From there, demonstration applications that rely on the QCSC architecture will be presented.

Workshop

Quantum-classical resource & workflow management

2:40pm - 3:00pm CST Sunday, 16 November 2025 275

Livestreamed

Recorded

Workshop

Quantum-HPC: Achievements, Challenges, and the Road Ahead

8:33am - 9:10am CST Friday, 21 November 2025 276

Livestreamed

Recorded

DescriptionAbstract: Quantum–HPC integration has advanced from early experiments to operational prototypes that connect quantum accelerators with leadership-class supercomputers. This talk reviews what has been achieved so far like hybrid runtimes, middleware layers, and workflow orchestration bridging HPC and quantum systems, and what continues to be challenges, like multiple emerging solution and standards, adoption, use cases, and sustained software ecosystems.
It also examines structural challenges such as aligning hardware roadmaps with software readiness and maintaining realistic expectations of computational utility. Rather than promising speedups, the focus is on what hybrid infrastructures enable today: reproducible experimentation, system co-design, and preparation for future fault-tolerant regimes.

Birds of a Feather

Randomized Numerical Linear Algebra in HPC: Toward a Sustainable, Scalable Software Ecosystem

12:15pm - 1:15pm CST Tuesday, 18 November 2025 126

Algorithms

Livestreamed

Recorded

XO/EX

DescriptionIn recent years, randomized numerical linear algebra (RandNLA) proved to be more than a theoretical novelty: projects like RandLAPACK demonstrate its practical value across architectures, and projects like RandBLAS build trust in randomization as a tool for high-performance NLA. This BoF considers two main questions. First, what are the pressing issues in software standards and implementation that need to be resolved for RandNLA to become a core component of HPC? Second, how can we mobilize a community effort to make progress on these issues? The BoF will engage the audience to discuss the idea of growing the role of RandNLA in high performance computing and what it would take to scale from niche prototypes to robust, production-quality software libraries.

Research and ACM SRC Posters

Range Search on Heterogeneous Systems with Processing-in-Memory Architecture

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionThe growing volume of data in high performance computing (HPC) has made spatial query processing increasingly challenging due to high data transfer costs and limited memory bandwidth. To address these bottlenecks and reduce energy wasted on data movement, this work explores processing-in-memory (PIM) systems by executing range queries directly inside memory chips. Unlike prior PIM studies centered on linear scans or hash-based queries, this work is the first to map R-tree range queries onto PIM hardware. The proposed broadcast-based method constructs the R-tree bottom-up on the CPU, broadcasts top levels to UPMEM DPUs (DRAM processing units) for global filtering, and distributes lower levels for parallel batched queries in a CPU–DPU system. On the Lakes dataset (8M rectangles), it achieves 8× speedup over sequential CPU baselines, with synthetic benchmarks up to 10.9×. These results highlight the promise of PIM-based heterogeneous systems for scalable, energy-efficient spatial query processing in HPC workloads.

Workshop

Rapid Quantum Network Simulation Design with a Path to Scalable Execution

12:00pm - 12:10pm CST Sunday, 16 November 2025 266

short paper

Livestreamed

Recorded

DescriptionAs quantum networking grows in importance, its study is of interest to an ever wider community. Several simulation frameworks allow for testing such systems on commodity hardware, but can be difficult to work with and performance-limited due to their predominantly serial nature. The SeQUeNCe simulator addresses the latter issue, though has not been proven to work well across architectures or larger scales. For the former concern, we introduce BISQIT, a block-diagramme-based framework that models in terms of distinct components and the data flows between them.
This provides a simple and modular approach to experimental design that allows for rapid iteration with a library of reusable parts. We demonstrate the flexibility of its design for prototyping and show a path for how to migrate designed experiments to SeQUeNCe for production-scale testing. Our results show the simplicity of the BISQIT model and provide new insight into SeQUeNCe's scalability behaviour using ORNL's Frontier.

Paper

RAPTOR: Practical Numerical Profiling of Scientific Applications

3:30pm - 3:52pm CST Tuesday, 18 November 2025 274

Algorithms

Applications

Architectures & Networks

Livestreamed

Recorded

DescriptionThe proliferation of low-precision units in modern high-performance architectures increasingly burdens domain scientists. Historically, the choice in HPC was easy: Can we get away with 32-bit floating-point operations and lower bandwidth requirements, or is FP64 necessary? Driven by artificial intelligence, vendors introduce novel low-precision units for vector and tensor operations, and FP64 capabilities stagnate or are reduced. This forces scientists to re-evaluate their codes, but a trivial search-and-replace approach to go from FP64 to FP16 will not suffice.

We introduce RAPTOR: a numerical profiling tool to guide scientists in their search for code regions where precision lowering is feasible. Using LLVM, we transparently replace high-precision computations using low-precision units, or emulate a user-defined precision. RAPTOR is a novel, feature-rich approach—with a focus on ease of use—to change, profile, and reason about numerical requirements and instabilities, which we demonstrate with four real-world multi-physics Flash-X applications.

ACM Gordon Bell Climate Modeling Finalist

ACM Gordon Bell Finalist

Awards and Award Talks

Real-Time Bayesian Inference at Extreme Scale: A Digital Twin for Tsunami Early Warning Applied to the Cascadia Subduction Zone

1:52pm - 2:15pm CST Tuesday, 18 November 2025 261-262-265-266

Applications

Livestreamed

Recorded

DescriptionWe present a Bayesian inversion-based digital twin that employs acoustic pressure data from seafloor sensors, along with 3D coupled acoustic-gravity wave equations, to infer earthquake-induced spatiotemporal seafloor motion in real time and forecast tsunami propagation toward coastlines for early warning with quantified uncertainties. Our target is the Cascadia subduction zone, with one billion parameters. Computing the posterior mean alone would require 50 years on a 512 GPU machine. Instead, exploiting the shift invariance of the parameter-to-observable map and devising novel parallel algorithms, we induce a fast offline-online decomposition. The offline component requires just one adjoint wave propagation per sensor; using MFEM, we scale this part of the computation to the full El Capitan system (43,520 GPUs) with 92% weak parallel efficiency. Moreover, given real-time data, the online component exactly solves the Bayesian inverse and forecasting problems in 0.2 seconds on a modest GPU system, a ten-billion-fold speedup.

SCinet Network Research Exhibition

Real-Time In-Network Machine Learning and P4 Testbed Deployment on FPGA SmartNICs, DPUs, and Switches

12:40pm - 1:00pm CST Wednesday, 19 November 2025 Booth 3537 - SCinet Theater

Not Livestreamed

Not Recorded

DescriptionNRI103,NRI104,NRI106

Research and ACM SRC Posters

Real-Time ML-Based Defense Against Malicious Payload in Reconfigurable Embedded Systems

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionField-programmable gate arrays (FPGAs) in reconfigurable systems face escalating security threats from malicious bitstreams capable of causing denial-of-service, data leakage, or covert operations. Traditional detection methods often require source code or netlists, limiting their applicability for real-time protection.

We present a supervised machine learning approach that directly analyzes FPGA bitstreams at the binary level, enabling rapid detection without design-level access. Using byte frequency analysis, truncated singular value decomposition (TSVD), and SMOTE balancing, we developed and evaluated multiple classifiers on a dataset of 122 benign and malicious configurations for the Xilinx PYNQ-Z1 board. Random Forest achieved a macro F1-score of 0.97, validating the method’s effectiveness for resource-constrained devices.

The final model was deployed on PYNQ for integrated, on-device analysis. During the poster session, we will outline our detection pipeline, dataset preparation process, and performance results, emphasizing the novelty of binary-level analysis and its implications for real-time Trojan detection in embedded systems.

Birds of a Feather

Real-Time Scientific Data Streaming to HPC Nodes: Challenges and Innovations II

5:15pm - 6:45pm CST Wednesday, 19 November 2025 130

State of the Practice

Livestreamed

Recorded

XO/EX

DescriptionEmerging scientific needs are driving a new class of workflows that require near real-time processing, urgent computing, and time-sensitive decision-making. These workflows must bypass traditional buffering and shared file systems, instead relying on direct data streaming over WAN into HPC compute nodes. Latency and variability must be minimized to enable timely responses and experiment steering. This BoF will explore strategies, technologies, and policies to support these streaming workflows at scale, with a focus on building shared infrastructure and community practices to routinely enable this next generation of high-impact, time-critical scientific computing.

Workshop

Redesigning GROMACS Halo Exchange: Improving Strong Scaling with GPU-initiated NVSHMEM

2:40pm - 3:00pm CST Sunday, 16 November 2025 230

Livestreamed

Recorded

DescriptionImproving time-to-solution in molecular dynamics simulations often requires strong scaling due to fixed-sized problems.
GROMACS is highly latency-sensitive, with peak iteration rates in the sub-millisecond, making scalability on heterogeneous supercomputers challenging.
MPI's CPU-centric nature introduces additional latencies on GPU-resident applications' critical path, hindering GPU utilization and scalability.
To address these limitations, we present an NVSHMEM-based GPU kernel-initiated
redesign of the GROMACS domain decomposition halo-exchange algorithm.
Highly tuned GPU kernels fuse data packing and communication, leveraging hardware latency-hiding for fine-grained overlap.
We employ kernel fusion across overlapped data forwarding communication phases and utilize the asynchronous copy engine over NVLink to optimize latency and bandwidth. Our GPU-resident formulation greatly increases communication-computation overlap, improving GROMACS strong scaling performance across NVLink by up to 1.5x (intra-node) and 2x (multi-node), and up to 1.3x multi-node over NVLink+InfiniBand.
This demonstrates the profound benefits of GPU-initiated communication for strong-scaling a broad range of latency-sensitive applications.

Paper

RedSan: A Redundant Memory Instruction Sanitizer for GPU Programs

2:15pm - 2:37pm CST Tuesday, 18 November 2025 263-264

Performance Measurement, Modeling, & Tools

System Software and Cloud Computing

Livestreamed

Recorded

DescriptionCUDA is the de facto programming model for GPUs, widely used in the domains of HPC and AI. To obtain bare-metal performance, vendors and academics develop various profiling tools to guide optimization. However, most existing tools focus on hotspot analysis with limited capabilities in identifying actionable opportunities. To complement existing tools, we present RedSan, a novel profiling tool that leverages binary instrumentation to identify redundant instructions in fully optimized CUDA programs. Guided by RedSan, we are able to optimize programs such as PolybenchGPU, Rodinia, PASTA, DARKNET, and LULESH, yielding up to a 6.27× speedup and 3.00× reduction in memory instructions.

Workshop

Reduction-Aware Directive-Based Programming via Multi-Dimensional Homomorphisms

2:30pm - 3:00pm CST Monday, 17 November 2025 266

Livestreamed

Recorded

DescriptionDirective-based programming is a productive way to target parallel architectures like GPUs and CPUs. Popular solutions such as OpenMP and OpenACC are widely used because they are simple and broadly applicable to general-purpose codebases. However, they often fail to deliver consistently high and portable performance, particularly for reduction-intensive computations.

We introduce a new directive design based on Multi-Dimensional Homomorphisms (MDH). Unlike existing methods, this MDH-based directive focuses on data-parallel computations, such as tensor expressions, achieving superior and portable performance even for reduction-heavy workloads found in deep learning, data mining, and quantum chemistry. It also maintains or improves programmer productivity by using Python as the host language.

Experiments show that our approach outperforms current directive-based solutions across various workloads, including linear algebra, stencil computations, data mining, quantum chemistry, and deep learning.

Workshop

Reproducible Performance Evaluation of OpenMP and SYCL Workloads under Noise Injection

4:30pm - 5:00pm CST Sunday, 16 November 2025 265

Livestreamed

Recorded

DescriptionPerformance instability caused by unpredictable system noise remains a persistent challenge in high-performance and parallel computing. This work presents a reproducible methodology to characterize this variability through noise injection, tested using workloads implemented in OpenMP and SYCL to compare their performance resilience under noisy conditions. We design a noise injector that captures real system traces and replays the deltas as controlled noises. Using this approach, we evaluate multiple mitigation efforts, that is, thread pinning, housekeeping core isolation, and simultaneous multithreading (SMT) toggling, under both default and noise-injected executions. Experiments with two benchmarks (N-body, Babelstream) and one mini-application (MiniFE) across two processor platforms show that while OpenMP consistently achieves higher raw performance, SYCL tends to exhibit greater resilience in noisy environments. Mitigation effectiveness varies with workload characteristics, system configuration, and noise intensity, with housekeeping core isolation offering the clearest benefits, particularly in high-noise scenarios.

Workshop

RESDIS'25 Morning Break

10:00am - 10:30am CST Sunday, 16 November 2025 265

Livestreamed

Recorded

Workshop

RESDIS'25 Welcome and Introduction

9:01am - 9:05am CST Sunday, 16 November 2025 265

Livestreamed

Recorded

Panel

Research Software Engineering in the Age of AI

1:30pm - 3:00pm CST Thursday, 20 November 2025 240-241-242

AI, Machine Learning, & Deep Learning

HPC Software & Runtime Systems

SC Community Hot Topics

Livestreamed

Recorded

DescriptionAI has drastically altered the area of software engineering. In this session, we will discuss its impact in the context of HPC and research software engineering: development and maintenance of the software used in scientific computing and research. Potential uses of AI include developing new code, developing tests, testing and debugging code, reducing technical debt, porting code, optimizing code, updating software when dependencies are updated, generating code documentation, and building workflows of existing components. Panelists who are both researchers and software developers/maintainers will discuss their experiences in AI and research software, and how they expect things to change in the short term and the long term.

Workshop

Research Software Engineers in HPC (RSE-HPC-2025)

9:00am - 9:05am CST Sunday, 16 November 2025 267

Livestreamed

Recorded

DescriptionResearch software engineers (RSEs) are critical to the impact of HPC, data science, and the larger scientific community. They have existed for decades, though often not under that name. The past several years, however, have seen the development of the RSE concept, common job titles, and career paths; the creation of professional networks to connect RSEs; and the emergence of RSE groups in universities, national laboratories, and industry.

This workshop will bring together RSEs and allies involved in HPC, from all over the world, to grow the RSE community by establishing and strengthening professional networks of current RSEs and RSE leaders. We'll hear about successes and challenges that RSEs and RSE groups have experienced, and discuss ways to increase awareness of RSE opportunities and improve support for RSEs.

The workshop will be highly interactive, featuring breakout discussions and panels, as well as invited addresses and submitted talks.

Workshop

RESILIO : A Scalable and Composable Architecture for Tomographic Reconstruction Workflows

9:06am - 9:24am CST Monday, 17 November 2025 264

Livestreamed

Recorded

DescriptionTomographic reconstruction (TR) aims to reconstruct a 3D object from 2D projections. It is an important technique across domains such as medical imaging and materials science, where high-resolution volumetric data is essential for decision-making. With advanced facilities such as the upgraded APS enabling unprecedented data acquisition rates, TR pipelines struggle to handle large data volumes while maintaining low latency, fault tolerance, and scalability. Traditional, tightly coupled, batch-oriented workflows are increasingly inadequate in such high-performance contexts. In response, we propose RESILIO , a composable, high-performance TR framework built atop the Mochi ecosystem that uses persistent streaming and fully leverages HPC platforms. Our design enables scalable and elastic execution across heterogeneous environments. We contribute a reimagined TR architecture, its implementation using Mochi, and an empirical evaluation showing up to 3490× reduction in the per-event overhead compared to the original implementation, and up to 3268× improvement in throughput with performance-tuned configurations using Mofka.

Panel

Rethink Computing: Pioneering Next-Level Architectures for Sustainable AI and HPC

3:30pm - 5:00pm CST Tuesday, 18 November 2025 231-232

AI, Machine Learning, & Deep Learning

Architectures

SC Community Hot Topics

Livestreamed

Recorded

DescriptionAs AI and high performance computing (HPC) demands surge, traditional electronic architectures are hitting their limits. This panel explores the next wave of computing, spotlighting photonic computing—a revolutionary approach that uses light instead of electrons to achieve unprecedented energy efficiency, bandwidth, and computational density. We will position photonic computing within a broader landscape of emerging technologies, including heterogeneous ecosystems that blend digital chips (CPUs, GPUs, TPUs) and analogue or quantum systems. Panelists will examine the challenges of power consumption, data movement, and integration with current infrastructures, offering insights into how next-generation architectures can meet AI and HPC demands while advancing environmental sustainability. The discussion will highlight key steps for moving photonic computing from lab-scale innovation to industrial-scale deployment, and address the status, challenges, and opportunities for integrating these technologies into today’s computing infrastructure and industry standards.

Paper

Rethinking Back Transformation in Two-Stage Eigenvalue Decomposition on Heterogeneous Architectures

2:15pm - 2:37pm CST Thursday, 20 November 2025 260-267

Algorithms

Livestreamed

Recorded

DescriptionThe two-stage eigenvalue decomposition (EVD) method outperforms conventional one-stage methods on GPUs and heterogeneous architectures, especially when eigenvectors are not required. However, its performance advantage diminishes when performing back transformation to obtain eigenvectors. To address this, we propose two key solutions: 1) replacing BLAS3 operations with BLAS2 operations during the bulge-chasing back transformation for better performance, and 2) reordering the back transformation workflow from a backward pattern to a new parallelism-driven pattern to hide divide-and-conquer latency, at the cost of one additional GEMM computation. Experimentally, the proposed back transformation algorithm demonstrates significant performance improvements, outperforming the SOTA implementation in MAGMA by an average factor of 3.58x. For complete FP64 precision symmetric EVD with eigenvectors, the proposed algorithm, incorporating both solutions, surpasses the SOTA implementations in MAGMA and cuSOLVER by average factors of 2.62x and 2.21x, respectively.

Workshop

Revolutionizing High-Performance Computing: Introducing NextSilicon's Maverick-2 Reconfigurable Accelerators – A Leap Beyond Traditional FPGAs

8:35am - 9:38am CST Friday, 21 November 2025 263-264

Livestreamed

Recorded

DescriptionConventional von-Neumann architectures cannot deliver the efficient performance required by today's demanding HPC and AI workloads. Dataflow and reconfigurable devices are the promising way of the future, if they could only be more developer friendly, attractive and seamless in the software ecosystem. Our invited speaker, Ilan Tayari will tackle the hurdles on this front, and point out the novel approaches in the field, including NextSilicon's recently launched Intelligent Compute Architecture. These novel approaches will revolutionize the way application developers interact with reconfigurable heterogeneous devices.

Exhibitor Forum

Rewiring HPC: Ethernet’s Ascent in a High-Performance World

3:30pm - 4:00pm CST Thursday, 20 November 2025 130

Networking

Livestreamed

Recorded

XO/EX

DescriptionOver the past years, the state of HPC has evolved at an unprecedented rate. As workloads grow in complexity and size, the network infrastructure must keep pace. This session explores Ethernet’s ascent in HPC, diving into its technical evolution, cost-effectiveness, and expansive ecosystem. Drawing comparisons to other interconnects, we’ll examine how Ethernet has matured to meet the demands of modern HPC workloads. Whether you're running an existing interconnect or planning your next deployment, this session will provide valuable insight into the advantages of Ethernet and its growing role in a rapidly changing HPC landscape.

Paper

RingX: Scalable Parallel Attention for Long-Context Learning on HPC

3:52pm - 4:15pm CST Wednesday, 19 November 2025 261-262-265-266

HPC for Machine Learning

Livestreamed

Recorded

DescriptionThe attention mechanism has become foundational to remarkable AI breakthroughs since the introduction of the Transformer, driving the demand for increasingly longer context to power AI frontier models. However, its quadratic computational and memory complexities pose a major challenge. Here, we propose RingX, a set of scalable parallel attention methods, optimized for HPC. Through better workload partitioning, communication scheme, and load balancing, we achieve up to 3.4X speedup compared to the current state-of-the-art on the Frontier supercomputer. RingX is specifically optimized for both bi-directional and casual attention, and its performance and validity are demonstrated by training of both a Vision Transformer (ViT) and a Generative Pre-trained Transformer (GPT), respectively. An end-to-end speedup of about 1.5X is obtained in both applications. To our knowledge, the achieved 38% model FLOPs utilization (MFU) for training Llama3-8B on a 1M-token sequence length using 4,096 GPUs is among the best training efficiencies on HPC systems.

Workshop

RISC-V Vectorization Coverage for HPC: A TSVC-Based Analysis

10:50am - 11:10am CST Monday, 17 November 2025 242

Livestreamed

Recorded

DescriptionThe RISC-V Vector Extension (RVV) introduces scalable, vector-length–agnostic operations with strong potential for high-performance computing (HPC). This paper presents a TSVC-based instruction coverage analysis of RVV to evaluate current compiler auto-vectorization support. We compile TSVC with GCC and Clang under both vector-length agnostic (VLA) and vector-length specific (VLS) modes and analyze the emitted instructions against the RVV specification. Our results quantify instruction usage across key groups, identify missed instructions, and classify the causes of failed vectorization, including compiler backend limitations, absent use cases in TSVC, and non-trivial or unsupported patterns. We also highlight TSVC’s limitations, including ambiguous kernel vectorizability and missing representations of modern HPC-relevant patterns. Finally, we suggest directions for enhancing benchmark suites to better reflect RVV capabilities and guide compiler development for HPC workloads.

Workshop

RL4Sys: A Lightweight System-driven RL Framework for Drop-in Integration in System Optimization

2:00pm - 2:30pm CST Monday, 17 November 2025 230

Data Analytics

High Performance I/O, Storage, Archive, & File Systems

Storage

Livestreamed

Recorded

DescriptionReinforcement learning (RL) has achieved notable success in complex decision-making tasks. Motivated by these advances, systems researchers have explored RL for optimizing system behavior. However, practical deployment remains uncommon, as existing RL frameworks are ill-suited for system-oriented use cases. To address this gap, we present \textbf{RL4Sys}, a lightweight RL framework designed specifically for seamless system-level integration. RL4Sys includes a minimal client that embeds easily within target systems to record trajectories and run inference from locally cached deep policies. RL4Sys's remote RL trainers executed asynchronously and distributed across servers leverage zero-copy gRPC and adaptive batching to update policies without blocking the original system. Our evaluation shows that RL4Sys matches the convergence behavior of conventional RL frameworks and achieves up to 220% higher throughput in environment-oriented settings compared to state-of-the-art systems such as RLlib, while incurring less than 6% runtime overhead relative to the original non-RL system.

Workshop

Roofline Analysis of Tightly-Coupled CPU-GPU Superchips: A Study on MI300A and GH200

9:30am - 10:00am CST Monday, 17 November 2025 231

Performance Evaluation, Scalability, & Portability

Livestreamed

Recorded

DescriptionThe introduction of tightly-coupled heterogeneous architectures, such as AMD's MI300A and NVIDIA's Grace-Hopper(GH200), address a bottleneck in accelerated computing, namely the CPU-GPU interface.
Whereas the GH200 can be seen as a technological leap in CPU-GPU connectivity greatly exceeding PCIe cadence, the unified memory architecture of the MI300A APU enables seamless communication through coherent caches.
When the CPU and GPU execute concurrently, they contend not only for finite bandwidth, but also contend power in a power-constrained environment.
In this paper, we extend the well-established Roofline model to capture the performance implications of contention in concurrent execution on the MI300A and GH200.
We enhance this by noting the impact of different memory allocators, the randomness of data, and the host and device arithmetic intensity.
We conclude with a discussion on the evolution of GPU architectures and the impact in performance, portability, and programmability that emerging tightly-coupled GPUs bring to the HPC landscape.

Workshop

ROSE: RADICAL Orchestrator for Surrogate Exploration

11:50am - 12:10pm CST Monday, 17 November 2025 274

Livestreamed

Recorded

DescriptionScientific computing increasingly uses surrogate models to accelerate high-fidelity simulations, enable real-time predictions, and explore large design spaces. Building surrogates at scale is challenging: simulations are costly, data generation must be managed, and surrogate learning involves large, heterogeneous, evolving workflows. In active learning, where models guide data acquisition, these challenges intensify due to tight coupling between simulation, inference, and training. We present ROSE (RADICAL Orchestrator for Surrogate Exploration), a flexible, portable, and scalable framework supporting the full surrogate modeling lifecycle in HPC environments. ROSE integrates active learning with scalable orchestration, managing asynchronous execution across diverse resources while minimizing user effort. It supports in-situ/ex-situ workflows, online/offline training, and adaptive sampling. Applied to three use cases—electrolyte structure extraction, neutron diffraction structure recovery, and colloid phase classification—ROSE sustains high throughput with low overhead on Polaris, Perlmutter, and Delta, achieving 4–8× end-to-end speedups, with asynchronous orchestration delivering 1.5–3× gains over synchronous baselines.

Workshop

RSE-HPC Featured Talk 1

9:05am - 10:00am CST Sunday, 16 November 2025 267

Livestreamed

Recorded

Workshop

RSE-HPC Featured Talk 2

2:00pm - 3:00pm CST Sunday, 16 November 2025 267

Livestreamed

Recorded

Workshop

RSE-HPC Lightning Talks 1

10:30am - 11:10am CST Sunday, 16 November 2025 267

Livestreamed

Recorded

Description- Yoshii: RSEs and the Future of HPC Architecture through Open, Scalable Chip Design

- Bello: Empowering AI/ML Innovation through Research Software Engineering

Workshop

RSE-HPC Lightning Talks 2

3:30pm - 4:10pm CST Sunday, 16 November 2025 267

Livestreamed

Recorded

Description- Eiffert: Beginner’s Guide to Starting a Research Community (By a Beginner)

- Watson: To Join or Not to Join - RSE Feedback on the Perceived Value of Joining an Open Source Software Foundation

Workshop

RSE-HPC Panel: RSEs and Research Culture

4:10pm - 5:30pm CST Sunday, 16 November 2025 267

Livestreamed

Recorded

Workshop

RSE-HPC Panel: Skills and Training for RSEs

11:10am - 12:30pm CST Sunday, 16 November 2025 267

Livestreamed

Recorded

Workshop

Run-time Energy-Efficiency Optimization for AI and HPC Workloads

2:45pm - 3:00pm CST Sunday, 16 November 2025 264

Livestreamed

Recorded

DescriptionEffective power management is crucial for balancing high performance and environmental impact in the exascale era, particularly for datacenters dominated by massively parallel GPU systems due to the rise of AI. While many strategies rely on deep application knowledge, there is a growing need for application-agnostic approaches. We introduce a node-level power management runtime designed for regular applications, featuring minimal overhead and seamless deployment across any HPC/AI system. Our approach detects, at runtime, repetitive execution patterns via spectral analysis and then traces per-pattern energy consumption. A simple gradient-descent optimizer gradually adjusts the GPU frequency until the least per-pattern energy (i.e., maximum energy efficiency) is found. With this approach, we demonstrate up to a 15% reduction in energy consumption for equivalent computational tasks, with no overhead and minimal impact on execution time. This solution has been validated across a diverse range of AI applications, and we discuss the resulting energy savings.

Workshop

Runtime correctness checking with MUST and assisting tools

9:05am - 10:00am CST Monday, 17 November 2025 261

Debugging & Correctness Tools

HPC Software & Runtime Systems

Livestreamed

Recorded

SCinet

SC25 SCinet — Source of Truth and Automation with Nautobot

10:40am - 11:00am CST Tuesday, 18 November 2025 Booth 3537 - SCinet Theater

Not Livestreamed

Not Recorded

SCinet

SC25 SCinet — Technical Overview

10:20am - 10:40am CST Tuesday, 18 November 2025 Booth 3537 - SCinet Theater

Not Livestreamed

Not Recorded

SC26

SC26 Conference Preview

8:30am - 8:40am CST Thursday, 20 November 2025 America's Ballroom Tu-Th

Livestreamed

Recorded

TUT

XO/EX

DescriptionKevin Hayden, SC26 General Chair, is a Senior Network Engineer at the U.S. Department of Energy’s Argonne National Laboratory. For over 30 years, he has been at the center of designing and operating the high-performance research networks that drive modern science. His work has supported the data-intensive collaborations behind some of the world’s most ambitious scientific efforts.
A member of the SC community for two decades, Hayden has contributed to nearly every part of the conference—from technical operations to leadership—serving as SCinet Chair in 2020. He believes the heart of SC lies in the connections between people, technologies, and ideas. As General Chair, he aims to create a space where collaboration thrives, and every participant can help shape the future of high-performance computing, networking, and data science.

Workshop

Scabbard: LLVM Instrumentation-aided Race Checking in CPU/GPU Unified Memory for AMD GPUs

4:30pm - 5:00pm CST Monday, 17 November 2025 260

Livestreamed

Recorded

DescriptionData Races are bugs whose occurrence in shared memory parallel and concurrent systems can cause unexpected outcomes and undermine the integrity of computational results. CPU/GPU systems that use Unified Heterogeneous Memory—for simplified memory sharing between GPUs and CPUs—are highly prone to data races. Scabbard's design is based on instrumenting the CPU/GPU codes by using LLVM's pass-plugin system to instrument the user's code to record writes, reads and synchronizations into trace files during execution, and subsequently performing offline analysis to report races. The main contribution of this paper is in detailing our algorithms, implementation challenges and the learning process faced by newcomers to the LLVM and ROCm/Hip ecosystems to better the learning experience for new LLVM tool-builders in the CPU/GPU space.

Research and ACM SRC Posters

Scalable Alternative Route Computation with ACE: A C++17 Library for HPC Traffic Simulations

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionWe present ACE (Asynchronous Communication and Execution), a C++17 library for scalable asynchronous task execution on high performance computing (HPC) systems. Integrated into a distributed traffic simulation workflow, ACE accelerates the computation of alternative routes, a key performance bottleneck in large-scale simulations. Unlike the previous Rust-based Evkit approach, ACE eliminates the multi-minute worker-spawning overhead and manages task granularity dynamically. Using scenarios for Prague and the Central Bohemia region, with datasets of up to 25 million routes, ACE achieved up to a 15x speed-up on city-scale workloads with shorter routes and a 1.45x improvement on larger regional workloads. These results highlight ACE’s ability to adapt to workload characteristics and improve both efficiency and scalability in HPC-based route computation.

Research and ACM SRC Posters

Scalable Execution Framework for R on Manycore Systems

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionRCOMPSs is a scalable execution framework that integrates the R programming language with the COMPSs runtime to enable task-based parallel execution on manycore and distributed systems. RCOMPSs extends conventional R workflows by allowing functions to be annotated as tasks, which the runtime system analyzes to construct a task dependency graph (DAG). This graph guides dynamic scheduling, dependency resolution, and data transfers, thereby abstracting parallel execution from the user while preserving correctness. A straightforward example of dataset standardization illustrates the minimal programming effort needed to leverage parallelism. In contrast, more complex applications like K-means clustering demonstrate the framework's capability to represent iterative statistical algorithms in a task-oriented manner. Performance evaluation on Shaheen-III and MareNostrum~5 shows strong scalability up to 32 nodes with near-linear speedup, efficient weak scalability with increasing problem sizes, and effective utilization of up to 128 and 80 threads per node, respectively.

Workshop

Scalable Hydrodynamics on multiple Field-Programmable Gate Arrays (FPGAs)

5:00pm - 5:20pm CST Sunday, 16 November 2025 231

Livestreamed

Recorded

DescriptionHydroDynamics (HD) and MagnetoHydroDynamics (MHD) simulations play a central role in modeling physical processes in fields as diverse as astrophysics, nuclear fusion, and plasma physics. These simulations often involve the resolution of hyperbolic systems of partial differential equations using finite-volume methods, yielding stencil-like computation patterns over structured grids.

While CPUs and GPUs have long dominated HPC workloads, their efficiency is increasingly challenged by the need for better control over memory access patterns and energy consumption. In contrast, Field-Programmable Gate Arrays (FPGAs) offer reconfigurable hardware with customizable memory hierarchies and dataflow pipelines, making them especially appealing for streaming and memory-bound applications. Their potential to reduce the energy footprint of scientific simulations has attracted growing attention, especially in the context of exascale computing, where power constraints are becoming a limiting factor.

Research and ACM SRC Posters

Scalable Multi-Node Multi-GPU Datalog Engine with Energy-Aware Profiling

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionExascale computing, powered by GPUs, is reshaping high-performance computing. Declarative languages such as Datalog naturally benefit from this shift, as recursive rules can be compiled into GPU-optimized relational operations. Unlike SQL, Datalog executes queries iteratively until a fixed point is reached, making it ideal for graph mining, deductive database, and symbolic AI. Existing engines (SLOG, LogicBlox, and Soufflé) target multi-core architectures and lack support for distributed multi-GPU systems. We address this gap with MNMGDatalog, the first multi-node, multi-GPU Datalog engine, which combines CUDA for intra-node parallelism with MPI for inter-node communication. Our design introduces GPU-parallel joins, scalable recursive aggregation, and iterative all-to-all communication strategies. To assess performance and efficiency, we developed Powerlog, the first GPU-based Datalog engine energy profiler. Experiments on Argonne’s Polaris supercomputer show up to 32× speedups over state-of-the-art distributed engines and reveal tradeoffs between scaling and energy use, establishing a foundation for energy-aware declarative analytics at scale.

Workshop

Scalable Neural Network Training: Distributed Data-Parallel Approaches

4:15pm - 4:45pm CST Monday, 17 November 2025 266

Livestreamed

Recorded

DescriptionTraining large neural networks is computationally demanding and often limited by synchronization overhead in distributed environments. Traditional data-parallel frameworks, such as Horovod or PyTorch DDP, average gradients at every batch, which can limit scalability due to communication bottlenecks.
In this work, we propose two novel data-parallel strategies that reduce synchronization by averaging weights and biases only at the end of each epoch. These methods are implemented using the PyCOMPSs task-based programming model and integrated into dislib, enabled by a new distributed tensor abstraction (ds-tensor) that supports multidimensional data structures suitable for deep learning workloads.

We evaluate our approach on classification and regression tasks using real-world datasets and federated learning scenarios. Results show up to 95% training time reduction and strong scalability up to 64 workers, while maintaining or improving model accuracy. Our strategies enable asynchronous, communication-efficient training and are well-suited for heterogeneous and large-scale HPC systems.

Workshop

Scalable Softmax for Efficient Attention: Parallel and Distributed Strategies

11:35am - 11:40am CST Sunday, 16 November 2025 261

Livestreamed

Recorded

DescriptionAs large-scale deep learning models become integral to scientific discovery and engineering applications, it is increasingly important to teach students how to implement them efficiently and at scale. This paper presents a coding assignment that focuses on optimizing the Softmax function, a central component of many deep learning models, including attention mechanisms in transformer models. The assignment is designed for an undergraduate level Distributed Computing course (CPE 469, 10-week quarter system), and tailored to students with little or no prior experience in machine learning.

This assignment is one of seven designed to reinforce the foundational concepts of parallel programming. It was developed as part of an inquiry-based learning approach \cite{ibl1}, \cite{ibl2}, encouraging students to actively investigate, experiment, and discover solutions to real-world challenges. The assignment introduces essential deep learning concepts, then guides students through identifying independent tasks within the Softmax computation so they can implement parallel solutions using OpenMP and CUDA.

By integrating modern AI workloads into an HPC curriculum, this work equips students with both the conceptual understanding and practical experience needed to build scalable solutions in scientific computing.

Workshop

Scalable, High-Fidelity Monitoring of Application Communication Patterns in Vernier

4:30pm - 5:00pm CST Monday, 17 November 2025 241

Livestreamed

Recorded

DescriptionUnderstanding the irregular, dynamic communication patterns in HPC applications at scale is critical when evaluating potential software optimizations and hardware architectures. Current systems monitor communication behavior for entire applications as exhaustive traces or general-purpose aggregated statistics. Generally, these approaches often do not scale well and the data gathered is often too generic or inflexible to make specific hardware/software optimizations. This paper describes a new, configurable, histogram-based approach to gathering scalable, high-fidelity monitoring information about HPC communication that we implemented in the Vernier communication monitoring system. This approach enables targeted collection of statistical data about annotated communication patterns for online or offline analysis, benchmarking, or network simulations. We assess these capabilities by collecting communication patterns from several production HPC applications at scale, showing that the resulting statistical representations accurately characterize the communication patterns in these applications, and can be used to provide new insights into communication patterns of complex HPC applications.

Workshop

Scaling All-to-All Operations Across Emerging Many-Core Supercomputers

2:00pm - 2:30pm CST Sunday, 16 November 2025 260

Livestreamed

Recorded

DescriptionPerformant all-to-all collective operations in MPI are critical to fast Fourier transforms, transposition, and machine learning applications. There are many existing implementations for all-to-all exchanges on emerging systems, with the achieved performance dependent on many factors, including message size, process count, architecture, and parallel system partition. This paper presents novel all-to-all algorithms for emerging many-core systems. Further, the paper presents a performance analysis against existing algorithms and system MPI, with novel algorithms achieving up to 3x speedup over system MPI at 32 nodes of state-of-the-art Sapphire Rapids systems.

Workshop

Scaling Hybrid Quantum--HPC Applications with the Quantum Framework

9:45am - 10:00am CST Friday, 21 November 2025 275

Livestreamed

Recorded

DescriptionHybrid quantum–high performance computing (Q-HPC) workflows are emerging as a key strategy for running quantum applications at scale on noisy intermediate-scale quantum (NISQ) devices. These workflows must operate seamlessly across diverse simulators and hardware backends since no single simulator offers the best performance for every circuit type. Efficiency depends strongly on circuit structure, entanglement, and depth, making backend-agnostic execution essential for fair benchmarking, platform selection, and the identification of quantum advantage opportunities. We extend the Quantum Framework (QFw), a modular HPC-aware orchestration layer, to integrate local simulators (Qiskit Aer, NWQ-Sim, QTensor, TN-QVM) and a cloud backend (IonQ) under a unified interface. Benchmarking variational and non-variational workloads reveal workload-specific strengths: Qiskit Aer’s matrix product state excels for large Ising models, NWQ-Sim leads on entanglement and Hamiltonian simulation, and distributed NWQ-Sim accelerates optimization tasks. These findings demonstrate that simulator-agnostic, HPC-aware orchestration enables scalable, reproducible Q-HPC ecosystems, advancing quantum advantage.

Exhibitor Forum

Scaling Liquid Cooling in Line with the AI Development Roadmap

4:30pm - 5:00pm CST Tuesday, 18 November 2025 130

Data Analytics

Livestreamed

Recorded

XO/EX

DescriptionAs AI adoption accelerates, data centers face unprecedented power and thermal challenges, with AI-driven workloads often consuming 10x the power of traditional IT. Meeting these demands requires a complete rethinking of infrastructure, where liquid cooling emerges as a critical enabler on the AI development roadmap. This session will examine the AI Heat Density Roadmap to 2030 and the products and architectures required for heat dissipation, focusing on direct liquid cooling, responsiveness across the entire thermal chain, and the connectivity and telemetry needed to protect this powerful but sensitive IT.

We will explore practical, modular approaches for both retrofits and greenfield builds, including liquid-to-refrigerant coolant distribution units (L2R CDUs) that enable phased deployment and hybrid integration with air-cooling systems. Attendees will gain actionable insights into supporting high-density racks, managing escalating heat loads, and implementing real-time management platforms to create resilient, efficient AI factories that can scale alongside rapidly evolving compute demands.

Workshop

Scaling LLM Training Using RDMA over Converged Ethernet

2:50pm - 3:05pm CST Sunday, 16 November 2025 266

Livestreamed

Recorded

DescriptionWe present a comprehensive benchmarking study that evaluates the scaling performance of RDMA over Converged Ethernet (RoCE) and compares it with Infiniband in the context of large-scale LLM training workloads. While Infiniband is traditionally favored for its low-latency, high-bandwidth characteristics, it imposes significant infrastructure and operational costs. RoCE, leveraging commodity Ethernet and RDMA, offers a cost-effective alternative. Through extensive experiments on production clusters, we demonstrate that RoCE can achieve near-linear scaling performance comparable to Infiniband when properly configured. Our analysis spans data sharding strategies, quantization and activation recomputation techniques, batch size tuning, and system-level optimizations, providing practical guidance for designing scalable and efficient AI infrastructure.

Paper

Scaling Out Chip Interconnect Networks with Implicit Sequence Numbers

2:37pm - 3:00pm CST Wednesday, 19 November 2025 260-267

Architectures & Networks

Livestreamed

Recorded

DescriptionAs AI models exceed single-processor limits, cross-chip interconnects are essential for scalable computing. These links transfer cache-line–sized data at high rates, driving adoption of protocols like CXL, NVLink, and UALink for high bandwidth and small payloads. Faster transfers, however, increase error risk. Standard methods such as CRC and FEC ensure link reliability, but scaling to multi-node systems introduces new challenges, including detecting silently dropped flits in switches.

We present Implicit Sequence Number (ISN), which enables precise flit drop detection and in-order delivery without header overhead. We also propose Reliability Extended Link (RXL), a CXL extension integrating ISN to support scalable, reliable multi-node interconnects while preserving flit format. RXL elevates CRC to a transport-layer role for end-to-end data and sequence integrity, while FEC handles link-layer correction. This approach delivers robust reliability and scalability without reducing bandwidth efficiency.

Best Poster Presentations (Research, ACM SRC Grad/Undergrad)

Scaling Singular Values Beyond GPU Memory Limits: Out-of-Core, GPU-Accelerated, and Unified Across Data Precision and Hardware

3:30pm - 3:45pm CST Wednesday, 19 November 2025 230

Research & ACM SRC Posters

DescriptionWe present a unified, out-of-core, GPU-accelerated singular value solver that achieves performance portability across diverse hardware platforms and data precisions for datasets exceeding GPU memory. The singular value decomposition (SVD) is fundamental for processing large-scale datasets, yet the diversity of computing architectures and the proliferation of precision formats pose significant challenges in heterogeneous environments. Traditional HPC libraries require separate implementations for each architecture and precision, limiting scalability and usability. Building on our previous work, where we developed an open-source unified solver achieving performance comparable to vendor-optimized libraries across multiple precisions and GPU platforms, we extend this capability to handle larger-than-memory datasets. We adapt a QR-based communication-hiding strategy to improve the compute-to-communication ratio and leverage Julia's multiple-dispatch for seamless backend integration. Our implementation significantly outperforms CPU-based LAPACK and remains only 3–5× slower than GPU-resident solvers across different hardware and data precision configurations.

Paper

Scaling the Memory Wall Using Mixed-Precision — HPG-MxP on an Exascale Machine

10:52am - 11:15am CST Tuesday, 18 November 2025 263-264

Performance Measurement, Modeling, & Tools

Livestreamed

Recorded

DescriptionMixed-precision algorithms have been proposed as a way for scientific computing to benefit from some of the gains seen for AI on recent high performance computing (HPC) platforms. A few applications dominated by dense matrix operations have seen substantial speedups by utilizing low-precision formats such as FP16. However, a majority of scientific simulation applications are memory bandwidth limited. Beyond preliminary studies, the practical gain from using mixed-precision algorithms on a given HPC system is largely unclear.

The High Performance GMRES Mixed Precision (HPG-MxP) benchmark has been proposed to measure the useful performance of an HPC system on sparse matrix-based mixed-precision applications. In this work, we present a highly optimized implementation of the HPG-MxP benchmark for an exascale system and describe our algorithm enhancements. We show for the first time a speedup of 1.6x using a combination of double and single precision on modern GPU-based supercomputers.

Birds of a Feather

Scientific Software and the People Who Make It Happen: Building Our Communities and Practices

5:15pm - 6:45pm CST Wednesday, 19 November 2025 131

Community Meetings

Livestreamed

Recorded

XO/EX

DescriptionSoftware has become central to all aspects of modern science and technology. Especially in high performance computing (HPC) and computational science and engineering (CSE), it is becoming ever-larger and more complex while computer platforms evolve and become more diverse. Simultaneously, the teams behind the software are becoming larger, more technically diverse, and more geographically distributed.

This BoF provides an opportunity for people concerned about these topics to share existing experiences and activities, discuss how we can improve on them, and share the results. Presentations and discussion notes will be made available at the BoF series website (http://bit.ly/swe-cse-bof).

SCinet

SCinet Netsec Team Overview

2:40pm - 3:00pm CST Tuesday, 18 November 2025 Booth 3537 - SCinet Theater

Not Livestreamed

Not Recorded

SCinet

SCinet WAN Team Overview

2:00pm - 2:20pm CST Tuesday, 18 November 2025 Booth 3537 - SCinet Theater

Not Livestreamed

Not Recorded

DescriptionAn overview of the SC25 SCinet WAN Team's operation.

Research and ACM SRC Posters

ScODA: An Emerging Pipeline for Evaluating Distributed Database Performance To Support Operational Data Analytics

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionAs high performance computing (HPC) systems scale toward the exascale era, operational data analytics (ODA) play an increasingly central role in managing system security, health, scheduling, and scientific productivity. Supercomputing facilities continuously generate massive volumes of logs and system metrics. To make actionable insights, distributed database management systems (DBMSs) are often employed, but their behavior under realistic production HPC workloads remains underexplored. This poster presents ScODA (Supercomputing Operational Data Analytics), an emerging benchmarking pipeline designed to evaluate distributed DBMS solutions—including relational, document, time-series databases and lakehouse solutions—using real and synthetic HPC environment logs. By working alongside our business intelligence colleagues to systematically model and implement common ODA workflows, ScODA enables data-driven comparisons of competing DBMS platforms and identifies trade-offs in ingestion, querying, and concurrent access. We present our methodology, preliminary benchmarks, and lessons learned from applying ScODA to multiple DBMS platforms at Argonne National Laboratory.

Paper

SDR-RDMA: Software-Defined Reliability Architecture for Planetary-Scale RDMA Communication

2:15pm - 2:37pm CST Wednesday, 19 November 2025 260-267

Architectures & Networks

Livestreamed

Recorded

DescriptionRDMA is vital for efficient distributed training across datacenters, but millisecond-scale latencies complicate the design of its reliability layer. We show that depending on long-haul link characteristics, such as drop rate, distance, and bandwidth, the widely used Selective Repeat algorithm can be inefficient, warranting alternatives like erasure coding. To enable such alternatives on existing hardware, we propose SDR-RDMA, a software-defined reliability stack for RDMA. Its core is a lightweight SDR SDK that extends standard point-to-point RDMA semantics — fundamental to AI networking stacks — with a receive buffer bitmap. SDR bitmap enables partial message completion to let applications implement custom reliability schemes tailored to specific deployments, while preserving zero-copy RDMA benefits. By offloading the SDR backend to NVIDIA’s Data Path Accelerator (DPA), we achieve line-rate performance, enabling efficient inter-datacenter communication and advancing reliability innovation for intra-datacenter training.

Workshop

Seamless End-to-End Containerized HPC Environments

12:00pm - 12:30pm CST Monday, 17 November 2025 275

Livestreamed

Recorded

DescriptionHigh-performance computing environments face increasing challenges from diverse scientific workflows, imposing conflicting demands for stability, customization, and reproducibility that traditional monolithic software stacks cannot accommodate.
We present a comprehensive approach to seamless end-to-end containerized HPC environments which decomposes the technical challenge into five manageable areas:
specification and construction of environments, session provisioning, scheduler integration, system integration, and security.
We develop and evaluate prototypes across these five technical areas, demonstrating practical feasibility through Spack-based environment construction with CI/CD pipelines, transparent session access via PAM and Kubernetes, and flexible job execution using Slurm's native container support.
% Our system integration framework supports multiple MPI strategies from host library injection to fully containerized stacks, accommodating diverse performance and portability requirements.
Through this work, we demonstrate that comprehensive containerization of HPC environments can be achieved using open standards, providing enhanced reproducibility and flexibility without sacrificing user experience.

Research and ACM SRC Posters

Seamless Scaling of Applications Across Programming Models

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionWe present a comparative study of the productivity and performance of four programming languages: Python, Julia, C++, and DaphneDSL, for the Connected Components graph algorithm from the GAP benchmark suite. Using various code productivity metrics, we evaluated the effort of scaling applications from a local parallel version to a distributed implementation. Experiments carried out on the Vega EuroHPC system reveal that, with moderate coding effort, Julia offers the best performance, while DaphneDSL enables seamless distributed execution with no code changes, albeit at a small performance cost.

Tutorial

Secure Coding Practices and Dependency Analysis Tools

1:30pm - 5:00pm CST Monday, 17 November 2025 124

Livestreamed

Recorded

TUT

DescriptionOur goal is to increase the number of people in the workforce who can act as defenders of our high performance computing and data infrastructure. In this tutorial we cover weaknesses from the most recent "Stubborn Weaknesses in the CWE Top 25" list from MITRE. These weaknesses (coding flaws) are the ones most present in real-world security exploits and also the ones that have consistently stayed in the top 25 for at least five years. Attendees will learn how to recognize these weaknesses and code in a way that avoids them. Another issue affecting the security of our cyberinfrastructure is the fact that its software depends upon a myriad of packages and libraries, and those come from different sources. Dependency analysis tools—tools that find weaknesses in the software supply chain and develop a software bill of materials (SBOM)—can catch flaws in those packages and libraries, and that affects the safety of the application. The more programmers are exposed to training in addressing security issues, and the more they learn how to use dependency analysis tools, the bigger the impact that we can make on the security of our cyberinfrastructure.

Workshop

Secure In-Storage Execution of VTK Workloads on Modern Parallel NFS Data Servers

3:30pm - 4:00pm CST Monday, 17 November 2025 230

Data Analytics

High Performance I/O, Storage, Archive, & File Systems

Storage

Livestreamed

Recorded

DescriptionAs data volumes grow, the cost of moving large datasets increasingly limits scientific visualization performance. One promising solution is to analyze data where it is stored. This paper presents a pushdown architecture for pNFS-based storage systems that offloads early stages of a VTK pipeline—such as reading and filtering—to the pNFS data servers that hold the data. Using a FUSE-based interface, a pNFS client triggers remote processing and retrieves results by writing and reading special command and result files. Our design leverages pNFS clients' ability to locate file-resident servers, along with a recent Linux enhancement that enables efficient local access to pNFS data without exposing filesystem internals. Offloaded code runs with the user's credentials, preserving standard permission checks. Experiments with two real-world scientific datasets show up to 6.1× speedup in end-to-end visualization runtime and up to 7.1× in data loading, thanks to early data filtering that significantly reduces data movement.

Workshop

Securing HDF5 Plugins with Digital Signatures

2:03pm - 2:22pm CST Sunday, 16 November 2025 242

Livestreamed

Recorded

DescriptionHDF5 is a popular data management and I/O library used by numerous scientific and industry applications. HDF5 in the recent years pluggable extensions to enhance HDF5’s functionality to improve performance and to utilize underlying hardware and file systems. Plugins play a crucial role in adding custom features, such as compression filters, virtual file drivers (VFDs), and virtual object layer (VOL) connectors, without requiring extensive changes to the source code or modifications to the main library. While the plugin capability gives power to extend HDF5, plugins could be misused for malicious routing of HDF5 calls. To improve the security of HDF5 systematically, in this study, we explore the option of digitally signing plugins. This would help ensure the authenticity and integrity of any plugins that users may use. We discuss a few implementation scenarios in HDF5 and assess the accuracy and overhead associated with the plugin verification process.

Workshop

Seeking Cost-Optimal Infrastructure Size for Distributed Filesystems: A Ceph Case Study

10:55am - 11:20am CST Friday, 21 November 2025 276

Livestreamed

Recorded

DescriptionDistributed Filesystems (DFS) are a crucial component of modern computing environments, and their performance is critical to the success of all the facilities that rely on them. However, predicting
the DFS I/O performance solely based on the storage system hardware is not trivial.
In this paper, we address this challenge by presenting an empirical method that tries to quantitatively assess how hardware configuration choices influence the performance of a DFS using
Ceph as a case study. We investigate the influence of three hardware parameters—number of CPU cores, amount of RAM, and disk bandwidth. To control these variables, we relied on the Linux hotplug interface and Cgroups, avoiding additional software overhead.
Our results reveal that for the analyzed workloads, decreasing hardware resources does not always yield proportional performance losses. This method offers practical insights for designing cost-effective distributed storage systems, remaining general enough to be applied to other filesystems.

Workshop

SENSE in Practice: Quantifying the End-to-End Benefits of Intent-Based Bandwidth Reservation for Exascale Science Workflows

2:35pm - 2:50pm CST Sunday, 16 November 2025 266

Livestreamed

Recorded

DescriptionThe escalating demands of scientific collaborations necessitate advanced networking for deterministic, secure, and orchestrated services across multiple administrative domains. The Software-Defined Network for End-to-end Networked Science at the Exascale (SENSE) paradigm addresses these needs through intent-based networking and multi-domain orchestration. This paper evaluates SENSE's performance on a comprehensive multi-domain testbed, including GNA-G AutoGOLE, the National Research Platform (NRP), FABRIC, and production LHC CMS infrastructure. Our results demonstrate that intent-based service requests are successfully translated into network configurations, with average provisioning times of 183 seconds for simple services and 290 seconds for complex multi-domain workflows. Performance monitoring confirms that SENSE maintains guaranteed bandwidth allocations, enabling higher-priority data flows to complete significantly faster than in best-effort scenarios. This capability transforms the network into a first-class schedulable resource, optimizing scientific workflows by prioritizing data criticality and moving beyond best-effort limitations to achieve predictable and efficient data movement for data-intensive scientific endeavors.

Invited Talk

Serving Innovation and Industrial Competitiveness Through Public Infrastructures

4:15pm - 5:00pm CST Wednesday, 19 November 2025 America's Ballroom Tu-Th

National Strategies

Livestreamed

Recorded

DescriptionCreated in 2007, GENCI is the French HPC agency. As a public HPC infrastructure, it had an initial focus on serving the needs of academic open research. Throughout the years, the missions of this agency have evolved to include AI and quantum computing, and to spread to an ever-growing industrial open research community. It required moving towards a public-private infrastructure continuum, with sovereignty at heart, supported by projects like AI Factory France and CLUSSTER.

The value creation is only possible when users receive the right level of support, with an adequate orientation. That is why GENCI and its partners have set up high-level support teams in HPC and AI, and are currently exploring ways to federate and lead communities towards the use of HPC-QC environments through the HQI program and the Maisons du Quantique network. Convinced by the efficiency of these initiatives, an increasing number of industrial end-users are joining the movement and engaging in collaborations to leverage these computing capabilities to solve practical use cases. This presentation will showcase concrete examples of the results of GENCI's industrial engagement initiatives in HPC, AI, and quantum computing, demonstrating the essential role of public computing infrastructure in supporting technology transfer, sovereignty, and competitiveness.

Workshop

Seventh International Workshop on Emerging Parallel Distributed Runtime Systems and Middleware

8:30am - 8:35am CST Friday, 21 November 2025 261-262-265-266

Livestreamed

Recorded

DescriptionIPDRM invites researchers to present and discuss innovative solutions for runtime systems and middleware, addressing challenges like efficient resource utilization, data movement, memory consistency, task scheduling, energy consumption, and performance portability. This year, we focus on the system software required to bridge post-Moore's Law architectures with classical computing. This workshop will contain both technical papers and invited talks from industry and other practitioners in the field. IPDRM's goal is to promote research, discussion, and collaboration around middleware's role in integrating classical and non-traditional computing systems.

Workshop

Shaping the future workforce: Challenges and lessons learned in HPC education from national labs and computing centers

11:20am - 11:45am CST Monday, 17 November 2025 232

Education & Workforce Development

Livestreamed

Recorded

DescriptionWorkforce training at national laboratories and computing centers is essential and typically falls into two categories: foundational training for newcomers and advanced training for experienced users. Foundational topics such as version control, build systems, and basic HPC usage are widely transferable, while center-specific training varies with compute platforms and workflows. Emerging technologies span both categories, ranging from broadly applicable programming paradigms to hardware-specific skills.

To reduce redundancy and increase impact, national labs, computing centers, and vendors can collaborate by sharing and co-developing materials, co-presenting, and offering joint training events. Such coordination enhances accessibility, scalability, and consistency in HPC training across the community.

One example is the HPC Training Working Group, a community of teaching enthusiasts that meets monthly to exchange best practices, share challenges and solutions, and develop training materials. Their collaborative initiatives highlight the value of cross-institutional efforts in strengthening HPC workforce development.

Research and ACM SRC Posters

Shipping HPC Ecosystems Across Platforms: Portable and Composable HPC Clusters as Code

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionHigh performance computing systems require intricate platform-specific stacks and configurations, which poses a challenge for reproducing the same HPC ecosystem on a different platform, a key requirement for geo-redundancy, business continuity, and urgent computing. We present a method for declaratively defining portable HPC ecosystems that can be deployed rapidly and reliably with a high degree of automation. Our model enables infrastructure-layer portability, going beyond existing cloud-native solutions.

We introduce a two-tiered modular abstraction framework: provider-specific lower-level modules that handle the implementation details; and provider-agnostic high-level modules that define core infrastructure logic, designed around the versatile software-defined clusters (vClusters) developed at CSCS.

To evaluate our approach, we showcase a portable implementation of the Weather and Climate HPC vCluster that runs on the Alps ecosystem, and deploy it on Google Cloud Platform. Our work demonstrates the effectiveness of our declarative approach in migrating HPC systems across heterogeneous platforms.

Research and ACM SRC Posters

Shortcut Mixup Policy: Toward Improving Robustness and Speed in Goal-Conditioned RL

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionNeural networks trained on large datasets can be effective policies for the control of robotic manipulators. Using self-supervised learning, these networks can achieve near-perfect success rates on complex pick-and-place-style tasks. However, the speed of task completion is often a barrier to making learned policies practical for deployment. For instance, tasks that require 500 distinct token predictions will require many forward passes through the network, in real time. Moreover, to learn optimal task behavior—as in reinforcement learning—would require state value assignment across a long time horizon. This is often an impediment to learning. To address these challenges, we present Shortcut Mixup Policy, a method to artificially reduce the task horizon length. Our method consists of training a model on next-token prediction tasks optionally conditioned on a target state-shortcut size. We present initial results using Shortcut Mixup Policy and propose future directions for improvement.

Birds of a Feather

SIGHPC Annual Member Meeting

12:15pm - 1:15pm CST Tuesday, 18 November 2025 145

Translation of HPC into Societal Context

Livestreamed

Recorded

XO/EX

DescriptionThe annual business meeting of SIGHPC is your opportunity to hear about and discuss the status of SIGHPC and its chapters. All of the elected officers and many of the other volunteers will be present to answer your questions about SIGHPC. Representatives from our chapters will also be available. We will also be discussing upcoming plans for the year.

Paper

SIGMo: High-Throughput Batched Subgraph Isomorphism on GPUs for Molecular Matching

3:52pm - 4:15pm CST Wednesday, 19 November 2025 275

Algorithms

Applications

Data Analytics

Livestreamed

Recorded

DescriptionSubgraph isomorphism is a fundamental graph problem with applications in diverse domains. Of particular interest is molecular matching, which uses a subgraph isomorphism formulation for the drug discovery process. While subgraph isomorphism is known to be NP-complete, in molecular matching a number of domain constraints allow for efficient implementations.

This paper presents SIGMo, a high-throughput, portable subgraph isomorphism framework for GPUs, specifically designed for batch molecular matching. SIGMo takes advantage of the specific domain formulation to provide a more efficient filter-and-join strategy: the framework introduces a novel multi-level iterative filtering technique based on neighborhood signature encoding to efficiently prune candidates before the join phase.

SIGMo is written in SYCL, allowing portable execution on AMD, Intel, and NVIDIA GPUs. Our experimental evaluation on a large dataset from ZINC demonstrates up to 1,470x speedup over state-of-the-art frameworks, achieving a throughput of 7.7 billion matches per second on a cluster with 256 GPUs.

Art of HPC

Simulated Worlds and Computed Geometries

8:00am - 6:00pm CST Sunday, 16 November 2025 Art of HPC - Plaza Lobby

Art of HPC

Not Livestreamed

Not Recorded

TUT

XO/EX

DescriptionThis image combines three datasets into a single composition, revealing the beauty of structures explored with high performance computing.

On the left, a large-scale cosmology simulation on the Aurora supercomputer shows the distribution of dark matter in the universe. The visualized subset contains 40 million particles from a large-scale cosmology simulation. These particles were mapped to a continuous density field using a Smooth Particle Hydrodynamics interpolation in ParaView and rendered as a luminous volumetric cloud.

In the center, a procedurally generated 3D wavelet forms a flowing orange structure. Its oscillating patterns evoke both the dynamics of physical waves and the abstract elegance of signal analysis.

On the right, a Mandelbulb fractal emerges from mathematics. Computed with ParaView’s programmable data source, it was isosurfaced, smoothed, and rendered as a detailed polygonal form.

By placing physics simulation and abstract mathematical forms side by side, the image highlights HPC’s ability to bridge natural phenomena and computation, revealing patterns that unite science and art.

Acknowledgments: Simulation data courtesy of the HACC collaboration. This research used resources of the Argonne Leadership Computing Facility, a U.S. Department of Energy (DOE) Office of Science user facility at Argonne National Laboratory, and is based on research supported by the U.S. DOE Office of Science-Advanced Scientific Computing Research Program, under Contract No. DE-AC02-06CH11357.

Workshop

Simulating Hybrid Analog + RISC-V Systems for HPC Applications

9:40am - 9:50am CST Monday, 17 November 2025 242

Livestreamed

Recorded

DescriptionAs digital scaling trends have slowed over the past decade, there has been renewed interest in new computing paradigms such as analog. Analog computing has the potential to provide performance and efficiency beyond what is achievable by digital systems; however, many challenges remain. One such challenge is supporting complex applications using analog components that implement few computational kernels. We consider on a class of hybrid analog + digital systems where analog accelerators are used as tightly integrated coprocessors within each core. The RISC-V ISA simplifies the design of hybrid systems, providing a mature software stack for the digital components allowing system designers to focus on the analog-specific aspects of the architecture and software. To investigate the viability of these architectures for high performance computing we evaluate two iterative linear solvers using hybrid analog + RISC-V processors using the Structural Simulation Toolkit.

ACM Gordon Bell Finalist

Awards and Award Talks

Simulating Many-Engine Spacecraft: Exceeding 1 Quadrillion Degrees of Freedom via Information Geometric Regularization

10:52am - 11:15am CST Tuesday, 18 November 2025 261-262-265-266

Applications

Livestreamed

Recorded

DescriptionWe present an optimized implementation of the recently proposed information geometric regularization (IGR) for unprecedented scale simulation of compressible fluid flows applied to multi-engine spacecraft boosters. We improve upon state-of-the-art computational fluid dynamics (CFD) techniques along computational cost, memory footprint, and energy-to-solution metrics.

Unified memory on coupled CPU-GPU or APU platforms increases problem size with negligible overhead. Mixed half/single-precision storage and computation on well-conditioned numerics is used. We simulate flow at 200 trillion grid points and 1 quadrillion degrees of freedom, exceeding the current record by a factor of 20. A factor of 4 wall-time speedup is achieved over optimized baselines. Ideal weak scaling is seen on OLCF Frontier, LLNL El Capitan, and CSCS Alps using the full systems. Strong scaling is near ideal at extreme conditions, including 80\% efficiency on CSCS Alps with an 8-node baseline and stretching to the full system.

Paper

SIREN: Software Identification and Recognition in HPC Systems

10:30am - 10:52am CST Thursday, 20 November 2025 275

HPC for Machine Learning

State of the Practice

System Software and Cloud Computing

Livestreamed

Recorded

DescriptionHPC systems use monitoring and operational data analytics to ensure efficiency, performance, and orderly operations. Application-specific insights are crucial for analyzing the increasing complexity and diversity of HPC workloads, particularly through the identification of unknown software and recognition of repeated executions, which facilitate system optimization and security improvements. However, traditional identification methods using job or file names are unreliable for arbitrary user-provided names. Fuzzy hashing the content of executables detects similarities despite different code versions or compilation approaches while preserving privacy and file integrity, overcoming these limitations. We introduce SIREN, a process-level data collection framework for software identification and recognition. SIREN improves observability in HPC job execution by enabling analysis of process metadata, environment information, and executable fuzzy hashes. Findings from an opt-in deployment campaign on LUMI show SIREN’s ability to provide insights into software usage, recognition of repeated executions of known applications, and similarity-based identification of unknown applications.

Workshop

Sixth Combined Workshop on Interactive and Urgent High-Performance Computing

8:30am - 8:31am CST Friday, 21 November 2025 260-267

Livestreamed

Recorded

DescriptionInteractivity enables the exploitation of HPC in new and revolutionary ways, delivering many new and exciting opportunities for our community. Interactive HPC involves users being in the loop during job execution where a human is monitoring a job, steering the experiment, or visualizing results to make immediate decisions about the results to influence the current or subsequent interactive jobs. Likewise, urgent computing combines interactive computational modeling with the near-real-time detection of unfolding disasters to take real-time actions. Supporting interactive and urgent workloads on HPC requires expertise in a wide range of areas and the solving of numerous technical and organizational challenges.

This workshop brings together stakeholders, researchers, and practitioners from across interactive and urgent computing with the wider HPC community. We will share success stories, case studies, and technologies to continue community building around leveraging interactive HPC as an important tool responding to disasters and societal issues.

Doctoral Showcase

Sketch-Based Algorithmic Frameworks for Genome-Scale Mapping

2:00pm - 2:15pm CST Thursday, 20 November 2025 230

Research & ACM SRC Posters

Livestreamed

Recorded

DescriptionSketching is a widely used class of techniques aimed at generating compact representations of longer biological sequences. Instead of comparing sequences, sketches allow us to sample from a subspace of k-mers and use those samples for comparison, saving both time and memory in the end application. One of the key metrics to consider here is density, which refers to the fraction of the sampled k-mers retained by the sketch. While a lower density is preferable for space considerations, it could also impact the sensitivity of the mapping process.

In this work, we study sketch-based data sparsification with high performance computing to improve scalability in mapping. Our contributions are twofold: 1) we present a scalable parallel algorithmic framework for alignment-free mapping, called JEM-mapper, and 2) we present a sketch library called MHSketch by extending JEM-mapper to adopt different sequence sketching schemes. Experimental evaluation demonstrates the ability of our approach to significantly reduce density and reap performance benefits from it. In particular, results show that MHSketch achieves accurate mapping while reducing time-to-solution (speedups between 2.2x to 9.3x), and drastically reducing memory usage (>92% savings) compared to other tools.

Workshop

Slicing Is All You Need: Towards A Universal One-Sided Algorithm for Distributed Matrix Multiplication

11:50am - 12:10pm CST Sunday, 16 November 2025 230

Livestreamed

Recorded

DescriptionScience, data analytics, and AI workloads depend on distributed matrix multiplication. Prior work has developed an array of algorithms suitable for different problem sizes, partitionings, and replication factors. A limitation of existing algorithms is that they are limited to a subset of partitionings. Multiple algorithm implementations are required to support the full space of possible partitionings. If no algorithm implementation is available for a set of partitions, one or more operands must be redistributed, increasing communication overhead. We present a one-sided algorithm for distributed matrix multiplication supporting all combinations of partitionings and replication factors. Our algorithm uses index arithmetic to compute sets of overlapping tiles that must be multiplied together. This list of local matrix multiplies can then either be executed directly, or reordered and lowered to an optimized IR to maximize overlap. We implement our algorithm using a high-level C++-based PGAS programming framework, finding it competitive with state-of-the-art systems.

Workshop

SlimIO: Lightweight I/O Path Design for Write Isolation in FDP-backed In-Memory Databases

11:00am - 11:30am CST Monday, 17 November 2025 230

Data Analytics

High Performance I/O, Storage, Archive, & File Systems

Storage

Livestreamed

Recorded

DescriptionIn-Memory Databases (IMDBs) are widely used with HPC applications to manage transient data, often using snapshot-based persistence for backups. Redis, a representative IMDB, employs both snapshot and Write-Ahead Log (WAL) mechanisms, storing data on persistent devices via the traditional kernel I/O path. This method incurs syscall overhead, I/O contention between processes, and SSD garbage collection (GC) delays. To address these issues, we propose SlimIO, which adopts I/O passthru to minimize syscall overhead and inter-process I/O interference.
Additionally, it leverages Flexible Data Placement (FDP) SSDs as backup storage to avoid performance degradation from SSD GC. Experimental results show that SlimIO reduces snapshot time by up to 25%, increases query throughput by up to 30% during non-snapshot periods, and lowers 99.9%-ile latency by up to 50%. Furthermore, it achieves a write amplification factor (WAF) of 1.00, indicating no redundant internal writes, thus extending SSD lifespan.

Paper

SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training

4:15pm - 4:37pm CST Wednesday, 19 November 2025 261-262-265-266

HPC for Machine Learning

Livestreamed

Recorded

DescriptionPipeline parallelism serves as a crucial technique for training large language models, owing to its capability to alleviate memory pressure from model states with low communication overhead. However, in long-context scenarios, existing pipeline parallelism methods fail to address the substantial activation memory pressure, due to the peak memory consumption resulting from the accumulation of activations across multiple microbatches. Moreover, these approaches inevitably introduce considerable pipeline bubbles, further hindering efficiency.

To tackle these challenges, we propose SlimPipe, a novel approach to fine-grained pipeline parallelism that employs uniform sequence slicing coupled with one-forward-one-backward scheduling. It reduces the accumulated activations from several microbatches to just one, which is split into several slices. Although the slices are evenly partitioned, the computation cost is not equal across slices due to causal self-attention. We develop a sophisticated workload redistribution technique to address this load imbalance. SlimPipe achieves near-zero memory overhead and minimal pipeline bubbles simultaneously.

Birds of a Feather

Slurm Community BoF

12:15pm - 1:15pm CST Thursday, 20 November 2025 240-241-242

System Software

Livestreamed

Recorded

XO/EX

DescriptionSlurm is an open-source workload manager used on many of the TOP500 systems and provides a rich set of features.

An updated Slurm community survey will be distributed ahead of the BoF, and introduced at the start of the session.

Changes made in the Slurm 25.05 and 25.11 releases will be presented, alongside the future roadmap for 26.05 and beyond.

Initial results from the community survey will be discussed. Discussion will focus around how Slurm development should react to changes in Linux distribution lifecycles, Linux cgroup versions, container runtimes, and external tools such as MPI/PMIx.

Remaining time will be used as an open community forum.

Everyone interested in Slurm use and development is encouraged to attend.

Flash Session

Smarter Fiber Cabling: Less Space, Less Risk, Less Time

11:00am - 11:15am CST Wednesday, 19 November 2025 Booth 2638 - Flash Session

Not Livestreamed

Not Recorded

XO/EX

DescriptionWith the ever-increasing capacity expansion of AI and GenAI in data centers, fiber-optic cable density has experienced exponential growth while pathways and spaces within cabinets remain static. Efficiencies are needed to accommodate and properly manage the significant cable assembly volume. Properly routing cables on-site between GPUs/CPUs to leaf and spine switches among other applications can be challenging and time-consuming. Co-locating the network electronics with power and cooling apparatus in cabinets also adds to the cable routing challenge. This presentation will review an innovative approach that reduces connection time and risk on-site, provides spare connections for operations, and is optimized for efficiencies in both rack scale GPU/CPU and switch cabinets.

Workshop

SmartIO: A Lightweight End-to-End Workflow for Runtime I/O Optimization of HPC Systems

12:00pm - 12:30pm CST Monday, 17 November 2025 230

Data Analytics

High Performance I/O, Storage, Archive, & File Systems

Storage

Livestreamed

Recorded

DescriptionThe I/O subsystem is often a critical bottleneck in HPC systems, making its optimization crucial for performance. Existing efforts to optimize the I/O in HPC applications rely on approaches that are time consuming and resource intensive. This work introduces SmartIO, an end-to-end workflow to optimize the I/O performance of HPC systems at runtime without requiring prior model training, profiling, or parameter searches. SmartIO leverages context-free grammars to predict future I/O calls during the application runtime, learns key I/O characteristics from predicted calls, and proactively optimizes performance before those calls occur using a rules-based mapping engine. Our evaluation shows that SmartIO achieves up to ∼ 13× and ∼ 12× improvements in IOR read and write bandwidth, respectively, and delivers a ∼ 4× speedup in overall I/O bandwidth for Flash-X—all with negligible overhead. Compared to the state-of-the-art I/O optimization tools, SmartIO delivers comparable or better performance while drastically reducing the tuning cost.

Workshop

SmartNIC Data Exchange Framework

11:30am - 12:00pm CST Friday, 21 November 2025 261-262-265-266

Livestreamed

Recorded

DescriptionAs the field of HPC grows ever larger, now more than ever, it is important to adapt to the rapidly evolving hardware landscape, leveraging the cutting edge and advancing beyond the limits of what is considered conventional computing. Smart Network Interface Cards (SmartNICs) are one such emerging technology that have the potential to overhaul classical computing paradigms. This paper will provide an overview of a novel data exchange framework which leverages SmartNICs to gather arbitrary host data from HPC systems and exchange it via three different methods with minimal system overhead. We discuss the latency with which the framework operates, along with the ways in which its varying configurations affect performance. Finally, we provide some context as to how the field of HPC will benefit from the introduction of SmartNICs, especially as they leverage the presented framework for a variety of future applications.

Workshop

SNAcc: An Open-Source Framework for Streaming-based Network-to-Storage Accelerators

10:53am - 11:16am CST Friday, 21 November 2025 263-264

Livestreamed

Recorded

DescriptionNetwork accessible databases are a common use case in modern data centers, often paired with pre-processing before storing results for later use. However, general purpose CPUs struggle to keep up with current Ethernet line speeds. Furthermore, in such a compute pipeline, the CPU is mostly used to manage storage accesses, wasting compute resources and communication bandwidth.

Due to their wide data path, FPGAs are very suitable for network applications.
Hence, we propose an open-source framework for the seamless high-performance integration of custom FPGA-based network-to-storage accelerators. Our solution leverages the flexible communication interfaces of FPGAs, namely Ethernet and PCIe for direct access to NVMe storage, without host CPU interaction. We are able to saturate the bandwidth of both 100G Ethernet and state-of-the-art SSDs, and demonstrate our implementation in a case study performing DNN-based classification on an image stream.

Birds of a Feather

Spack Community BoF

12:15pm - 1:15pm CST Wednesday, 19 November 2025 275

Community Meetings

Livestreamed

Recorded

XO/EX

DescriptionSpack is an open-source package manager for scientific computing with a rapidly growing community of over 1,500 contributors. This year, Spack has undergone some of the most significant changes in its 12-year history, with the release of v1.0. This BoF will feature an update from developers, covering 1.0’s enhanced compiler dependency model, improved parallelism, and stable package API. We’ll announce version 1.1 with performance and usability improvements, and we will conduct a poll to understand how users have received v1.0. Finally, we’ll open the floor for questions. Help us make installing HPC software simple!

Paper

Sparsified Preconditioned Conjugate Gradient Solver on GPUs

3:30pm - 3:52pm CST Tuesday, 18 November 2025 263-264

HPC for Machine Learning

Performance Measurement, Modeling, & Tools

Programming Frameworks

Livestreamed

Recorded

DescriptionPreconditioned iterative sparse linear solvers are memory-efficient for large scientific simulations, but the dependences between iterations introduced by preconditioners limit parallelization. This issue is exacerbated on GPUs, which feature many parallel cores. We propose a sparsified preconditioned conjugate gradient (SPCG) solver that increases parallelism by reducing dependences through sparsification, while preserving convergence behavior. We evaluate the proposed SPCG using both ILU(0) and ILU(K) preconditioners on a wide range of symmetric positive definite (SPD) matrices. The proposed SPCG improves the performance of the iterative phase of SPCG by a geometric mean speedup of 1.23$\times$ and 1.65$\times$ over the non-sparsified PCG using ILU(0) and ILU(K), respectively, on an NVIDIA A100 GPU. SPCG also yields geometric mean end-to-end speedups of 1.68$\times$ and 3.73$\times$ over the non-sparsified versions with ILU(0) and ILU(K), respectively, on the same platform.

Paper

SparStencil: Retargeting Sparse Tensor Cores to Scientific Stencil Computations via Structured Sparsity Transformation

4:37pm - 5:00pm CST Wednesday, 19 November 2025 260-267

Algorithms

BSP

Livestreamed

Recorded

DescriptionSparse Tensor Cores offer exceptional performance gains for AI workloads by exploiting structured 2:4 sparsity. However, their potential remains untapped for core scientific workloads such as stencil computations, which exhibit irregular sparsity patterns.

This paper presents SparStencil, the first system to retarget sparse TCUs for scientific stencil computations through structured sparsity transformation. SparStencil introduces three key techniques:

(1) Adaptive Layout Morphing, which restructures stencil patterns into staircase-aligned sparse matrices via a flatten-and-crush pipeline;

(2) Structured Sparsity Conversion, which formulates transformation as a graph matching problem to ensure compatibility with 2:4 sparsity constraints;

(3) Automatic Kernel Generation, which compiles transformed stencils into optimized sparse MMA kernels via layout search and table-driven memory mapping.

Evaluated on 79 stencil kernels spanning diverse scientific domains, SparStencil achieves up to 7.1x speedup (3.1x on average) over state-of-the-art framework, while reducing code complexity and matching or exceeding expert-tuned performance in both compute throughput and memory efficiency.

Exhibitor Forum

Speed, Smarts, and Science: Next-Gen HPC Networking at UT Dallas

4:00pm - 4:30pm CST Thursday, 20 November 2025 130

Networking

Livestreamed

Recorded

XO/EX

DescriptionThe University of Texas at Dallas (UT Dallas) launched the High Performance Computing for Research and Education (HPCRE) initiative to support cutting-edge research with advanced computing capabilities. As HPC demands increased, legacy network infrastructure struggled with performance, visibility, and operational efficiency. In response, UT Dallas partnered with Juniper Networks to modernize its HPC network, replacing outdated Cisco and Dell systems with Juniper QFX series switches and Apstra automation. This transformation delivered reduced latency, proactive fault detection, and a unified network view, streamlining collaboration between IT and research teams. The session highlights how automation and predictive analytics improved capacity planning and decreased resolution times. Attendees will learn how a modern network architecture enhanced speed, agility, and reliability in HPC workloads—accelerating research outcomes and fostering innovation. This case study offers valuable insights into reimagining infrastructure to meet the growing demands of high performance computing in academia.

Workshop

Speeding Up Phonon Dynamic Structure Factor Calculations in Phonopy with GPU-Accelerated Computing

11:20am - 11:30am CST Sunday, 16 November 2025 240

Livestreamed

Recorded

DescriptionWe utilized high-performance computing techniques, including GPU accelerators, to speed up calculations of the phonon dynamic structure factor used to model spectroscopy data measured at neutron and x-ray scattering facilities. This faster workflow is a first step toward experimental steering, enabling facility users to make informed decisions during beam time rather than after returning to their home institutions. A collection of functions in Phonopy, a mostly serial Python+C code, was identified as a bottleneck for high-fidelity calculations utilizing hundreds of thousands of points in reciprocal space. We created a proxy application replicating Phonopy’s 𝑟𝑢𝑛_𝑑𝑦𝑛𝑎𝑚𝑖𝑐_𝑠𝑡𝑟𝑢𝑐𝑡𝑢𝑟𝑒_𝑓 𝑎𝑐𝑡𝑜𝑟 function and used the Numba and CuPy libraries to run on the latest NVIDIA GH200 and AMD MI300A GPU accelerators. Two representative use cases, CaHgO2 and CsSnBr3, showed speedups of up to 10× and 15×, respectively. The utilization of high-performance computing accelerators showcases the potential use of Oak Ridge Leadership Computing Facility resources to rapidly analyze experimental data from the Spallation Neutron Source at Oak Ridge National Laboratory.

Research and ACM SRC Posters

SRAP: Sender-Side Receiver-Aware Port Selection for High-Speed Multi-Flow TCP

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionAchieving 400 Gbps requires aggregating multiple flows across cores rather than pushing single-flow limits. However, with 16×25 Gbps TCP flows, random ephemeral ports cause receiver-side packet steering (RSS) to concentrate flows on single CPU cores, degrading throughput from 25 Gbps to below 5 Gbps.

Unlike receiver-side solutions that suffer from cache misses and state migration overhead, SRAP transparently controls source ports at senders, ensuring collision-free mapping without runtime remapping costs or application modification.

With 16×25 Gbps flows, random assignment achieves optimal distribution with probability 1.04\%, causing throughput to vary from 44.8 to 395 Gbps, while our approach consistently delivers 23.3-25 Gbps per flow with guaranteed 1:1 flow-to-core mapping.

Paper

Stability-Preserving Lossy Compression for Large-Scale Partial Differential Equations

2:15pm - 2:37pm CST Thursday, 20 November 2025 261-262-265-266

Algorithms

Applications

State of the Practice

Livestreamed

Recorded

DescriptionCheckpoint/Restart (C/R) strategies are vital for fault tolerance in PDE-based scientific simulations, yet traditional checkpointing incurs significant I/O overhead. Lossy compression offers a scalable solution by reducing checkpoint data size, but conventional methods often lack control over physical invariants (e.g., energy), leading to instability such as oscillations or divergence in partial differential equation (PDE) systems. This paper introduces a stability-preserving compression approach tailored for PDE simulations by explicitly controlling kinetic and potential energy perturbations to ensure stable restarts. Extensive experiments conducted across diverse PDE configurations demonstrate that our method maintains numerical stability with minimal error magnification—even across multiple checkpoint-restart cycles—outperforming state-of-the-art lossy compressors. Parallel evaluations on the Frontier supercomputer show up to 8.4× improvement in checkpoint write performance and 6.3× in read performance, while maintaining relative $L^2$ errors $\sim$2e-6 throughout continued simulation. These results provide practical guidance for balancing compression accuracy, stability, and computational efficiency in large-scale PDE applications.

Workshop

Stackless vs. Stackful Coroutines: A Comparative Study for RDMA-based Asynchronous Many-Task (AMT) Runtimes

3:50pm - 4:10pm CST Sunday, 16 November 2025 230

Livestreamed

Recorded

DescriptionAsynchronous Many-Task (AMT) runtimes manage parallelism by suspending and migrating tasks between processes, with their state captured in continuations. The efficiency of suspending, migrating, and resuming these continuations is critical to application performance.

This work directly compares stackful and stackless coroutines as continuation implementations in a cluster environment using RDMA-based coordinated work stealing. We implement and evaluate two functionally equivalent AMT runtimes for a fine-grained, recursive workload: one using traditional stackful coroutines, and another using C++20 stackless coroutines.

Our results show that both approaches yield nearly identical overall performance for small-state tasks. Stackful coroutines are created 2.4x faster, while stackless coroutines switch context 3.5x faster and have smaller frames. However, the smaller frame size of stackless coroutines does not significantly reduce communication time, which is dominated by network latency. We conclude that both coroutine types are viable, with stackless coroutines offering advantages as task state increases.

SCinet Network Research Exhibition

StarLight Network Prototypes: AIDTNaaS, P4 + FPGA Real-Time Header Process and SciTag at N X 400G

1:40pm - 2:00pm CST Wednesday, 19 November 2025 Booth 3537 - SCinet Theater

Not Livestreamed

Not Recorded

Workshop

State Machine Orchestration of an HPC Workflow in Cloud

11:54am - 12:12pm CST Monday, 17 November 2025 264

Livestreamed

Recorded

DescriptionThe High Performance Computing (HPC) community is facing a period of change, where access to resources is uncertain and workflows must move between available environments. The economic and innovative power of cloud presents opportunity by offering state-of-the-art orchestration frameworks like Kubernetes. However, the existence of environments does not guarantee access to them, and porting HPC applications to cloud is non-trivial. In this work, we redesign the orchestration of an ensemble-based workflow – the Multiscale Machine-Learned Modeling Infrastructure (MuMMI) for Kubernetes. We perform experiments representing a progression from a traditional MuMMI run on HPC to a fully portable variant running in cloud to assess the relative contributions of cloud-native features to workflow performance improvement. Moving from a traditional design based on service and filesystem components to an event-driven design we demonstrate 62.24% and 40.29% faster workflow completion times for CPU and GPU setups, respectively, resulting in 45.0% and 38.3% lower costs.

Community Engagement and Support

State of the Art GPU Code Generation with Reinforcement Learning

4:30pm - 5:00pm CST Tuesday, 18 November 2025 Booth 3537 - SCinet Theater

Not Livestreamed

Not Recorded

XO/EX

DescriptionWriting high performance parallel code takes a long time and offers a steep learning curve. Today's LLMs are helpful, but not quite up to the task. We create an agent architecture with a fine tuned model that achieves state of the art results, allowing anyone to write code for GPUs effectively.

Paper

STELLAR: Storage Tuning Engine Leveraging LLM Autonomous Reasoning for High-Performance Parallel File Systems

1:30pm - 1:52pm CST Wednesday, 19 November 2025 275

Data Analytics, Visualization & Storage

Livestreamed

Recorded

DescriptionI/O performance is crucial to efficiency in data-intensive scientific computing, but tuning large-scale storage systems is complex, costly, and notoriously manpower-intensive, making it inaccessible for most domain scientists. In this study, we propose STELLAR, an autonomous tuner for high-performance parallel file systems. Our evaluations show that STELLAR always selects near-optimal configurations for the parallel file systems within the first five attempts, even for previously unseen applications. STELLAR’s human-like efficiency is fundamentally different from existing auto-tuning methods, which often require hundreds of thousands of iterations to converge. STELLAR achieves this through Retrieval-Augmented Generation, external tool execution, LLM-based reasoning, and a multi-step agent design to stabilize reasoning and combat hallucinations. STELLAR's architecture opens new avenues for addressing complex system optimization problems, especially those characterized by vast search spaces and high exploration costs. Its extremely efficient autonomous tuning will broaden access to I/O performance optimizations for domain scientists with minimal additional resource investment.

Paper

Story of Two GPUs: Characterizing the Resilience of Hopper H100 and Ampere A100 GPUs

3:52pm - 4:15pm CST Wednesday, 19 November 2025 263-264

Applications

Performance Measurement, Modeling, & Tools

State of the Practice

Livestreamed

Recorded

DescriptionThis study characterizes GPU resilience in Delta, a large-scale AI system that consists of 1,056 A100 and H100 GPUs, with over 1,300 petaflops of peak throughput. We used 2.5 years of operational data (11.7 million GPU hours) on GPU errors. Our major findings include:

1) H100 GPU memory resilience is worse than A100 GPU memory, with 3.2x lower per-GPU MTBE for memory errors.

2) The GPU memory error-recovery mechanisms on H100 GPUs are insufficient to handle the increased memory capacity.

3) H100 GPUs demonstrate significantly improved GPU hardware resilience over A100 GPUs with respect to critical hardware components.

4) GPU errors on both A100 and H100 GPUs frequently result in job failures due to the lack of robust recovery mechanisms at the application level.

5) We project the impact of GPU node availability on larger scales and find that significant overprovisioning of 5% is necessary to handle GPU failures.

Paper

StraGCN: GPU-Accelerated Strassen’s Sparse-Dense Matrix Multiplication for Graph Convolutional Network Training

4:15pm - 4:37pm CST Tuesday, 18 November 2025 263-264

HPC for Machine Learning

Performance Measurement, Modeling, & Tools

Programming Frameworks

Livestreamed

Recorded

DescriptionGraph convolutional networks (GCNs) are a fundamental approach to deep learning on graph-structured data. However, they face a significant challenge in training efficiency due to the high computational cost of Sparse-Dense Matrix Multiplication (SpMM). This paper presents StraGCN, the first GPU-accelerated SpMM implementation based on Strassen’s algorithm specifically designed for GCN training. First, we propose a horizontal fusion model for GPU kernels as an alternative to commonly-used multi-stream CUDA models, significantly improving data locality of on-chip shared memory for Strassen’s SpMM. Second, StraGCN exploits the immutability of the adjacency matrix in GCNs to reuse intermediate results from submatrix operations, substantially reducing redundant computations. Third, we propose two-stage matrix partitioning to mitigate load imbalance caused by the irregular distribution of non-zero elements. We evaluate StraGCN with 15 benchmark datasets. Experimental results show that StraGCN achieves performance speedups of 2.1×, 2.6×, and 3.3× compared with state-of-the-art GCN frameworks—GNNA, PyG, and DGL, respectively.

Workshop

StreamHub: High-performance Managed SciStream as a Service

4:50pm - 5:05pm CST Sunday, 16 November 2025 266

Livestreamed

Recorded

DescriptionScientific data streaming enables the real-time transfer, processing, and analysis of high-throughput experimental data, providing low-latency insights that are essential for adaptive experiment control. We present SciStream-as-a-Service (StreamHub), a secure, high-performance, and scalable framework that integrates Globus Compute with SciStream to deliver continuous memory-to-memory data transfer, in-transit processing, and robust zero-trust security while orchestrating the entire streaming setup across distributed facilities with only approximately 4.3s overhead. We describe StreamHubs design, core components, and deployment model, and evaluate its performance under diverse configurations. Our results demonstrate that StreamHub can be deployed without privileged access, ensures end-to-end encryption and authentication across institutions, and achieves near line-rate throughput with under 2% overhead compared to unencrypted transfers, making it a practical solution for real-time scientific discovery and experiment steering.

Workshop

Streaming X-ray Detector Data to Remote Facilities Using EJFAT

9:25am - 9:50am CST Sunday, 16 November 2025 240

Livestreamed

Recorded

DescriptionPropelled by the increasing need for the near real-time feedback for user experiments on its X-ray beamlines, the Advanced Photon Source continues to investigate the use of streaming workflows, with several of those being successfully deployed on its local computing infrastructure. With ever-growing data volumes and compute resource needs, the ability to analyze beamline data at remote facilities is becoming more and more important.

In this paper we investigate the possibility of using ESnet JLab FPGA Accelerated Transport (EJFAT) project infrastructure to bring X-ray detector data directly from the instrument into an analysis application running at a remote high performance computing center. To that end, we describe successful integration of PvaPy, a Python API for the EPICS PV Access protocol, with the EJFAT software library. We also discuss potential use cases, as well as illustrate system performance in terms of maximum achievable frame and data rates in a test environment.

Workshop

Stretch Break

11:30am - 11:35am CST Monday, 17 November 2025 241

Livestreamed

Recorded

Doctoral Showcase

Structural Equation Modeling for Heterogeneous Platforms

2:15pm - 2:30pm CST Thursday, 20 November 2025 230

Research & ACM SRC Posters

Livestreamed

Recorded

DescriptionHeterogeneous platforms introduce new complexities into performance modeling and prediction. The intrinsic performance asymmetries found in these platforms require radically different approaches to manage the compute diversity of this polymorphic architectural design space. Core heterogeneity augments our traditional notion of compute semantics, making simply defining the number of compute units of a given speed no longer sufficient. Workload adaptation and complexity mitigation is no longer simply a question of volume (number of cores) but also constitution (types of cores). To ensure performance portability of workloads on these platforms requires an understanding of and investigation into the interactions of workloads, compute units, and parameters that create a spectrum of performance opportunities extending across classes of platforms.

Using the Orange Pi 5, a heterogeneous asymmetric multiprocessing platform (AMP), to examine this prolific domain, we embark on a principled analysis journey into the performance implications of classical workloads (i.e., matrix-matrix multiply) on these platforms. We demonstrate techniques enabling complexity mitigation and performance portability across compute unit groups. Finally, by applying structural equation modeling (SEM) to this reference platform, we discover the most critical components impacting performance for these classical workloads, revealing component interactions affecting platform performance, and articulating the impact of parameter effects on platform performance using a novel and unprecedented approach in computer engineering.

Students@SC

Students@SC: Ignite Your Career — Career Panel

3:45pm - 4:45pm CST Sunday, 16 November 2025 263

Not Livestreamed

Not Recorded

TUT

XO/EX

DescriptionFuel your future with HPC at SC25! Join this dynamic and interactive career panel, moderated by Andrekka “AJ” Lanier from Lawrence Livermore National Laboratory, where innovation meets inspiration. In this 60-minute session, students will connect with professionals from government agencies, national labs, and industry who are driving breakthroughs in high performance computing. Discover how HPC ignites possibilities and empowers the next generation to blaze their own trails toward impactful careers. Don’t miss this opportunity to spark your journey in HPC and shape the future!

Students@SC

Students@SC: Ignite Your Career — Portfolio Development for Professional Growth

10:00am - 10:45am CST Monday, 17 November 2025 263

Not Livestreamed

Not Recorded

TUT

XO/EX

DescriptionIgnite your professional potential with our HPC Portfolio Course at SC25! Designed to showcase your skills and achievements in high performance computing, this course will guide you through creating a standout portfolio that sparks interest and opens doors to new opportunities. Whether you're advancing your career or preparing for your next big step, this is your chance to fuel your future in HPC.

Paper

STZ: A High-Quality and High-Speed Streaming Lossy Compression Framework for Scientific Data

3:52pm - 4:14pm CST Thursday, 20 November 2025 261-262-265-266

Data Analytics, Visualization & Storage

Livestreamed

Recorded

DescriptionError-bounded lossy compression is one of the most efficient solutions to reduce the volume of scientific data. For lossy compression, progressive decompression and random-access decompression are critical features that enable on-demand data access and flexible analysis workflows. However, these features can severely degrade compression quality and speed. To address these limitations, we propose a novel streaming compression framework that supports both progressive decompression and random-access decompression while maintaining high compression quality and speed. Our contributions are three-fold: (1) we design the first compression framework that simultaneously enables both progressive decompression and random-access decompression; (2) we introduce a hierarchical partitioning strategy to enable both streaming features, along with a hierarchical prediction mechanism that mitigates the impact of partitioning and achieves high compression quality—even comparable to state-of-the-art (SOTA) non-streaming compressor SZ3; and (3) our framework delivers high compression and decompression speeds, up to 6.7x faster than SZ3.

Workshop

Summary: Hierarchical Framework for Multi-node Compute eXpress Link Memory Transactions

4:42pm - 5:06pm CST Monday, 17 November 2025 240

Livestreamed

Recorded

DescriptionThere is a growing need to support high-volume, concurrent transaction processing on shared data in both high-performance computing and data center environments. A recent innovation in server architectures is the use of disaggregated memory organizations based on the Compute eXpress Link (CXL) interconnect protocol. While CXL memory architectures alleviate many concerns in data centers, enforcing ACID semantics for transactions in CXL memory faces many challenges.
This paper is a summary of a full paper at MEMSYS25, where we describe a novel solution for supporting ACID (Atomicity, Consistency, Isolation, Durability) transactions in a CXL-based disaggregated shared-memory architecture. We call this solution HTCXL for Hierarchical Transactional CXL.
HTCXL is implemented in a software library that enforces transaction semantics within a host, along with a back-end controller to detect conflicts across
hosts. HTCXL is a modular solution allowing different combinations of HTM or software-based transaction management to be mixed as needed.

Birds of a Feather

Super(computing)heroes

12:15pm - 1:15pm CST Thursday, 20 November 2025 261-262-265-266

Community Meetings

Livestreamed

Recorded

XO/EX

DescriptionMembers of underrepresented groups often lack access to role models within their minority. The HPC community is still predominantly male, making it difficult for young women to find female "superheroes" to identify with. Such role models are crucial for career planning and guidance. This session aims to provide especially women with the opportunity to meet influential, well-recognized female HPC "superheroes" from academia, research labs, HPC centers, and industry. Join us to be inspired and find relatable role models as we work together to build a more inclusive and connected HPC community.

Birds of a Feather

Supercharging Battery Modelling: Adapting Mature Codes for Exascale HPC

5:15pm - 6:45pm CST Tuesday, 18 November 2025 126

Applications

Livestreamed

Recorded

XO/EX

DescriptionWith the increasing prevalence of electric vehicles and Net Zero targets, battery simulation is crucial. HPC facilitates accurate simulations of battery electrochemistry and thermal behavior, allowing improved predictions of cell performance, cyclic life, and safety to inform R&D. This BoF will address the challenges of bringing a mature software ecosystem into the exascale and AI era. The BoF will consist of a combined talk from an international group who are engaged in developing battery simulations for exascale applications, followed by an audience discussion and panel session sharing insights, obstacles, and solutions to accelerate the collaborative development of batteries.

Invited Talk

Supercomputing the Skies: How GE Aerospace Is Revolutionizing the Future of Flight

9:20am - 10:00am CST Thursday, 20 November 2025 America's Ballroom Tu-Th

Livestreamed

Recorded

DescriptionThe aviation industry is entering a new era of innovation, with aircraft and engine manufacturers advancing technologies that will shape the future of flying.

GE Aerospace, a leading jet engine maker, is leveraging U.S. Department of Energy exascale supercomputers—cutting-edge tools that enable the rapid development of new jet engine technologies, such as the Open Fan engine architecture.

The Open Fan design eliminates the traditional outer casing, allowing for a larger fan size with reduced drag, which significantly improves fuel efficiency. Historically, designing, developing, and testing a new type of jet engine could take decades. However, high performance computing now accelerates technology development, enabling faster and more efficient designs. This breakthrough allows GE Aerospace to develop the Open Fan engine—targeting over 20% fuel efficiency improvement compared to current engines—within a single generation. This level of efficiency represents an unprecedented milestone for the industry.

Using advanced simulations, GE Aerospace engineers analyze the aerodynamics of an Open Fan mounted on an aircraft wing under simulated flight conditions. These simulations optimize hardware designs for enhanced efficiency, reduced noise, and improved overall performance.

Workshop

Sustainable Supercomputing

2:00pm - 2:01pm CST Sunday, 16 November 2025 264

Livestreamed

Recorded

DescriptionSustainable supercomputing is a pressing topic for our community, industry, and governments. Supercomputing has an ever-increasing need for computational cycles while facing the increasing challenges of delivering performance/Watt advances within the context of climate change, the drive towards net-zero, and geo-political-economic pressures.

Improving supercomputing sustainability provides many opportunities considering an end-to-end, holistic view of the HPC system, facility, site, and broader environment. All elements of the HPC system must be considered, from low-level circuits, up the software stack and beyond to power/cooling systems. The drive towards more sustainable supercomputing requires measurements, metrics, goals, and improvement processes.

This workshop will gather users, researchers, and developers to address the opportunities and challenges of supercomputing sustainability. Topics include, but are not limited to:

• Deployment of supercomputing systems
• Data center efficiency
• Software tools for measuring energy efficiency throughout the supercomputing system
• Standardization of measurement/reporting of key sustainability metrics and emissions

Birds of a Feather

SYCL Is 10! New Features for Heterogeneous Programming in Khronos SYCL

12:15pm - 1:15pm CST Tuesday, 18 November 2025 274

Programming Frameworks

Livestreamed

Recorded

XO/EX

DescriptionFor the past decade, the open-standard SYCL programming model has provided a portable way to program heterogeneous systems across application domains such as fusion energy, molecular dynamics, aerospace, and AI, running on some of the largest GPU-accelerated machines, such as Aurora.

In this BoF, we will bring together the community of everyone using and developing SYCL applications and implementations. We will showcase new features due for release at SC25, and discuss feedback and priorities for SYCL Next. A panel of SYCL experts, runtime/compiler implementers, and application specialists will lead an audience discussion and Q&A.

Research and ACM SRC Posters

Sync-Free GPU Parallelization of Sparse Kernels from Sequential Python Code

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionSparse matrix kernels such as SpMV, SpTRSV, and Gauss-Seidel are critical in scientific computing, AI, and engineering, but they remain difficult to parallelize due to irregular memory access patterns. Traditional compiler techniques assume affine array accesses, which do not hold in sparse formats like CSR and CSC. As a result, existing compilers often leave sparse code under-optimized, missing significant opportunities for parallelism.

We present a sync-free, runtime-based transformation that automates loop parallelization for sparse kernels with loop-carried dependencies. Our approach traces memory reads and writes to construct dependence sets, then generates Triton kernels that use flag arrays to enforce correctness without global synchronization. This method generalizes across sparse kernels by leveraging properties such as associativity and affine simplifications, enabling efficient parallel execution.

We demonstrate our work with sparse triangular solves and related kernels, and will present performance results, methodology, and case studies in the poster session.

Workshop

Synergies between classical and quantum HPC through open-source software

10:30am - 11:00am CST Sunday, 16 November 2025 275

Livestreamed

Recorded

DescriptionThe integration of quantum computing and classical high-performance computing (HPC) powered recent examples of calculations at the utility scale with practical applications. This talk presents two workflows implementing synergies between quantum and classical computing. First, the Quantum Resource Management Interface and its Slurm SPANK plugin expose quantum processing units as HPC resources for seamless scheduling and execution. Second, Qiskit Machine Learning stands out as a robust library for the large-scale experimentation of quantum machine learning workflows across heterogeneous architectures. Together, these case studies illustrate how open-source innovation, supported by public-private initiatives like HNCDI (STFC-IBM), is transforming hybrid quantum-classical computing from proof-of-concepts to scalable assets.

Panel

System and Software Testing for Post-Exascale HPC: Challenges and Opportunities

1:30pm - 3:00pm CST Wednesday, 19 November 2025 240-241-242

Debugging & Correctness Tools

HPC Software & Runtime Systems

Systems Administration and/or Resource Management of HPC Systems

Livestreamed

Recorded

DescriptionHigh performance computing (HPC) and its application software face unique testing challenges due to extreme concurrency, complex systems architecture, and variety of correctness requirements associated with floating point numbers. Ensuring correctness, performance stability, and resilience at scale requires testing not only application codes and libraries but also system software and hardware-software interactions. Despite its critical importance, HPC testing methodologies often lag behind the rapid growth in system and application complexity. This panel brings together experts from system software, programming models, numerical libraries, and large-scale applications to discuss evolving challenges and emerging opportunities in HPC testing. Panelists will share lessons from real-world failures, explore new approaches—including formal verification, large-scale fault injection, and property-based testing—and debate whether testing should become a first-class research and operational priority for future HPC systems. Audience engagement will be encouraged to collaboratively envision how testing practices must evolve to meet the demands of exascale computing and beyond.

Exhibitor Forum

Systematic Thermal Analysis of Two-Phase Direct-to-Chip Cooling: A Framework for Efficient Data Center Thermal Design

11:30am - 12:00pm CST Wednesday, 19 November 2025 130

Software Tools

Livestreamed

Recorded

XO/EX

DescriptionAs HPC systems scale to meet AI workloads, processor power densities are increasing beyond 1,000 W. Two-phase (2P) direct-to-chip (DTC) cooling offers a high-efficiency, reliable, and future-proof alternative to traditional air or single-phase liquid cooling. This talk presents a comprehensive system-level thermal analysis for 2P DTC cooling, breaking down the end-to-end temperature difference (from processor case temperature to facility water supply temperature) into three contributing thermal resistances: cold plate, vapor line pressure drop, and condenser. Each of the contributions are analyzed quantitatively for two different refrigerants: R1233zd(E) and R515B. The thermal stack-up for 2P DTC cooling is compared with single-phase DTC cooling, showing that 2P systems support higher allowable facility water temperatures with high power/heat flux processors, enabling improved energy efficiency. This presentation offers HPC system architects and thermal engineers insights into benchmarking, optimizing, and deploying 2P cooling technologies to meet the thermal challenges of AI and exascale computing.

Paper

T2-RELION: Task Parallelism, Tensor Core Accelerated RELION for Cryo-EM 3D Reconstruction

4:37pm - 5:00pm CST Thursday, 20 November 2025 263-264

Applications

Livestreamed

Recorded

DescriptionCryo-electron microscopy (cryo-EM) is a key technique for structural biology, but its computational efficiency, particularly during 3D reconstruction, remains a bottleneck. We introduce T2-RELION, a highly optimized version of RELION for cryo-EM 3D reconstruction on CPU-GPU platforms. RELION is a widely used open-source package in the cryo-EM community. We identify and resolve key inefficiencies in RELION’s parallelization strategy and memory management by proposing task parallelism and a three-phase GPU memory management strategy.

Furthermore, we leverage Tensor Cores to accelerate the hot-spot kernel for difference calculation, employing an advanced pipelining strategy to hide latency and enable thread block-level data reuse. On a quad-A100 GPU machine, performance evaluations demonstrate that T2-RELION outperforms RELION 4.0. For the hot-spot kernel, our optimizations achieve 1.90-23.7 times speedup. For the whole application using CNG and Trpv1 datasets, we observe 3.86 times and 2.68 times speedups, respectively.

Workshop

Tackling the Challenges of Adding Pulse-level Support to a Heterogeneous HPCQC Software Stack

9:15am - 9:30am CST Friday, 21 November 2025 275

Livestreamed

Recorded

DescriptionWe present an approach to add native pulse-level control to heterogeneous HPCQC stacks, using the Munich Quantum Software Stack (MQSS) as a case study. Pulse programs are captured by three low-level abstractions, that is, ports (I/O channels), frames (reference signals), and waveforms (pulse envelopes). We identify representation challenges at the user-interface, compiler (IR), backend-interface, and exchange-format layers, and propose specific solutions: 1) a compiled C/C++ pulse API to avoid Python overhead, 2) LLVM extensions for pulse instructions, 3) a C-based backend interface to query hardware constraints 4) and a portable pulse-sequence exchange format. The design provides an end-to-end pulse-aware compilation and runtime path for HPC environments and an architectural blueprint to integrate pulse-level operations without disrupting classical workflows.

Paper

TaGNN: An Efficient Topology-Aware Accelerator for High-Performance Dynamic Graph Neural Networks

11:15am - 11:37am CST Tuesday, 18 November 2025 260-267

HPC for Machine Learning

System Software and Cloud Computing

Not Livestreamed

Not Recorded

Partially Livestreamed

Partially Recorded

DescriptionExisting DGNN solutions still suffer from low data parallelism. To address this problem, we propose the topology-aware DGNN accelerator TaGNN. It presents a topology-aware concurrent execution approach in the accelerator design that calculates the final features of affected vertices while ensuring that unaffected vertices are loaded and computed only once per layer across multiple snapshots, maximizing data parallelism while minimizing memory usage. TaGNN develops a similarity-aware cell skipping strategy to selectively reuse the RNN results from the previous snapshot to bypass RNN operations, further improving data parallelism with minimal accuracy loss. TaGNN on a Xilinx Alveo U280 FPGA shows average speedups of 535.2x and 84.3x, and energy savings of 742.6x and 104.9x over state-of-the-art software DGNNs on Intel Xeon CPUs and NVIDIA A100 GPUs, respectively. TaGNN also outperforms DGNN-Booster, E-DGCN, and Cambricon-DG by average speedups of 13.5x, 10.2x, and 6.5x and energy savings of 15.9x, 11.7x, and 7.8x, respectively.

Doctoral Showcase

Taming the Beast of Dynamic Resource Management in HPC

11:15am - 11:30am CST Thursday, 20 November 2025 230

Research & ACM SRC Posters

Livestreamed

Recorded

DescriptionDynamic resource management (DRM) enables the resources assigned to a job to be adjusted during execution. From a system perspective, DRM adds flexibility to resource allocation and job scheduling, with the potential to improve utilization, throughput, energy efficiency, and responsiveness. From an application perspective, it allows users to match resource requests to evolving needs, potentially reducing queue times and costs.

Despite these benefits and a decade of research, DRM remains largely an academic concept in HPC rather than a production feature. This is due to the need for coordinated changes across the entire software stack—applications, programming models, process managers, and resource managers—along with a holistic co-design effort to develop new scheduling and optimization policies.

We present a novel, end-to-end approach to DRM in HPC, introducing generic design principles for parallel programming models that integrate applications’ dynamic process management with the resource managers’ optimization capabilities. We apply these principles across the HPC stack, incorporating standards such as MPI and PMIx, to create a fully dynamic environment supporting diverse applications. This is paired with a performance-aware scheduling strategy based on steepest-ascent optimization.

Experiments on up to 100 nodes show moderate overheads for application process reconfiguration while delivering substantial gains in system throughput and average job turnaround time compared to static scheduling under high-load conditions.

Workshop

TCPP Curriculum Initiative

12:00pm - 12:30pm CST Sunday, 16 November 2025 261

Livestreamed

Recorded

Workshop

Teaching AI Through Narrative Data: A Practical Framework for Data Science and Retrieval‑Augmented Generation

2:25pm - 2:50pm CST Monday, 17 November 2025 232

Education & Workforce Development

Livestreamed

Recorded

DescriptionBy centering the workshop on a single, richly structured dataset, participants gained technical skills while developing deeper data intuition and problem-solving abilities across the AI/ML pipeline. Moving seamlessly from data familiarization to feature engineering, model selection, and evaluation, learners explored how algorithmic choices interact with dataset characteristics and research questions. The integrated hackathon reinforced these concepts, allowing teams to pose their own questions, identify necessary features, select appropriate models, and iterate on solutions within a realistic, end-to-end workflow. This continuity reduced cognitive load, encouraged reflection on successes and failures, and highlighted the trade-offs inherent in different analytical approaches. Together, these outcomes demonstrate how a project-based, single-dataset framework fosters holistic understanding, preparing participants to apply AI/ML methods thoughtfully and effectively. This approach sets the stage for discussing the broader novelty and pedagogical impact of the workshop.

Workshop

Teaching Task-Based Parallel Programming with a Runtime Systems-Aware Perspective

9:35am - 9:50am CST Sunday, 16 November 2025 261

Livestreamed

Recorded

DescriptionConventional parallel programming using explicit multithreading over modern multicore processors imposes significant complexity in organizing and balancing work across threads. Task-based models simplify parallel programming using runtimes that handle task scheduling and resource management, improving scalability and reducing developer effort.

This paper presents the structure and experiences of teaching the Parallel Runtimes for Modern Processors course (PRMP) at IIIT Delhi. The course introduces a basic task-based parallel programming model in the async–finish style. Students implement this programming model together with a general-purpose dynamic load-balancing runtime system. As the course advances, students gradually improve both the parallel programming model and the runtime to overcome limitations and challenges of modern processor architectures. We conclude with a qualitative and quantitative evaluation of the three offerings of PRMP to date, showing that the course has significantly improved student's understanding of how to write and execute parallel programs effectively.

Workshop

TEGRA - Scaling Up Graph Processing with Disaggregated Computing

10:50am - 11:10am CST Sunday, 16 November 2025 265

Livestreamed

Recorded

DescriptionGraph processing workloads continue to grow in scale and complexity, demanding architectures that can adapt to diverse compute and memory requirements. Traditional scale-out accelerators couple compute and memory resources, resulting in resource underutilization when executing workloads with varying compute-to-memory intensities. In this paper, we present TEGRA, a composable, scale-up architecture for large-scale graph processing. TEGRA leverages disaggregated memory via CXL and a message-passing communication model to decouple compute and memory, enabling independent scaling of each. Through detailed evaluation using the gem5 simulator, we show that TEGRA improves memory bandwidth utilization by up to 15\% over state-of-the-art accelerators by dynamically provisioning compute based on workload demands. Our results demonstrate that TEGRA provides a flexible and efficient foundation for supporting emerging graph analytics workloads across a wide range of arithmetic intensities.

Research and ACM SRC Posters

Template Task-Based Multiresolution Analysis in Hybrid Environments

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionWe present a framework that implements multiresolution analysis (MRA) on top of Template Task Graph (TTG), a distributed, task-based data-flow programming model. MRA is broadly applied across scientific domains for its ability to capture both local and global features with high accuracy, and its adaptive tree-based structure maps naturally onto data-flow execution models. TTG addresses central challenges in modern high performance computing by improving programmer productivity and enabling performance portability across heterogeneous architectures. To the best of our knowledge, this ongoing work is the first demonstration of a multiwavelet-based MRA achieving substantial performance gains on GPUs. We will present our work using visual artifacts in the poster to demonstrate the challenges and proposed solution.

Research and ACM SRC Posters

Tensor Core Accelerated Fast Multipole Method for GROMACS

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionThe evaluation of long-range pairwise electrostatic forces is the most computationally intensive component of molecular dynamics (MD) simulations. The fast multipole method (FMM) is an alternative to reduce the computational complexity. In this work, we implemented a hybrid parallel FMM with MPI and GPU acceleration. We further utilized the Tensor Core to accelerate the expensive multipole-to-local (M2L) kernel, and optimized the M2L kernel by reducing redundant computation. Meanwhile, we integrated our FMM into GROMACS. The experimental results show that the use of Tensor Core improves the performance of the M2L kernel, and our FMM implementation, integrated with GROMACS, is effective.

Paper

TENSORMD: Accelerating Molecular Dynamics with a High-Performance Machine Learning Interatomic Potential

10:30am - 10:52am CST Thursday, 20 November 2025 263-264

Applications

Livestreamed

Recorded

DescriptionAI has been integrated into HPC across various scientific fields, significantly enhancing performance. In molecular dynamics simulations, HPC+AI facilitates the investigation of atomic-scale physical properties using machine-learning interatomic potentials (MLIPs). However, general-purpose ML tools (e.g., TensorFlow) used in MLIPs are not optimally matched, leading to missed optimization opportunities due to the higher computational complexity and greater diversity of HPC+AI applications compared to pure AI scenarios. To address this, we introduce TENSORMD, an MLIP independent of existing ML tools, enabling flexible optimizations that standard ML frameworks cannot support. TENSORMD outperforms a state-of-the-art MLIP—winner of the 2020 Gordon Bell Prize and built on an ML tool—by 1.88x on NVIDIA A100 GPU. Additionally, TENSORMD was evaluated on two supercomputers with different architectures, achieving significantly reduced time-to-solution and supporting molecular dynamics simulations at scales beyond 50 billion atoms.

Workshop

Testing and Benchmarking Emerging Supercomputers via the MFC Flow Solver

10:30am - 10:55am CST Friday, 21 November 2025 276

Livestreamed

Recorded

DescriptionDeploying new supercomputers requires testing and evaluation via application codes. Portable, user-friendly tools enable evaluation, and the Multicomponent Flow Code (MFC), a computational fluid dynamics (CFD) code, addresses this need. MFC is adorned with a toolchain that automates input generation, compilation, batch job submission, regression testing, and benchmarking. The toolchain design enables users to evaluate compiler-hardware combinations for correctness and performance with limited software engineering experience. As with other PDE solvers, wall time per spatially discretized grid point serves as a figure of merit. We present MFC benchmarking results for five generations of NVIDIA GPUs, three generations of AMD GPUs, and various CPU architectures, utilizing Intel, Cray, NVIDIA, AMD, and GNU compilers. These tests have revealed compiler bugs and regressions on recent machines such as Frontier and El Capitan. MFC has benchmarked approximately 50 compute devices and 5 flagship supercomputers.

Workshop

The (R)evolution of Scientific Workflows in the Agentic AI Era: Towards Autonomous Science

5:00pm - 5:18pm CST Monday, 17 November 2025 264

Livestreamed

Recorded

DescriptionModern scientific discovery increasingly requires coordinating distributed facilities and heterogeneous resources, forcing researchers to act as manual workflow coordinators rather than scientists. Advances in AI leading to AI agents show exciting new opportunities that can accelerate scientific discovery by providing intelligence as a component in the ecosystem. However, it is unclear how this new capability would materialize and integrate in the real world. To address this, we propose a conceptual framework where workflows evolve along two dimensions which are intelligence (from static to intelligent) and composition (from single to swarm) to chart an evolutionary path from current workflow management systems to fully autonomous, distributed scientific laboratories. With these trajectories in mind, we present an architectural blueprint that can help the community take the next steps towards harnessing the opportunities in autonomous science with the potential for 100x discovery acceleration and transformational scientific workflows.

Workshop

The 11th International Workshop on Data Analysis and Reduction for Big Scientific Data

9:00am - 9:05am CST Monday, 17 November 2025 265

Livestreamed

Recorded

DescriptionIn this new exascale computing era, applications must increasingly perform online data analysis and reduction — tasks that introduce algorithmic, implementation, and programming model challenges unfamiliar to many scientists and with major implications for the design and use of various elements of exascale systems. There are at least three important topics that this workshop is striving to address: (1) whether several orders of magnitude of data reduction is possible for exascale sciences; (2) understanding the performance and accuracy trade-off of data reduction; and (3) solutions to effectively reduce data while preserving the information hidden in large scientific datasets. Tackling these challenges requires expertise from computer science, mathematics, and application domains to study the problem holistically and develop solutions and robust software tools.

Workshop

The 12th Annual International Workshop on Innovating the Network for Data Intensive Science (INDIS)

9:00am - 9:10am CST Sunday, 16 November 2025 266

Livestreamed

Recorded

DescriptionThis workshop aims to enhance the computer networking research track within the SC and HPC communities, highlighting innovations in high-performance networking, networking testbeds, and integrated research infrastructure. A particular focus for INDIS 2025 is the emergence of intelligent networking within advanced cyber-infrastructure, including the use of AI, automation, and in-network telemetry and edge services. Moreover, the workshop aims to foster dialogue among experimental demonstrators and developers of production-quality services, while promoting the reproducibility and wide adoption of the showcased research. This workshop also brings together participants in SCinet’s Network Research Exhibitions (NRE) and Experimental Networks of the Future (XNet) teams to present papers on their latest innovations, designs, and solutions, and to showcase the next generation of networking challenges and solutions for HPC.

Exhibitor Forum

The AI-Ready Data Center: Architecting for the Demands of Modern HPC and AI Workloads

3:30pm - 4:00pm CST Tuesday, 18 November 2025 130

Data Analytics

Livestreamed

Recorded

XO/EX

DescriptionThis presentation provides a blueprint for designing and building an "AI-ready" data center. We will dissect the modern AI and HPC data stack, from the physical infrastructure to the application layer. Learn about the key hardware and architectural considerations for supporting large-scale AI workloads, the role of a multi-vendor ecosystem, and the critical decision between cloud and on-premise deployments.

Workshop

The Convergence of Containers and Kubernetes with HPC and Generative AI

9:15am - 10:00am CST Monday, 17 November 2025 275

Livestreamed

Recorded

DescriptionDuring this talk, Wes and Kush will discuss experiences with creative implementations of containers and orchestrators like Kubernetes with traditional batch schedulers like Slurm through Slinky, but also newer schedulers with integration built-in such as Flux from LLNL.

We will also highlight some of the ways that LLMs can assist scientists and users with both containerization and running across a variety of schedulers, inclusive of MCP and IDE integrations.

The overall theme of our presentation is to highlight the benefits of containerization with HPC, but to also discuss some of the ideas and implementations around how to best add generative AI into both of these converging domains.

Panel

The Cool Factor: Rising Above the Heated Innovation Race

8:30am - 10:00am CST Friday, 21 November 2025 240-241-242

Cloud, Data Center, & Distributed Computing

SC Community Hot Topics

Livestreamed

Recorded

TUT

XO/EX

DescriptionThis panel discussion will focus on how both the legacy and the next-gen tech of supercomputing pioneers enables future applications (such as cloud HPC, GenAI, etc.).

The interactive panel will recount the evolution of liquid cooling in supercompute and the decades of research, expertise, and lessons learned that have now, in turn, enabled the rapid deployment of liquid cooling for AI.

Leveraging industry knowledge and hand-on experience, the panel will share best practices for those looking to introduce liquid into application, and views on risks and opportunities for the future.

Panelists will provide insight into their learnings; "hot takes" on where technology is heading—including future requirements in terms of performance and reliability; and advice for attendees on how to overcome tomorrow's challenges for success at scale.

Workshop

The Cost of Teaching Operational ML

10:45am - 11:00am CST Sunday, 16 November 2025 261

Livestreamed

Recorded

DescriptionOperational machine learning (ML) requires skills beyond model development, including infrastructure provisioning, large-scale training across clusters, model deployment with consideration of operational performance, monitoring, and automation - capabilities grounded in high-performance computing and distributed systems. This paper presents the design and infrastructure requirements of a graduate-level course on ML Systems Engineering and Operations, aimed at equipping students with these skills. Using 186,692 total compute instance hours on the Chameleon Cloud testbed, students built end-to-end ML pipelines incorporating distributed training, reproducible experiment tracking, automated re-training and re-deployment, and continuous monitoring. We analyze compute usage across assignments, compare expected versus actual resource consumption, and estimate that replicating the course on commercial cloud platforms would cost approximately $250 per student (almost $50,000 for our course with enrollment of 191 students).
All course materials are publicly available for reuse.

Workshop

The Eleventh Computational Approaches for Cancer Workshop (CAFCW25)

9:00am - 9:05am CST Monday, 17 November 2025 241

Livestreamed

Recorded

DescriptionThe importance of high performance computing is ever increasing as a critical component of cancer research and clinical applications. The current global cancer ecosystem includes new scientific methods, AI, ever expanding sources of data, and use of simulations. These dynamic changes have set the stage for tremendous growth in HPC for cancer research and clinical application, particularly with the U.S. Cancer Moonshot 2.0 initiative, which aims to reduce mortality of cancer by 50% in 25 years. Originally established as part of SC15 during the advent of the Precision Medicine Initiative and the U.S. National Strategic Computing Initiative, this workshop provides a key venue for multiple disciplines and interests to converge, share insights, and develop collaborations in which HPC and computational approaches will advance the frontiers of cancer research and cancer care.

Workshop

The First International Workshop for Software Frameworks and Workload Management on Quantum-HPC Ecosystems

8:30am - 8:31am CST Friday, 21 November 2025 275

Livestreamed

Recorded

DescriptionThis workshop focuses on the development of software frameworks and workload management strategies that are crucial for quantum-HPC (Q-HPC) ecosystems. As quantum computing progresses, integrating quantum processors with HPC systems presents significant opportunities to tackle complex, large-scale problems. Experts from academia, industry, and national labs will discuss the challenges of managing hybrid resources, along with cutting-edge research on middleware, scheduling algorithms, decomposition strategies, and benchmarking methodologies for Q-HPC systems. The workshop will include keynote talks, paper presentations, panel discussions, and interactive demos to foster collaboration and advance the state of hybrid computing. By the end of the workshop, attendees will have gained valuable insights into best practices, emerging technologies, and future directions in Q-HPC integration, contributing to the broader goal of making quantum computing a practical extension of HPC environments.

Paper

The First Star-by-Star $N$-body/Hydrodynamics Simulation of Our Galaxy Coupling with a Surrogate Model

1:30pm - 1:52pm CST Thursday, 20 November 2025 263-264

Applications

GBC

Livestreamed

Recorded

DescriptionA major goal of computational astrophysics is to simulate the Milky Way galaxy with sufficient resolution, down to individual stars. However, the scaling fails due to some small-scale, short-timescale phenomena, such as supernova explosions. We have developed a novel integration scheme of $N$-body/hydrodynamics simulations working with machine learning. This approach bypasses the short timesteps caused by supernova explosions using a surrogate model, thereby improving scalability. With this method, we reached 300 billion particles using 148,900 nodes, equivalent to 7,147,200 CPU cores, breaking through the billion-particle barrier currently faced by state-of-the-art simulations. This resolution allows us to perform the first star-by-star galaxy simulation, which resolves individual stars in the Milky Way galaxy. The performance scales over $10^4$ CPU cores, an upper limit in the current state-of-the-art simulations using both A64FX and X86-64 processors and NVIDIA CUDA GPUs.

Birds of a Feather

The Future of NSF-Supported Advanced Cyberinfrastructure

5:15pm - 6:45pm CST Tuesday, 18 November 2025 261-262-265-266

Democritization of HPC

Livestreamed

Recorded

XO/EX

DescriptionThe U.S. National Science Foundation's vision and investment plans for cyberinfrastructure (CI) are designed to address the evolving needs of the science and engineering research community. Senior leadership and program staff from NSF’s Office of Advanced Cyberinfrastructure (OAC) will discuss OAC's vision, strategic and national priorities, as well as highlights from the latest funding opportunities across all aspects of the research cyberinfrastructure ecosystem. Substantial time will be devoted to audience Q&A between attendees and NSF staff and unstructured time to meet informally with NSF staff.

Birds of a Feather

The Future of Open Interconnects for AI

12:15pm - 1:15pm CST Thursday, 20 November 2025 276

Architectures & Networks

Livestreamed

Recorded

XO/EX

DescriptionThe rapid evolution of AI workloads is pushing the boundaries of interconnect design, requiring innovative approaches to balance performance, scalability, and efficiency. This BoF will explore cutting-edge advancements in AI interconnects and their impact on performance optimization.

Discussion topics include the evolving role of open standards in AI infrastructure, scale-out vs. scale-up networking solutions for AI workloads, innovative interconnect designs that redefine performance limits, and future-proofing AI networks: what’s next?

Attendees will engage in a forward-looking discussion with industry leaders and pioneering AI infrastructure startups that will tackle the most pressing questions shaping the future of AI interconnects.

Birds of a Feather

The Future of Python on HPC Systems

12:15pm - 1:15pm CST Wednesday, 19 November 2025 125

Standards

Livestreamed

Recorded

XO/EX

DescriptionPython is the most widely used programming language today—but on HPC systems, it’s often more pain than power. From slow loading on shared filesystems, to packaging headaches and incompatible frameworks, these hurdles block new users and frustrate seasoned developers alike. Meanwhile, major changes to Python’s core—Array APIs, packaging, JIT compilation, and threading—are happening without HPC at the table. This BoF brings together language leaders and HPC practitioners to change that. Join us to shape Python’s future, learn what’s coming, connect with peers, and help launch a new HPC working group to influence Python from the inside out.

Birds of a Feather

The Green500: Trends in Energy-Efficient Supercomputing

5:15pm - 6:45pm CST Wednesday, 19 November 2025 263-264

Community Meetings

Livestreamed

Recorded

XO/EX

DescriptionWith power being a first-order design constraint on par with performance, it is important to measure and analyze energy-efficiency trends in supercomputing. To raise the awareness of greenness as a first-order design constraint, the Green500 seeks to characterize the energy efficiency of supercomputers for different metrics, workloads, and methodologies. This BoF discusses trends across the Green500 and highlights from the current Green500 list. In addition, the Green500, Top500, and Energy Efficient HPC Working Group have been working together on improving power-measurement methodology, and this BoF presents recommendations for changes that will improve ease of submission without compromising accuracy.

Research and ACM SRC Posters

The Impact of Maximum Vector Length on Cache Management Techniques in RISC-V Vector Extension

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionIn recent years, the RISC-V vector extension (RVV) has attracted increasing attention. The RVV allows programs to be executed on processors with various maximum vector lengths (MVLs). Consequently, even when running the same program, the memory access pattern may vary depending on the MVL of the processor, potentially leading to changes in the optimal cache management technique. In this poster, we focus on replacement policies and the use of non-temporal hints. We execute the same program on processors with different MVLs and compare multiple cache management techniques. The results demonstrate that the optimal cache management technique can vary with the MVL. This finding highlights the necessity of selecting cache management techniques while taking the MVL into account when utilizing RVV. In the poster session, we will explain these research findings using charts that illustrate the performance results.

Birds of a Feather

The Intersection of HPC and AI Infrastructure

5:15pm - 6:45pm CST Tuesday, 18 November 2025 240-241-242

State of the Practice

Livestreamed

Recorded

XO/EX

DescriptionThe convergence of traditional HPC simulations and large-scale AI is reshaping data center infrastructure. This session addresses the unique challenges of supporting both modeling and machine learning workloads on a unified platform. Key topics include navigating the demands of specialized hardware like GPUs and custom accelerators, managing complex software stacks with containers and workflow managers, and optimizing high-performance storage for diverse I/O patterns. We'll also discuss scheduling strategies for fair and efficient resource allocation. Join this interactive discussion to share your experiences, challenges, and solutions for building and managing a truly converged HPC and AI infrastructure

Workshop

The LCLStream Ecosystem for Multi-Institutional Dataset Exploration

9:50am - 10:00am CST Sunday, 16 November 2025 240

Livestreamed

Recorded

DescriptionWe describe a new end-to-end experimental data streaming framework designed from the ground up to support new types of applications – AI training, extremely high-rate X-ray time-of-flight analysis, crystal structure determination with distributed processing, and custom data science applications and visualizers yet to be created. Throughout, we use design choices merging cloud microservices with traditional HPC batch execution models for security and flexibility. This project makes a unique contribution to the DOE Integrated Research Infrastructure (IRI) landscape. By creating a flexible, API-driven data request service, we address a significant need for high-speed data streaming sources for the X-ray science data analysis community. With the combination of data request API, mutual authentication web security framework, job queue system, high-rate data buffer, and complementary nature to facility infrastructure, the LCLStreamer framework has prototyped and implemented several new paradigms critical for future generation experiments.

Workshop

The MALL is Open: Exploring Shared Caches and Latency in AMD CDNA™ 3 GPUs

3:50pm - 4:10pm CST Sunday, 16 November 2025 274

Livestreamed

Recorded

DescriptionThis paper presents an analysis of memory hierarchy latency across AMD Instinct™ MI300A, MI300X, and MI250X GPUs using a fine-grained pointer-chasing microbenchmark. We characterize the scalar L1 (sL1), L2, AMD Infinity Cache™ referred to as the MALL (Memory Attached Last Level), and HBM (High Bandwidth Memory), revealing distinct latency levels and architectural trade-offs. MI300A and MI300X, based on the CDNA3 architecture, exhibit nearly identical latency profiles, while MI250X lacks a MALL, resulting in different performance characteristics. Memory latency remains consistent across compute partitioning modes, but NUMA Partitioning per Socket (NPS) significantly impacts performance. In NPS4 mode, partitioning improves locality, reducing latency by up to 1.42× in MALL and 1.31× in HBM. We further analyze MALL contention and Translation Lookaside Buffer (TLB) behavior under varying parallelism levels, identifying conditions where MALL performance degrades. These findings provide actionable insights for optimizing memory access patterns and improving performance on AMD’s latest GPU architectures.

SCinet Network Research Exhibition

The National Research Platform and SCinet: Enabling Live, Multi-Institutional Scientific AI/ML and HPC Workflows

12:00pm - 12:20pm CST Wednesday, 19 November 2025 Booth 3537 - SCinet Theater

Not Livestreamed

Not Recorded

DescriptionNRI103,NRI104,NRI106

Birds of a Feather

The Next Wave of Confluence: AI-Driven Workflows in HPC and Cloud

12:15pm - 1:15pm CST Tuesday, 18 November 2025 260-267

Artificial Intelligence & Machine Learning

Livestreamed

Recorded

XO/EX

DescriptionThe confluence of HPC, AI, and cloud is entering a new phase, catalyzed by the integration of AI and its influence on HPC hardware. Scientific workflows are evolving to treat simulation, AI, and analytics as a a deeply connected continuum. In this session, we’ll explore how the need to incorporate large-scale AI models and agentic systems—from training on scientific data to real-time inferencing—is creating new patterns of hybrid HPC-cloud usage. We will also discuss the architectural, software, and policy challenges and opportunities facing HPC professionals in a world where AI and Cloud are now driving the evolution of high performance computing infrastructure.

Panel

The Quantum Era of HPC: Roadmaps, Challenges, and Opportunities in Navigating the Integration Frontier

10:30am - 12:00pm CST Wednesday, 19 November 2025 231-232

Architectures

Performance Evaluation, Scalability, & Portability

SC Community Hot Topics

Livestreamed

Recorded

DescriptionWith Moore's law nearing its end, hardware specialization is becoming crucial for continued HPC advancement. Quantum computing (QC) is gaining significant attention due to its potential to dramatically accelerate diverse HPC workloads, from materials science and chemistry to high-energy physics and cryptography, to name a few. However, with hardware and respective software infrastructure under development, when and how to integrate QC into the HPC ecosystem poses fundamental challenges—from logistics to support. Realizing quantum-accelerated HPC requires closely integrating QC with classical CPU- and GPU-accelerated computing and carefully considering this new computing paradigm's unique advantages and limitations. Together, HPC centers, quantum hardware providers, and quantum software specialists are collaborating to solve integration challenges in middleware, job schedulers, and co-processing models to develop tools for workflow management, and also address the growing need for user awareness and training.

Workshop

The quest for high capacity, long reach, industrial strength networks

3:50pm - 4:20pm CST Sunday, 16 November 2025 266

Livestreamed

Recorded

Exhibitor Forum

The Road to Large-Scale Fault-Tolerant Quantum Computers

4:00pm - 4:30pm CST Wednesday, 19 November 2025 130

Quantum & Other Post Moore Computing Technologies

Livestreamed

Recorded

XO/EX

DescriptionNeutral-atom quantum computing presents a compelling path toward scalable, fault-tolerant systems, with advantages in qubit count, reconfigurable connectivity, and manufacturability. This presentation details QuEra’s roadmap from today’s processors toward large-scale error-corrected architectures capable of deep integration with high performance computing (HPC) environments. We outline progress in increasing physical qubit numbers, implementing mid-circuit measurement and feed-forward, and developing modular architectures for millions of qubits. Emphasis will be placed on resource estimates for logical qubits, error-correction overheads, and co-design approaches that align quantum hardware evolution with HPC workflows. By exploring use cases in quantum simulation, optimization, and machine learning, we highlight how fault-tolerant quantum systems can augment classical supercomputers, accelerating solutions to problems beyond classical reach. The talk will map a technically credible path from today’s systems to tomorrow’s exascale-class quantum-HPC hybrid platforms.

Exhibitor Forum

The Role of Accelerated Computing and AI Physics in Enabling the Next Industrial Revolution

11:30am - 12:00pm CST Tuesday, 18 November 2025 130

Data Analytics

Livestreamed

Recorded

XO/EX

DescriptionIndustrial engineering is undergoing a significant transformation, propelled by advancements in accelerated computing, AI physics, and digital twins. This presentation will detail how these hardware and software innovations are helping to compress engineering design cycles and accelerate research and development.

We will deep-dive into mixed-precision techniques, new ML architectures, emerging quantum computing applications and the evolution of GPU hardware. Through real-world use cases from the automotive, aerospace, and semiconductor industries, this session will provide the HPC community with actionable insights into how AI-driven simulation and accelerated computing are fundamentally reshaping the industrial engineering landscape.

Workshop

Third International Workshop on HPC Testing and Evaluation of Systems, Tools, and Software (HPCTESTS 2025)

8:30am - 8:33am CST Friday, 21 November 2025 276

Livestreamed

Recorded

DescriptionThis workshop brings together HPC researchers, operators, and vendors from around the globe to present and discuss state-of-the-art HPC system testing methodologies, tools, benchmarks, procedures, and best practices. The increasing complexity of HPC architectures and growing need to leverage HPC for integrated workflows necessitates more extensive testing than ever in order to thoroughly evaluate the status of the system after installation or a software upgrade and ensure proper operation before it is transitioned to production. Different methodologies are used to evaluate systems during their lifetime, not only at the beginning during the installation, but also during maintenance windows and alongside regular operations. This workshop provides a venue to present and discuss the latest HPC system testing technologies and methodologies. The event will include an opening talk focused on current HPC system testing topics, followed by a series of paper presentations from peer-reviewed accepted submissions, and concluding with a panel discussion.

Paper

ThirstyFLOPS: Water Footprint Modeling and Analysis Toward Sustainable HPC Systems

10:30am - 10:52am CST Wednesday, 19 November 2025 263-264

Energy Efficiency

Performance Measurement, Modeling, & Tools

Power Use Monitoring & Optimization

State of the Practice

Livestreamed

Recorded

DescriptionHigh performance computing (HPC) systems are becoming increasingly water-intensive due to their reliance on water-based cooling and the energy used in power generation. However, the water footprint of HPC remains relatively underexplored—especially in contrast to the growing focus on carbon emissions. In this paper, we present ThirstyFLOPS, a comprehensive water footprint analysis framework for HPC systems. Our approach incorporates region-specific metrics, including water usage effectiveness, power usage effectiveness, and energy water factor, to quantify water consumption using real-world data. Using four representative HPC systems—Marconi, Fugaku, Polaris, and Frontier—as examples, we provide implications for HPC system planning and management. We explore the impact of regional water scarcity and nuclear-based energy strategies on HPC sustainability. Our findings aim to advance the development of water-aware, environmentally responsible computing infrastructures.

Workshop

Threads of Trouble: Unveiling GPU Software and Hardware Security Flaws

3:30pm - 4:15pm CST Sunday, 16 November 2025 242

Livestreamed

Recorded

DescriptionModern computing systems face significant security challenges. While vulnerabilities in CPUs have been extensively studied, GPUs--an increasingly important component of today's computing platforms--have received much less attention. In this talk, I will present our recent studies that aim to bridge this gap. In the first part, I will discuss our findings on GPU memory management systems and demonstrate how weaknesses in their design can be exploited to compromise GPU applications and, in some cases, even CPU applications. In the second part, I will introduce hardware side channels on modern GPUs and show that, despite the adoption of hardware isolation mechanisms, powerful side-channel attacks can still be launched, which pose serious privacy risks to applications such as video games. Finally, I will conclude the talk with a brief discussion of potential countermeasures and directions for future research in GPU security.

Paper

TianheEngine: Hierarchy-Aware Adaptive Partitioning System for Trillion-Scale Graph Processing

11:38am - 12:00pm CST Thursday, 20 November 2025 275

HPC for Machine Learning

State of the Practice

System Software and Cloud Computing

Livestreamed

Recorded

DescriptionGraph partitioning is essential for effectively processing large-scale graphs in distributed computing systems. However, traditional graph partitioning strategies frequently lead to elevated communication costs, particularly within distributed computing systems that utilize thousands of computing nodes. This is because prior partitioning methods fail to consider the variations in communication costs across the communication hierarchies. We propose TianheEngine for leveraging the communication hierarchy among distributed computing systems containing thousands of computing nodes. TianheEngine introduces an adaptive, communication hierarchy-aware methodology to partition and distribute large graphs across computing nodes. It exploits the communication hierarchy of the underlying distributed computing system and the sparsity characteristics of the input graphs to improve communication efficiency. We evaluated TianheEngine on fundamental graph operations using both synthetic and real-world datasets. Experimental results show that TianheEngine is superior to state-of-the-art graph partitioning methods and parallel graph systems and outperforms top-ranked systems on the latest Graph 500 list.

Research and ACM SRC Posters

TidalMark: A Scalable Benchmark for Coastal Water Level Forecasting

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionAccurate forecasting of water levels is essential for flood mitigation. Traditionally, predictions have been based on harmonic analysis and sensor networks maintained by the National Oceanographic and Atmospheric Administration. However, these methods struggle with high-variance events that change water levels from the long-term tidal baseline. TidalMark evaluates the ability of a variety of deep learning models to model these high-variance events. Through extensive hyperparameter sweeps and comparisons across model variants, we have evaluated trade-offs in accuracy, generalization, and scalability. Our results show that properly tuned machine learning models consistently outperform the scientific-standard harmonic approaches between 2.1X and 4.7X (between one- to seven-day predictions) with the goal towards achieving adaptive, scalable, and accurate forecasting of coastal water levels.

Early Career

Time Management

11:30am - 12:00pm CST Monday, 17 November 2025 262

Not Livestreamed

Not Recorded

TUT

XO/EX

DescriptionThis session will cover systems and tools for time and task management in a fast-paced workplace. Participants will learn from others and share what works for them.

Research and ACM SRC Posters

Time-Stepping Hamiltonian Simulation for Solving Nonlinear PDEs via a Quantum-Classical Hybrid Approach

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionThis work presents a time-stepping Hamiltonian simulation framework for nonlinear PDEs on a hybrid quantum–classical approach. Using warped phase transform (WPT)–based Schrödingerization, spatial discretizations are reformulated as Hermitian/anti-Hermitian operators for Schrödinger-type equations, enabling unitary propagation even for dissipative systems. In contrast to traditional linearizations (Carleman, KvN) that cause exponential statevector growth and truncation errors on NISQ hardware, we update nonlinear terms classically at each step, incorporate into the Hamiltonian, and calculate it by unitary evolution on the quantum circuit. This local linear approximation over small time intervals prevents dimensional inflation while securing calculation accuracy. We implement the framework in Qiskit and evaluate it with the Qiskit Aer statevector simulator on linear advection–diffusion and nonlinear problems including Burgers and Allen–Cahn phase field models. The results show good agreement with classical solutions, highlighting its potential for efficiently simulating nonlinear dynamics without dimensional inflation.

Workshop

To Stream or Not to Stream: Towards A Quantitative Model for Remote HPC Processing Decisions

12:10pm - 12:20pm CST Sunday, 16 November 2025 266

short paper

Livestreamed

Recorded

DescriptionModern scientific instruments generate data at rates that increasingly outpace local compute capabilities, making traditional file-based workflows inadequate for time-sensitive analysis and experimental steering. Real-time streaming frameworks promise lower latency and improved efficiency, but lack a principled feasibility assessment. We introduce a quantitative framework and accompanying Streaming Speed Score to evaluate if remote high-performance computing (HPC) resources can provide timely data processing compared to local alternatives. Our model incorporates key parameters including data generation rate, transfer efficiency, remote processing power, and file I/O overhead to compute total processing completion time (Tpct) and identify regimes where streaming is beneficial. We validate our approach through case studies from facilities such as APS, FRIB, LCLS-II, and the LHC. Our measurements show streaming can achieve up to 97% lower end-to-end completion time than file-based methods under high data rates, while worst-case congestion can increase transfer times by over an order of magnitude.

Workshop

To Virtualize or Not to Virtualize: Experiences from Building Two Generations of Virtualized Infrastructure for LLM Training

11:30am - 11:40am CST Monday, 17 November 2025 275

Livestreamed

Recorded

DescriptionLarge Language Model (LLM) training workloads share computational characteristics with high-performance computing applications, requiring intensive parallel processing, complex matrix operations, and distributed computing with frequent synchronization -- requiring specialized hardware to deliver optimal performance.

This talk presents insights from Vela, a cloud-native system architecture introduced in 2021 for LLM training using commercial hardware and open-source software. The Vela architecture combines off-the-shelf hardware, Linux KVM virtualization with PCIe passthrough, and virtualized RDMA over Converged Ethernet networks. The system employs software-defined networking with SRIOV technology for GPU Direct RDMA, achieving near-bare-metal performance while maintaining virtualization benefits.

Based on multiple data center deployments and iterations, we present two case studies examining what it takes for virtualization-based systems to deliver (a) bare-metal RoCE-like performance and (b) bare-metal InfiniBand-like performance for LLM training workloads. The discussion focuses on virtualization challenges, experiences, and runtime optimizations required for optimal performance in cloud-native training infrastructure.

Tutorial

Tools To Detect and Diagnose Floating-Point Errors in Heterogeneous Computing Hardware and Software

8:30am - 12:00pm CST Sunday, 16 November 2025 121

Livestreamed

Recorded

TUT

DescriptionHigh performance computing and machine learning applications increasingly rely on mixed-precision arithmetic on CPUs and GPUs for superior performance. However, this shift introduces several challenging numerical issues such as increased round-off errors, and INF and NaN exceptions that can render the computed solutions useless. At present, this places a heavy burden on developers, interrupting their work while they diagnose these problems manually. This tutorial presents three tools that target specific issues leading to floating-point bugs. First, we present FPChecker, which not only detects and reports INF/NaN exceptions in parallel and distributed CPU codes, but also tells programmers about the exponent value ranges for avoiding exceptions while also minimizing rounding errors. Second, we present GPU-FPX, which detects floating-point exceptions generated by NVIDIA GPUs, including their Tensor Cores via a "nixnan" extension to GPU-FPX. Third, we present FloatGuard, a unique tool that detects exceptions in AMD GPUs. The tutorial is aimed at helping programmers avoid exception bugs; for this, we will demonstrate our tools on simple examples with seeded bugs. Attendees may optionally install and run our tools. The tutorial also allocates question/answer time to address real situations faced by the attendees.

Birds of a Feather

TOP500 Supercomputers

5:15pm - 6:30pm CST Tuesday, 18 November 2025 America's Ballroom Tu-Th

Community Meetings

Livestreamed

Recorded

XO/EX

DescriptionThe TOP500 list of supercomputers serves as a “Who’s Who” in the field of high performance computing (HPC). It started as a list of the most powerful supercomputers in the world and has evolved to a major source of information about trends in HPC. The 66th TOP500 list will be published in November 2025, just in time for SC25.

This BoF will present detailed analyses of the TOP500 and discuss the changes in the HPC marketplace during the past years. The BoF is meant as an open forum for discussion and feedback between the TOP500 authors and the user community.

Workshop

Toward Practical HPC Education with MCJ-CloudHub

4:36pm - 4:42pm CST Sunday, 16 November 2025 261

Livestreamed

Recorded

DescriptionIn higher education, there is a growing need for reproducible exercise environments that facilitate teaching of HPC and data science. While JupyterHub can provide students with a consistent environment, challenges remain in operating multiple exercises simultaneously and in building and maintaining the JupyterHub system itself. To address these issues, we have developed MCJ-CloudHub, which is designed to support the concurrent operation of multiple exercises through
integration with Moodle and JupyterHub and to enable exercises that use GPU computing. Using Virtual Cloud Provider (VCP) technology, MCJ-CloudHub can be deployed flexibly across both on-premises and cloud environments.

Research and ACM SRC Posters

Towards a GPU-Accelerated Web-Based Graph Rendering Framework for Large-Scale Protein Networks

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionWe present a WebGPU-based framework for real-time visualization of large-scale protein–protein interaction (PPI) networks directly in standard browsers. Built on GraphWaGu, our extended graph rendering API integrates GPU-accelerated force-directed layout computation with dynamic edge filtering and degree-based visual encoding. Users can adjust parameters such as confidence thresholds, iterations, and cooling factors, with layout updates. This approach sustains high frame rates for networks with millions of edges, mitigates the hairball effect, and enables biologists to explore complex PPI networks efficiently, intuitively, and without specialized software installation.

Workshop

Towards a user-centric HPC-QC environment

9:30am - 9:45am CST Friday, 21 November 2025 275

Livestreamed

Recorded

DescriptionRobust execution environments for quantum computing can aid the industry with key challenges like application development, portability, and reproducibility, and help unlock the development of more modular quantum program, driving forward hybrid quantum workflows.

In this work, we show progress towards a basic, but portable, runtime environment for developing and executing hybrid quantum-classical programs running in High Performance Computing environments enhanced with Quantum Processing Units (QPUs). The middleware includes a second layer of scheduling after the main HPC resource manager in order to improve the utilization of the QPU, and extra functionality for observability, monitoring, and admin access.

We show how this allows us to manage several programming Software Development Kits as first-class citizens in the environment by building on a recently proposed vendor-neutral Quantum Resource Management Interface (QRMI). Lastly, we discuss and show a solution for the monitoring and observability stack, completing our description of the hybrid system architecture.

Workshop

Towards AI-Driven Interfaces for Scientific Data Management

4:00pm - 4:05pm CST Monday, 17 November 2025 230

Data Analytics

High Performance I/O, Storage, Archive, & File Systems

Storage

Livestreamed

Recorded

DescriptionScientific data management requires researchers to navigate fragmented toolchains spanning data gathering, resource allocation, application deployment, and analysis. While Large Language Models offer natural language interfaces for HPC tasks, existing approaches suffer from system-specific training dependencies and lack standardized tool integration. We present IOWarp-mcps, a comprehensive suite of Model Context Protocol (MCP) tools enabling AI-driven scientific data management across complete workflows. Our framework addresses large-scale scientific datasets through two core principles: chunked I/O access for memory-efficient data partitioning and label-based filtering for selective data reduction before model ingestion. We evaluate IOWarp-mcps across three scenarios: automated dataset discovery from the National Data Platform, molecular dynamics trajectory analysis from LAMMPS simulations, and parallel I/O benchmark deployment. Results demonstrate significant productivity improvements, with configuration tasks reduced from 5-10 minutes to approximately one minute. IOWarp-mcps bridges the gap between conversational AI and scientific computing, providing intuitive interfaces for complex data management operations.

Workshop

Towards an Automated Workflow for Floating-Point Analysis of GPU Kernels

3:55pm - 4:25pm CST Monday, 17 November 2025 261

Debugging & Correctness Tools

HPC Software & Runtime Systems

Livestreamed

Recorded

DescriptionWe present early work towards establishing an automated workflow for numerical error analysis of SYCL kernels. The method leverages the OpenCL Intercept Layer to record the execution of GPU-targeted SYCL kernels and replay them with CPU runtimes, enabling detailed floating-point error evaluation without modifying the original application. We analyze the force kernel from a large-scale cosmology application code (HACC) using PoCL and Verificarlo, exploring both IEEE-compliant configurations and reduced-precision modes (\eg, FP16, TF32, BF16), as well as the effects of the \texttt{-ffast-math} compiler optimization using stochastic arithmetic via MCA and PRISM. By leveraging open tools and standards, this work contributes a reusable path toward broader adoption of numerical accuracy evaluation in HPC.

Research and ACM SRC Posters

Towards Application Agnostic HPC Profiling

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionModern HPC systems generate large amounts of GPU and network telemetry, typically used for system health monitoring. At NERSC, we are developing a Performance API/UI that generates a job report card from this telemetry, providing an overview of performance characteristics. Using DCGM counters, we report GPU memory, compute, and power usage, and present preliminary investigations of job-level network activity. Without traditional profiling tools, this application-agnostic approach helps identify resource utilization imbalances, detect anomalies such as memory leaks, and assess overall performance for the user without additional effort.

Paper

Towards Efficient LLM Inference via Collective and Adaptive Speculative Decoding

11:15am - 11:37am CST Wednesday, 19 November 2025 261-262-265-266

Applications

Architectures & Networks

HPC for Machine Learning

Livestreamed

Recorded

DescriptionEfficient LLM inference remains challenging due to the autoregressive decoding process, which generates only one token at a time. Speculative decoding has been introduced to address the limitation by using small speculative models (SSMs) to speed up LLM inference. However, the low acceptance rate of SSMs and the high verification cost of LLMs prohibit further performance improvement. In this paper, we present Smurfs, an LLM inference system designed to accelerate LLM inference through collective and adaptive speculative decoding. Smurfs adopts a majority-voted mechanism that harnesses multiple SSMs to collaboratively predict LLM outputs in multi-task scenarios. It also decouples SSM speculation from LLM verification and uses a pipelined execution to reduce the latency of SSM speculation. Additionally, Smurfs proposes a mechanism to dynamically determine the optimal speculation length of SSM at runtime. The experimental results demonstrate the superiority of Smurfs in terms of inference throughput and latency compared to state-of-the-art systems.

Workshop

Towards Efficient Load Balancing BFS on GPUs: One Code for AMD, Intel & Nvidia

11:00am - 11:30am CST Monday, 17 November 2025 266

Livestreamed

Recorded

DescriptionEfficient graph processing is essential for a wide range of applications.
Scalability and memory access patterns are still a challenge, especially with the Breadth-First Search algorithm. This work focuses on leveraging multi-GPU HPC nodes with peer-to-peer support of the Intel oneAPI implementation of SYCL.
We propose three GPU-based load-balancing methods: work-group localisation for efficient data access, even workload distribution for higher GPU occupancy, and a hybrid strided-access approach for heuristic balancing. These methods ensure performance, portability, and productivity with a unified codebase.
Our proposed methodologies outperform state-of-the-art single-GPU implementations based on CUDA on synthetic RMAT graphs. We analysed BFS performance across NVIDIA A100, Intel Max 1550, and AMD MI300X GPUs, achieving a peak performance of 153.27 GTEPS on an RMAT25-64 graph using 8 GPUs on the NVIDIA A100. Furthermore, our work handles RMAT graphs up to scale 29, achieving superior performance on synthetic graphs and competitive results on real-world datasets.

Workshop

Towards Enabling Hostile Multi-tenancy in Kubernetes

3:50pm - 4:20pm CST Monday, 17 November 2025 275

Livestreamed

Recorded

DescriptionKubernetes is the de facto standard for container orchestration but was not designed for hostile multi-tenancy. Native constructs such as namespaces, role-based access control, and admission controllers provide logical separation but lack the strong isolation required in adversarial environments. This paper presents a Kubernetes-compatible architecture that integrates per-tenant virtual control planes, hypervisor-backed sandboxes, and automated policy enforcement to achieve secure multi-tenancy. Each tenant receives a dedicated virtual control plane (via vCluster) linked to a virtual node that schedules workloads into VM-based sandboxes (Azure Container Instances), preserving the Kubernetes API experience. A policy engine (Kyverno) hardens namespaces by enforcing network segmentation, resource limits, and strict security contexts at admission time. Evaluation demonstrates that this approach delivers strong inter-tenant isolation with negligible performance overhead, providing a practical model for zero-trust container orchestration in hostile cloud and edge environments.

Doctoral Showcase

Towards Predictive Digital Twins with Applications to Precision Oncology

3:30pm - 3:45pm CST Thursday, 20 November 2025 230

Research & ACM SRC Posters

Livestreamed

Recorded

DescriptionWell calibrated mathematical and computational models enable the prediction and control of complex systems. These models can be utilized to design engineering systems or to develop treatment protocols. In contrast to one-size-fits-all approaches that seek to mitigate risk at the population level, digital twins enable personalized modeling that seeks to improve decisions at the level of the individual to improve cohort outcomes. This tailored approach is crucial in applications such as precision oncology. In particular, high-grade gliomas exhibit significant heterogeneity in physiology and response to treatment that result in low median survival rates despite an aggressive standard-of-care.

We develop a computational pipeline that utilizes longitudinally collected MRI data to generate a patient-specific computational geometry and estimate the tumor cellularity. The data are then used to inform the spatially varying parameters of mathematical models for tumor growth through the solution of an inverse problem. The high-consequence nature of downstream decisions prompts a rigorous approach to uncertainty quantification. We utilize a Bayesian framework with a focus on scalable and efficient methods to characterize the uncertainty in the model inputs from the sparse, noisy imaging data. Furthermore, we show promising results for therapy planning using a risk-based formulation for optimization under uncertainty.

Workshop

Towards Supporting QIR: Steps for Adopting the Quantum Intermediate Representation

10:45am - 11:00am CST Friday, 21 November 2025 275

Livestreamed

Recorded

DescriptionAs quantum computers grow in qubit count and fidelity, translating applications into hardware-specific instructions becomes essential. Intermediate representations (IRs) help optimize this process. One such IR is Microsoft’s Quantum Intermediate Representation (QIR), built on the LLVM compiler framework. This article explores various ways QIR can be integrated into quantum computing workflows. It demonstrates how to convert an existing quantum circuit simulator into a QIR runtime, showing that the transition is straightforward and does not compromise performance. In fact, adopting QIR enables advanced features like classical control flow, which is crucial for testing quantum error correction protocols. The implementation is open-source and available at https://github.com/cda-tum/mqt-core, and the article concludes with future directions for QIR development.

Paper

TraceFlow: Efficient Trace Analysis for Large-Scale Parallel Applications via Interaction Pattern-Aware Trace Distribution

2:37pm - 3:00pm CST Tuesday, 18 November 2025 263-264

Performance Measurement, Modeling, & Tools

System Software and Cloud Computing

Livestreamed

Recorded

DescriptionTrace analysis of large-scale parallel applications is crucial for understanding and optimizing performance. It primarily focuses on the interaction behaviors between different parallel processes, such as synchronization waits and asynchronous overlaps. The trace size explodes as the parallel scale of applications, thus current methods analyze traces in parallel to ensure analysis speed. However, due to the interaction pattern-agnostic trace distribution, they often introduce inter-process communications to fetch non-local event data during interaction analysis, leading to excessively long trace analysis time.

To address this issue, we propose TraceFlow, a trace analysis tool for large-scale parallel applications, which achieves a nearly communication-free analysis through an interaction pattern-aware trace distribution strategy. We evaluate the efficiency of TraceFlow on widely used benchmarks and several real-world applications with up to 8,192 processes. Experimental results show that TraceFlow achieves an average speedup of 13.49× in the analysis time compared to the state-of-the-art approaches.

Workshop

Transparent Global File System Access in Environments with Multiple Authentication Domains

10:45am - 11:15am CST Sunday, 16 November 2025 276

Livestreamed

Recorded

DescriptionGlobal file systems, whose access spans multiple systems/sub-systems within a High-Performance Computing (HPC) center, are common at many institutions due to a range of benefits they provide. In the vast majority of cases however, they operate under a single authentication domain or in more complex cases, support multiple domains with each getting siloed data access. Recently at the National Center for Supercomputing Applications (NCSA), we have integrated a cluster that pushed us to engineer a solution that provides users the ability to seamlessly access their data, regardless of which authentication domain the system is tied to.
This paper describes the design, technologies, and processes NCSA architected to deliver this capability to researchers and shares the suggested practices we’ve discovered while operating it. Additionally, we lay out the benefits researchers gained by us providing this level of integration between the different authentication domains at the global file system layer.

Paper

Trillion Ligands per Day: Performance-Portable Virtual Screening via Compound Database Optimization and Multi-Target Docking

4:15pm - 4:37pm CST Thursday, 20 November 2025 263-264

Applications

Livestreamed

Recorded

DescriptionStructure-based virtual screening confronts a grand challenge in scaling to trillion-ligand libraries for drug discovery. We present SWDOCKP\textsuperscript{2}, a performance-portable virtual screening framework achieving 1.9 trillion ligand-receptor pairs daily across eight targets on the Sunway OceanLight supercomputer with 40 million cores—10× faster than the prior state of the art. Key innovations combine (1) a ligand database optimizer with conformational sorting and merging, (2) multi-receptor grid alignment enabling parallel target screening and SIMD-accelerated trilinear interpolation, and (3) a Sunway architecture emulator for cross-platform efficiency. These advancements bridge computational scalability with novel drug discovery demands, offering a blueprint for next-generation supercomputing in structure-based drug design. Additionally, SWDOCKP\textsuperscript{2} will generate an unprecedented dataset of predicted protein-ligand interactions, creating a transformative resource for machine learning applications. By addressing experimental data scarcity, this dataset empowers accurate ligand prediction, generative chemistry, and AI-driven drug discovery.

Panel

Trust, but Verify in HPC: Uncertainty for AI and Computing

10:30am - 12:00pm CST Thursday, 20 November 2025 240-241-242

AI, Machine Learning, & Deep Learning

SC Community Hot Topics

Scientific & Information Visualization

Livestreamed

Recorded

DescriptionThis panel will discuss the role of uncertainty in HPC from the perspective of predictive simulation and data-driven modeling, with a focus on future scientific workloads and interpretable AI for science. Why is treatment of uncertainty a necessity for robust prediction? What are the particular challenges and opportunities for probabilistic methods in ModSim at exascale? How can uncertainty quantification be a scaffold for scientific AI/ML, and what are the pitfalls? This discussion will lay the foundation for future work in HPC co-design at the interface of theory, software, and hardware optimization as we prepare for new paradigms of predictive modeling and simulation in the era of AI.

Birds of a Feather

Trusted Research Environments for AI and Integrated Science

12:15pm - 1:15pm CST Tuesday, 18 November 2025 123

Architectures & Networks

Security & Privacy

Livestreamed

Recorded

XO/EX

DescriptionWe explore the multifaceted aspects of managing risk for research projects involving sensitive data and AI models which depend deeply on supercomputing infrastructures. This risk spectrum spans traditional technical cyber controls as well as policy and sociological (including human factors) risks. In the context of multi-facility, multi-institutional workflows such as IRI and the American Science Cloud (AmSC), our goal is to advance progress in developing secure and trustworthy infrastructures for AI and integrated science.

Three intertwined challenges emerge as we advance this vision for IRI and AmSC: technological, policy, and sociological. With the rise of AI and the increasing use of sensitive data for training models, our goal is to leverage this BoF to build a community of practice that will advance a secure and trusted research environment (TRE) that addresses challenges in all three domains. How do we best achieve a TRE that is transparent, reproducible, ethical, secure, worthwhile, and collaborative, with clear data provenance and assurance? How might trust be rightfully earned and retained through modern workflows through managed risk and secure governance?

The outcomes from this BoF are: (1) explore TRE challenges in the age of AI and science integration; (2) identify alignment and divergence in TRE practices; (3) learn from complementary efforts across institutions around the globe; and (4) build a community of practice committed to trustworthy integrated science in the age of AI. We invite an audience with interests in these topics to participate in advancing these outcomes.

Paper

TT-LoRA MoE: Using Parameter-Efficient Fine Tuning and Sparse Mixture-of-Experts

1:52pm - 2:15pm CST Wednesday, 19 November 2025 261-262-265-266

HPC for Machine Learning

Performance Measurement, Modeling, & Tools

Livestreamed

Recorded

DescriptionWe propose Tensor-Trained Low-Rank Adaptation Mixture-of-Experts (TT-LoRA MoE), a novel computational framework integrating parameter-efficient fine-tuning (PEFT) with sparse MoE routing to address scalability challenges in large model deployments. Unlike traditional MoE approaches, which face substantial computational overhead as expert counts grow, TT-LoRA MoE decomposes training into two distinct, optimized stages. First, we independently train lightweight, tensorized low-rank adapters (TT-LoRA experts), each specialized for specific tasks. Subsequently, these expert adapters remain frozen, eliminating inter-task interference and catastrophic forgetting in multi-task settings. A sparse MoE router, trained separately, dynamically leverages base model representations to select exactly one specialized adapter per input at inference time, automating expert selection without explicit task specification. Comprehensive experiments confirm our architecture retains the memory efficiency of low-rank adapters, seamlessly scales to large expert pools, and achieves robust task-level optimization. This structured decoupling significantly enhances computational efficiency and flexibility, enabling practical and scalable multi-task inference deployments.

Paper

TurboFNO: High-Performance Fourier Neural Operator with Fused FFT-GEMM-iFFT on GPU

11:37am - 12:00pm CST Wednesday, 19 November 2025 261-262-265-266

Applications

Architectures & Networks

HPC for Machine Learning

Livestreamed

Recorded

DescriptionFourier neural operators (FNOs) are widely used for learning partial differential equation solution operators. However, FNOs lack architecture-aware optimizations, with their Fourier layers executing FFT, filtering, GEMM, zero padding, and iFFT as separate stages, incurring multiple kernel launches and significant global memory traffic. We propose TurboFNO, the first fully fused FFT-GEMM-iFFT GPU kernel with built-in FFT optimizations. We first develop FFT and GEMM kernels from scratch, achieving performance comparable to cuBLAS and cuFFT. Additionally, our FFT integrates a built-in high-frequency truncation, input zero-padding, and pruning feature to avoid additional memory copy kernels. To fuse FFT and GEMM, we propose an FFT variant where a threadblock iterates over hidden dimension to align with GEMM’s $k$-loop, along with two shared memory swizzling patterns that ensure 100\% bank utilization when forwarding FFT output to GEMM and retrieving results for iFFT. Experimental results show TurboFNO outperforms PyTorch, cuBLAS, and cuFFT by up to 150\%.

Art of HPC

Turbulent Strata

8:00am - 6:00pm CST Sunday, 16 November 2025 Art of HPC - Plaza Lobby

Art of HPC

Not Livestreamed

Not Recorded

TUT

XO/EX

DescriptionWe here visualize a snapshot in time from one of the largest numerical simulations of stratified turbulence performed to date, generated by directly simulating the Navier–Stokes equations on the Frontier supercomputer (Oak Ridge National Laboratory) through an INCITE award. Stratified turbulence refers to chaotic motion arising in a fluid of variable density, where buoyancy forces strongly influence the flow dynamics. This type of turbulence plays a central role in a variety of natural and industrial processes, such as influencing the dispersion of heat and pollutants in the ocean and atmosphere, but remains poorly understood due to the vast range of interacting length scales that must be resolved.

A zoomed-in vertical slice of the 6 trillion grid point simulation is shown here, yielding unprecedented resolution into the rich variety of flow structures underpinning the turbulence. Energy is injected at large scales to drive the turbulence, which cascades down through progressively smaller structures until eventually being dissipated by viscosity. The colors in the visualization show perturbations to the fluid’s density relative to the stable background gradient: red and blue indicate lighter and denser fluid, respectively.

This simulation is one of the first to fully resolve stratified turbulence at high Prandtl number, meaning that momentum diffuses significantly faster than density, as is characteristic of ocean flows. A challenge in simulating high Prandtl flows is that the density field develops extremely fine structures that require immense resolution to capture, necessitating the use of Frontier.

Workshop

TUSQ: Tracking, Uncomputation, and Sampling for Noisy Quantum Simulation

2:00pm - 2:05pm CST Monday, 17 November 2025 276

Partially Livestreamed

Partially Recorded

DescriptionQuantum computers have grown rapidly, enabling execution of complex quantum circuits. However, for most researchers, access to compute time on quantum hardware is limited. Thus we need to build simulators that mimic execution of quantum circuits on noisy quantum hardware efficiently.

Here, we propose TUSQ, that can perform noisy simulation of up to 30-qubit Adder circuits on single A100 GPU in lessthan 820seconds. To represent the stochastic noisy channels, we average the output of multiple quantum circuits with fixed noisy gates sampled from channels. This leads to increase in circuit overhead, which slows down the simulation. To eliminate this overhead, TUSQ uses two modules: the \textit{Error Characterization} (ECM), and the Tree-based Execution} (TEM).

The ECM tracks number of unique circuit executions to accurately represent noise. This is followed by TEM, which reuses computation across reduced circuits. We evaluate and report average speedup of $52.5\times$ and $12.53\times$ over Qiskit and CUDA-Q.

Workshop

Twelfth SC Workshop on Best Practices for HPC Training and Education

9:00am - 9:10am CST Monday, 17 November 2025 232

Livestreamed

Recorded

DescriptionThe inherent wide distribution, heterogeneity, and dynamism of the current and emerging high-performance computing and software environments increasingly challenge cyberinfrastructure facilitators, trainers, and educators. The challenge is how to support and train the current multidisciplinary users and prepare the future educators, researchers, developers, and policymakers to keep pace with the rapidly evolving HPC environments to advance discovery and economic competitiveness for many generations.

The twelfth annual full-day workshop on HPC training and education is an ACM SIGHPC Education Chapter coordinated effort, aimed at fostering more collaborations among the practitioners from traditional and emerging fields to explore educational needs in HPC, to develop and deploy HPC training, and to identify new challenges and opportunities for the latest HPC platforms. The workshop will also be a platform for disseminating results and lessons learned in these areas and will be captured in a special edition of the Journal of Computational Science Education.

Workshop

Twelfth Workshop on Accelerator Programming and Directives (WACCPD 2025)

9:00am - 9:10am CST Monday, 17 November 2025 266

Livestreamed

Recorded

DescriptionHeterogeneous node architectures are ubiquitous in today’s HPC landscape. Exploiting the compute capability, while maintaining code portability and maintainability, necessitates effective accelerator programming approaches. The use of these programming approaches remains a research activity, and there are many possible trade-offs between performance, portability, maintainability, and ease of use that must be considered. Additionally, new heterogeneous computing concepts are being deployed, like ML/AI chips and QPUs, introducing challenges related to algorithms, portability, and standardization of programming models.

The WACCPD workshop highlights the improvements over state-of-the-art through accepted papers and talks. The event will also foster discussion with invited talks and a panel to draw the community’s attention to key areas that will facilitate the transition to accelerator-based HPC, including AI, and quantum computing. The workshop aims to showcase all aspects of innovative language features, lessons learned while using directives/abstractions to migrate scientific code, and experiences using novel accelerator architectures, among others.

Birds of a Feather

UALink and Ultra Ethernet: Addressing AI Networking Challenges in an Open Ecosystem

12:15pm - 1:15pm CST Wednesday, 19 November 2025 276

Artificial Intelligence & Machine Learning

Livestreamed

Recorded

XO/EX

DescriptionTraditional interconnects no longer meet the performance, latency, and scalability demands of AI systems, prompting the need for new approaches to data movement at the node and data center levels. In response, two new open industry organizations have emerged: UALink and the Ultra Ethernet Consortium (UEC). This BoF will discuss the two organizations' approaches to meeting the growing network demands for AI applications.

This presentation will also explore the architectural challenges driving the need for these new fabrics and explain how UALink and UEC are addressing AI networking challenges.

Flash Session

Ultra Ethernet Starts Here: Slingshot’s Role in the Next HPC Era

5:00pm - 5:15pm CST Tuesday, 18 November 2025 Booth 2638 - Flash Session

Not Livestreamed

Not Recorded

XO/EX

DescriptionThe Ultra Ethernet Consortium is reimagining Ethernet to surpass InfiniBand—while preserving its legacy of ubiquity and cost-effectiveness. HPE Slingshot is not just aligned with this vision—it’s helping lead it. Slingshot engineers are contributing extensively to the Ultra Ethernet specification, with several foundational components drawn directly from Slingshot’s proven innovations in fabric switching and RDMA NICs. But Slingshot goes further. While Ultra Ethernet raises the industry floor—establishing a genuine path to multi-vendor interoperability for RDMA for the first time—HPE Slingshot pushes the ceiling. It delivers advanced capabilities that far exceed the specification to deliver unmatched scalability, effective congestion control, and price/performance. From exascale systems like El Capitan to mainstream HPC in engineering, energy, and academia, HPE Slingshot is making the vision of Ultra Ethernet a reality today.

Paper

UltraAttn: Efficiently Parallelizing Attention Through Hierarchical Context-Tiling

11:15am - 11:38am CST Thursday, 20 November 2025 275

HPC for Machine Learning

State of the Practice

System Software and Cloud Computing

Livestreamed

Recorded

DescriptionLong-context comprehension is a crucial capability for LLM. Context parallelism and irregular block sparse attention are two key technologies to accelerate long contextual training and inference. Existing context parallelism techniques for attention suffer from poor scalability, owing to their common characteristics: the striped-like partition pattern. The striped-like partition pattern causes high communication traffic and inflexible kernel granularity, which in turn results in low single-kernel device utilization.

To address these problems, we propose UltraAttn, a novel context parallelism solution for irregular attention. UltraAttn hierarchically tiles the context to reduce communication cost. UltraAttn also performs context-tiling at the kernel level to adjust the granularity of kernels to trade off between kernel overlap and single-kernel device utilization. UltraAttn executes distributed attention with an ILP-based runtime to optimize latency. We evaluate UltraAttn on 64 GPUs. UltraAttn achieves 5.5× speedup on average in different types of irregular attention over the state-of-the-art context parallelism techniques.

Workshop

Umpire: Portable Memory Management for High-Performance Computing Applications

3:30pm - 3:50pm CST Sunday, 16 November 2025 274

Livestreamed

Recorded

DescriptionModern high-performance computing (HPC) systems present application
developers with increasingly complex memory hierarchies
that include multiple types of memory with varying access patterns,
capacities, and performance characteristics. Managing these
resources efficiently while maintaining code portability across different
architectures remains a significant challenge. To address
these challenges, Umpire was developed at Lawrence Livermore
National Laboratory (LLNL) as an open-source library that provides
a unified, portable memory management API for modern HPC platforms
with multiple memory devices like NUMA and GPUs. This
paper explores Umpire’s design principles, outlines Umpire’s primary
performance advantages, and examines howits memory pools
can provide speedups of 15x or greater. Next, it demonstrates how
its integration with the RAJA Portability Suite enables the development
of portable and performant HPC applications. With real-world
examples from LLNL’s production codes, Umpire provides a comprehensive
solution for managing the challenges of performance
portability in modern HPC environments.

Research and ACM SRC Posters

Understanding Communication Bottlenecks in Multi-Node LLM Inference

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionAs large language models (LLMs) grow in parameter count, efficient generation requires inference to scale beyond a single node. Current approaches use tensor parallelism (TP) or pipeline parallelism (PP), but TP incurs high communication volume, while PP suffers from pipeline bubbles and is unsuitable for latency-critical scenarios. We present Yalis (Yet Another LLM Inference System), a lightweight and modular distributed inference framework that performs comparably to existing state-of-the-art systems for offline inference, while enabling rapid prototyping. Using Yalis, we study strong scaling of LLM inference on the Alps and Perlmutter supercomputers, revealing the poor scaling performance of existing parallelism strategies due to high communication overheads. We further compare the all-reduce performance of NCCL and MPI in the small-message regime, finding that while NCCL is efficient intra-node, MPI can outperform it cross-node for messages between 256-1024 KB. These results motivate the need for communication-efficient parallelism strategies for multi-node LLM inference.

Research and ACM SRC Posters

Understanding GPU Utilization Using LDMS Data on Perlmutter

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionGPGPU-based clusters and supercomputers have grown significantly in popularity over the past decade. While numerous GPGPU hardware counters are available to users, their potential for workload characterization remains underexplored. In this work, we analyze previously overlooked GPU hardware counters collected via the Lightweight Distributed Metric Service on Perlmutter. We examine spatial imbalance, defined as uneven GPU usage within the same job, and perform a temporal analysis of how counter values change during execution. Using temporal imbalance, we capture deviations from average usage over time. Our findings reveal inefficiencies and imbalances that can guide workload optimization and inform future HPC system design.

Research and ACM SRC Posters

Understanding LLM Behavior on HPC Data via Mechanistic Interpretability

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionLarge language models (LLMs) are increasingly used in HPC for tasks like code generation and analysis, but their internal reasoning remains opaque. To address this, we study three tasks—OpenMP code completion, data race detection, and OMP code generation—using mechanistic interpretability. Sparse autoencoder ablations reveal causal features, function vector injection improves zero-shot predictions and direction vector shifts the model's output toward a desired behavior or style, even without explicitly stating it in the prompt. These methods expose and influence LLM behavior in HPC contexts.

Birds of a Feather

Unified Communication X (UCX) Community

12:15pm - 1:15pm CST Tuesday, 18 November 2025 276

Architectures & Networks

Community Meetings

Livestreamed

Recorded

XO/EX

DescriptionIn order to exploit the capabilities of new HPC systems and to meet their demands in scalability, communication software needs to scale on millions of cores and support applications with adequate functionality. UCX is a collaboration between industry, national labs, and academia that consolidates and provides a unified open-source framework.

The UCX project is managed by the UCF Consortium (http://www.ucfconsortium.org/) and includes members from Los Alamos National Laboratory, Argonne National Laboratory, Ohio State University, AMD, NVIDIA, and more. The session will serve as the UCX community meeting and will introduce the latest developments to HPC developers and the broader user community.

Best Poster Presentations (Research, ACM SRC Grad/Undergrad)

Unified Performance Modeling Stack for Distributed GPU Applications: Complementing Analytical Insights with Machine Learning

3:45pm - 4:00pm CST Wednesday, 19 November 2025 230

Research & ACM SRC Posters

DescriptionModern HPC applications increasingly use GPUs to solve larger problems with higher accuracy and speed. However, committing resources to these large-scale systems is often costly and time-consuming. Hence, performance modeling enables developers to estimate runtime, analyze scalability, and identify resource bottlenecks in advance. In this work, we propose a unified software ecosystem for end-to-end performance modeling of distributed GPU applications. To this end, we propose a combination of analytical and machine learning-based modeling methodology, and design a comprehensive software stack to combine the various components for implementing such an approach. We validate the proposed framework using two real-life applications and provide performance estimations for the GPU kernel and inter-GPU MPI communications.

Research and ACM SRC Posters

Unmasking Performance Variability in GPU Codes on Supercomputers

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionPerformance variability is often a critical issue on GPU-accelerated systems, undermining efficiency and reproducibility. Since large-scale investigations of performance variability on GPU clusters are lacking, we set up a longitudinal experiment on Perlmutter and Frontier. We benchmark representative HPC and AI applications and collect detailed performance data to assess the impact of compute variability, allocated node topology, and network conditions on overall runtime. We also use an ML-based approach to identify potential correlations between these factors and to forecast the execution time. Our analysis identifies network performance as the dominant source of runtime variability. These findings provide crucial insights that can inform the development of future mitigation strategies.

Paper

Uno: A One-Stop Solution for Inter- and Intra-Data Center Congestion Control and Reliable Connectivity

1:30pm - 1:52pm CST Wednesday, 19 November 2025 260-267

Architectures & Networks

Livestreamed

Recorded

DescriptionCloud computing and AI workloads are driving unprecedented demand for efficient communication within and across datacenters. However, the coexistence of intra- and inter-datacenter traffic within datacenters, plus the disparity between the RTTs of intra- and inter-datacenter networks, complicates congestion management and traffic routing. Particularly, faster congestion responses of intra-datacenter traffic causes rate unfairness when competing with slower inter-datacenter flows. Additionally, inter-datacenter messages suffer from slow loss recovery and, thus, require reliability. Existing solutions overlook these challenges and handle inter- and intra-datacenter congestion with separate control loops or at different granularities. We propose Uno, a unified system for both inter- and intra-datacenter environments that integrates a transport protocol for rapid congestion reaction and fair rate control with a load-balancing scheme that combines erasure coding and adaptive routing. Our findings show that Uno significantly improves the completion times of both inter- and intra-datacenter flows compared to state-of-the-art methods such as Gemini.

Research and ACM SRC Posters

Unraveling Distant Galaxies: Analyzing IFU Data with Parsl and Academy

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionIntegral field spectroscopy is a powerful technique in observational astrophysics enabling the study of spatially-complex objects like distant strongly-lensed galaxies. Integral field units (IFUs) are an increasingly common addition to many powerful ground- and space-based observatories, which motivates the need for efficient and accurate data reduction pipelines. Using Parsl, we developed a scalable processing pipeline to obtain spatially-resolved calibrated spectra for the integral field fiber head at Magellan (IFU-M) and the Magellan/Michigan Fiber System (M2FS) at Las Campanas Observatory, Chile. To enable fast filtering of cosmic rays, we integrated an Academy agent into the pipeline that can learn the time-evolving parameters of the instrument and accelerate that step by 1.5 while reducing the noise in the output spectra. We scaled the pipeline to 32 nodes, allowing one night of data to be processed in 25 minutes, a 16x speedup.

Workshop

Up-scaling Python functions for HPC with executorlib

3:45pm - 4:00pm CST Monday, 17 November 2025 242

Livestreamed

Recorded

DescriptionThe up-scaling of Python workflows from the execution on a local workstation to the parallel execution on an HPC typically faces three challenges: (1) the management of inter-process communication, (2) the data storage and (3) the management of task dependencies during the execution. These challenges commonly lead to a rewrite of major parts of the reference serial Python workflow to improve computational efficiency. Executorlib addresses these challenges by extending Python’s ProcessPoolExecutor interface to distribute Python functions on HPC systems. It interfaces with the job scheduler directly without the need for a database or daemon process, leading to seamless up-scaling.

Paper

UpANNS: Enhancing Billion-Scale ANNS Efficiency with Real-World PIM Architecture

10:30am - 10:52am CST Wednesday, 19 November 2025 260-267

Applications

Architectures & Networks

Livestreamed

Recorded

DescriptionApproximate Nearest Neighbor Search (ANNS) is a critical component of modern AI systems, such as recommendation engines and retrieval-augmented large language models (RAG-LLMs). However, scaling ANNS to billion-entry datasets exposes critical inefficiencies: CPU-based solutions are bottlenecked by memory bandwidth limitations, while GPU implementations underutilize hardware resources, leading to suboptimal performance and energy consumption. We introduce UpANNS, a novel framework leveraging Processing-in-Memory (PIM) architecture to accelerate billion-scale ANNS. UpANNS integrates four key innovations, including: architecture-aware data placement to minimize latency through workload balancing; dynamic resource management for optimal PIM utilization; co-occurrence optimized encoding to reduce redundant computations; and an early-pruning strategy for efficient top-k selection. Evaluation on commercial UPMEM hardware demonstrates that UpANNS achieves 4.3x higher QPS than CPU-based Faiss, while matching GPU performance with 2.3x greater energy efficiency. Its near-linear scalability ensures practicality for growing datasets, making it ideal for applications like real-time LLM serving and large-scale retrieval systems.

Workshop

Usability Evaluation of Cloud for HPC Applications

2:00pm - 2:30pm CST Monday, 17 November 2025 275

Livestreamed

Recorded

DescriptionThe rise of AI and the economic dominance of cloud computing have created a new nexus of innovation for high performance computing (HPC), which has a long history of driving scientific discovery. Beyond performance needs, scientific workflows increasingly demand capabilities of cloud environments: portability, reproducibility, dynamism, and automation. As converged cloud-HPC environments emerge, there is growing need to study their suitability for HPC use cases. Here we present a cross-platform usability study that assesses 11 different HPC proxy applications and benchmarks across three clouds (Microsoft Azure, Amazon Web Services, and Google Cloud), six environments, and two compute configurations (CPU and GPU) against on-premises HPC clusters at the Lawrence Livermore National Laboratory. We perform applications scaling tests in all environments up to 28,672 CPUs and 256 GPUs. We present methodology and results to guide future study and provide a foundation to define best practices for running HPC workloads in cloud.

Workshop

Using Code Coverage to Assess Feature Gaps in MPI Correctness Tool Classification Tests

10:30am - 10:55am CST Monday, 17 November 2025 261

Debugging & Correctness Tools

HPC Software & Runtime Systems

Livestreamed

Recorded

In-person

DescriptionWe examine the code generator-based MPI correctness benchmark MPI-BugBench (MBB) by analyzing the code coverage it triggers in three tools: MUST, PARCOACH, and clang-tidy. Our analysis complements MBB’s design, which prunes potentially exhaustive test sets based on real-world MPI usage. Our assessment identifies two key limitations in the generated tests: incomplete coverage of MPI features, such
as varying-count collectives, and limited structural diversity of the generated tests, such as lack of loops and lack of array-based MPI handles. We find increasing test volume alone offers limited benefit for exercising the tool's analysis code in our assessment.
To address these gaps, we propose a new generation level with the missing features and more varied code structures. To that end, we implemented 34 additional tests to exercise previously uncovered analysis code, adding as many as 770 lines of code coverage in MUST with a single test for varying-count collectives.

Tutorial

Using Containers To Accelerate HPC

8:30am - 5:00pm CST Sunday, 16 November 2025 132

Livestreamed

Recorded

TUT

DescriptionThe use of containers has revolutionized the way in which industries and enterprises have developed and deployed computational software and distributed systems. This containerization model has gained traction within the HPC community as well, with the promise of improved reliability, reproducibility, portability, and levels of customization that were not previously possible on supercomputers. This adoption has been enabled by a number of HPC container runtimes that have emerged, including Singularity, Shifter, Sarus, Podman, and others. This hands-on tutorial looks to train users on the use of containers for HPC use cases. We will provide a detailed background on Linux containers, along with an introductory hands-on experience building a container image, sharing the container, and running it on an HPC cluster. Furthermore, the tutorial will provide more advanced information on how to run MPI-based and GPU-enabled HPC applications, how to optimize I/O intensive workflows, and how to set up GUI-enabled interactive sessions. Cutting-edge examples will include machine learning and bioinformatics. Users will leave the tutorial with a solid foundational understanding of how to utilize containers on HPC resources using Podman, Shifter, and Singularity, and in-depth knowledge to deploy custom containers on their own resources.

Best Poster Presentations (Research, ACM SRC Grad/Undergrad)

Using Hardware Metrics To Understand Performance of the RAJA Performance Suite Kernels in Different GPU Modes on MI300A

4:15pm - 4:30pm CST Wednesday, 19 November 2025 230

Research & ACM SRC Posters

DescriptionModern GPUs play a crucial role in accelerating a wide range of computational workloads. However, their performance is often limited by the memory access patterns of the kernels they execute. AMD’s MI300A APU supports multiple logical GPU partitioning modes to optimize compute resource allocation, offering new opportunities for performance tuning. In this work, we evaluate how different GPU kernels from the RAJA Performance Suite perform in various partitioning modes. Using hardware counters, we compare two kernels with identical computational complexity but different data layouts, highlighting how memory organization can influence performance outcomes. The results demonstrate that data layout and access patterns have a significant impact on runtime performance across different partitioning modes, even when computational complexity and problem size remain constant.

Birds of a Feather

Utilizing Inference for Science Services Within the Academic Community

5:15pm - 6:45pm CST Tuesday, 18 November 2025 131

Artificial Intelligence & Machine Learning

Livestreamed

Recorded

XO/EX

DescriptionThis BoF explores the emerging role of AI inference services in the academic HPC community. Participants will share use cases, best practices, and strategies for deploying inference across research, education, and operations, including specialized applications such as indigenous language models. Through lightning talks, live polls, and interactive panel discussions, the session aims to identify shared challenges, opportunities for service sharing, and the academic value proposition of inference in science. The ultimate goal is to foster a cross-institutional “community of practice” among global academic and governmental HPC centers.

Paper

Utilizing Sparsity in the GPU-Accelerated Assembly of Schur Complement Matrices in Domain Decomposition Methods

3:52pm - 4:15pm CST Wednesday, 19 November 2025 260-267

Algorithms

Livestreamed

Recorded

DescriptionSchur complement matrices emerge in many domain decomposition methods that can utilize supercomputers to solve complex engineering problems. As most of today's high-performance clusters' performance lies in GPUs, these methods should also be accelerated.

Typically, the offloaded components are the explicitly assembled dense Schur complement matrices used later in the iterative solver for multiplication with a vector. As the explicit assembly is expensive, it adds a significant overhead to this approach of acceleration. It has already been shown that the overhead can be minimized by assembling the Schur complements directly on the GPU.

This paper shows that the GPU assembly can be further improved by wisely utilizing the matrix sparsity. In the context of FETI, we achieved a speedup of 5.1 in the GPU section of the code and 3.3 for the whole assembly, making the acceleration beneficial from as few as 10 iterations for subdomains with 1,000-70,000 unknowns.

Research and ACM SRC Posters

VaultX Merge: Breaking Memory Barriers in Proof-of-Space Plot Generation

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionProof-of-work blockchains, like Bitcoin, consume substantial energy, motivating greener alternatives such as proof-of-space (PoSp), which relies on storage rather than computation. Existing PoSp implementations face scalability challenges due to high memory and I/O requirements, especially when generating large plots needed for fast lookups.

We present VaultX Merge, a novel out-of-memory PoSp plot generation method. VaultX Merge first creates multiple small in-memory subplots and then merges them into large plots, reducing redundant storage I/O, minimizing data size, and improving lookup latency. Our approach enables efficient operation across a range of devices, from small nodes like Raspberry Pis to high-end servers.

In our poster, we will demonstrate VaultX Merge’s performance across different hardware and storage configurations, showing up to 50% faster plot generation compared to previous out-of-RAM implementations, and highlight how this approach facilitates scalable, energy-efficient blockchain participation.

Doctoral Showcase

Viral Pneumonia Disease Classification with Machine Learning Techniques

4:45pm - 5:00pm CST Thursday, 20 November 2025 230

Research & ACM SRC Posters

Livestreamed

Recorded

DescriptionPneumonia is a dreadful condition that is the primary cause of death globally for individuals of all ages, but it is especially dangerous for small children who are younger than five. The radiological results obtained from an X-ray could lead to mistakes, incorrect diagnoses, and unnecessary delays. Datasets with chest X-ray were acquired from a Hopkins Diagnostic Center and 10 classifiers were applied. This work aims to develop ensemble machine and transfer learning models to classify viral pneumonia disease and apply ensemble techniques to the models. The models incorporate a variety of machine learning approaches, including k-nearest neighbors (KNN), decision tree (DT), random forest (RF), logistic regression (LR), and support vector machine (SVM). Furthermore, the transfer learning approach is used on the deep learning architectures VGG-19, DenseNet-121, GoogLeNet, AlexNet, and MobileNet-V2. The Keras code backend was implemented using TensorFlow.

On the general model’s performance: The model's output using the local dataset with 1,113 has the SVM, KNN, RF, LR, AlexNet, and GoogLeNet with 97%, 98%, 95%, 94%, 99%, and 100%, respectively, as the best in their performances; while DT, MobileNet, VGG-19, and DenseNet, with 89%, 80%, 77%, and 70% respectively, are the lowest in performance. The Max Voting Ensemble yielded 97% and weighted average yielded 98%. The results obtained from the analysis revealed the highest performance classification abilities of the seven models. The classification models for viral pneumonia disease enhance clinical practice by enabling improved interpretation of results, early prediction, detection, and life-saving interventions.

Art of HPC

Visualization of a High-Fidelity Wind Farm Simulation with ExaWind

8:00am - 6:00pm CST Sunday, 16 November 2025 Art of HPC - Plaza Lobby

Art of HPC

Not Livestreamed

Not Recorded

TUT

XO/EX

DescriptionThe image depicts the flow of wind through a wind farm comprising four five-megawatt wind turbines. The flow and turbine motions are shown in real time. The blades and tower are in a deflected state due to fluid-structure interaction. Vortical flow structures are colored by velocity magnitude to indicate turbulent high-speed flow regions.

Art of HPC

Visualization of Detonation Research Test Facility at Texas A&M University

8:00am - 6:00pm CST Sunday, 16 November 2025 Art of HPC - Plaza Lobby

Art of HPC

Not Livestreamed

Not Recorded

TUT

XO/EX

DescriptionLocated at the Texas A&M University System RELLIS Campus, the Detonation Research Test Facility (DRTF) exemplifies pioneering research, led by Dr. Elaine Oran, a world-renowned authority on the physics of explosions. The DRTF is dedicated to understanding the intricate dynamics of detonations, where flammable gases and materials interact under extreme conditions to produce powerful and potentially devastating outcomes.

At the heart of this facility lies a massive steel tube, stretching 150 meters in length with a diameter of 2 meters and reinforced by 3/4-inch thick walls. This immense structure is designed to safely contain and replicate the magnitude of real-world detonation events, providing researchers with the rare opportunity to observe and analyze explosive phenomena at scale.

Complementing this experimental infrastructure, Dr. Jian Tao leads a groundbreaking initiative to create a digital twin of the DRTF. This virtual replica is being developed at the Texas A&M RELLIS Campus to enable near-real-time simulation, predictive analysis, and operational optimization. By integrating high performance computing and advanced visualization, the digital twin promises to transform the way researchers model, test, and refine their understanding of detonation processes, enhancing both safety and scientific insight.

The 3D model showcased here was created by Sina Alidoust Salimi and Britain Thomas, master’s students in the College of Performance, Visualization and Fine Arts. Their contribution highlights the power of interdisciplinary collaboration at Texas A&M, where art, science, and technology converge to advance research capabilities and safeguard future innovations.

Art of HPC

Visualizing Clouds - Evolution of Technology & Techniques

8:00am - 6:00pm CST Sunday, 16 November 2025 Art of HPC - Plaza Lobby

Art of HPC

Not Livestreamed

Not Recorded

TUT

XO/EX

Description3d volumetric clouds (VDBs) are rendered from particles. The clouds are then composited into a still frame from NVIDIA's Omniverse Flight example USD project. When animated the camera flights through the 3d cloud formation, revealing the internal structure of the clouds.

Early Career

Voices from the Field: Navigating Careers in Academia, Government, and Industry

3:30pm - 4:45pm CST Monday, 17 November 2025 262

Not Livestreamed

Not Recorded

TUT

XO/EX

DescriptionThis session will feature a diverse, moderated panel of individuals from academia, government, and industry, who will share their career stories and answer questions from participants.

Research and ACM SRC Posters

Wafer-Scale Simulation of Mutator Allele Dynamics in Large Asexual Populations

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionWithin evolving microbial populations, genes that elevate mutation rate impose a fundamental trade-off: on one hand, increasing harmful mutations among offspring, but, on the other, allowing more opportunities for rare beneficial mutations. Existing single-CPU agent-based simulation work suggests that increased population size should generally favor the proliferation of mutator alleles due to "hitch-hiking" effects associated with beneficial mutation discovery. However, in contrast to this expectation, this outcome is often not the case in large asexual populations found in nature. To address this knowledge gap, we leveraged the 850,000-processor Cerebras Wafer-Scale Engine (WSE) to increase simulation scale up to 1.5 billion-agent populations. In benchmarks, WSE provided 294× speedup over GPU and 111,091× speedup over single-core CPU execution. Among other results, our experiments indicate that limitation of adaptive potential (i.e., few beneficial mutations available) can produce a tertiary regime where complete mutator allele takeover becomes disfavored at very large population sizes.

Workshop

WAGES: Workload-Aware GPU Sharing System for Energy-Efficient Serverless LLM Serving

11:20am - 11:40am CST Sunday, 16 November 2025 264

Partially Livestreamed

Partially Recorded

DescriptionServerless LLM serving lowers costs by elastically provisioning GPUs and charging only for usage. However, current systems mostly target cold-start latency, overlooking inefficiencies: (i) static, exclusive GPU allocation that wastes compute resources and increases costs, and (ii) fixed hardware-controlled clock speeds that waste energy. Our analysis shows many LLM workloads can meet SLOs with partial SM allocations and reduced clock speeds, enabling GPU multiplexing and dynamic clock scaling. We present WAGES, a workload-aware GPU sharing system that uses NVIDIA MPS to co-locate LLMs, dynamically adjusting SM partitions and clock speeds to workload needs while meeting SLOs. A two-tier scheduler coordinates global GPU consolidation and local SLO-aware tuning, overlapping model/KV migration with execution to reduce reconfiguration overhead. On real LLM traces, WAGES improves SLO attainment by up to 4% over prior GPU sharing approaches and reduces energy use by up to 26%.

Paper

Wasp: Efficient Asynchronous Single-Source Shortest Path on Multicore Systems via Work Stealing

3:52pm - 4:14pm CST Thursday, 20 November 2025 260-267

Algorithms

Livestreamed

Recorded

DescriptionThe Single-Source Shortest Path (SSSP) problem is a fundamental graph problem with an extensive set of real-world applications. State-of-the-art parallel algorithms for SSSP, such as the ∆-stepping algorithm, create parallelism through priority coarsening. Priority coarsening results in redundant computations that diminish the benefits of parallelization and limit parallel scalability.

This paper introduces Wasp, a novel SSSP algorithm that reduces parallelism-induced redundant work by utilizing asynchrony and an efficient priority-aware work stealing scheme. Contrary to previous work, Wasp introduces redundant computations only when threads have no high-priority work locally available to execute. This is achieved by a novel priority-aware work stealing mechanism that controls the inefficiencies of indiscriminate priority coarsening.

Experimental evaluation shows competitive or better performance compared to GAP, GBBS, MultiQueues, Galois, ∆*-stepping, and ρ-stepping on 13 diverse graphs with geometric mean speedups of 2.26x on AMD Zen 3 and 2.16x on Intel Sapphire Rapids using 128 threads.

Workshop

Weak Scaling of NVSHMEM Applied To Hashed Distributed Structured Data

2:20pm - 2:40pm CST Sunday, 16 November 2025 230

Livestreamed

Recorded

DescriptionWe present a parallel, hash-based software library, HashBrick, for sparse, block-structured applications on CPUs and NVIDIA GPUs. We use a brick-based layout, where data is aggregated into small regular bricks, and hashes hide the complexity of neighbor indexing. This exposes an extra level of flexibility for irregular data and avoids packing communication buffers for ghost zones. One-sided NVSHMEM is used for GPU-GPU communication of an irregular distribution of ghost bricks. Weak scaling experiments using a high-order CFD application and a Jacobi iteration benchmark were run on a NVIDIA GH200 cluster. For constant problem size per node, the computation time is constant, and communication scales well despite load imbalance. We find that the variation in the distribution of ghost bricks more or less correlates with scaling efficiency. Our results show that MPI and NVSHMEM have similar scaling, but MPI wall-clock times are 64-84% higher for these experiments.

Workshop

Welcome

9:00am - 9:10am CST Sunday, 16 November 2025 231

Livestreamed

Recorded

Workshop

Welcome

9:01am - 9:06am CST Monday, 17 November 2025 264

Livestreamed

Recorded

Early Career

Welcome & Intro Session

8:30am - 10:00am CST Monday, 17 November 2025 262

Not Livestreamed

Not Recorded

TUT

XO/EX

DescriptionWelcome
Committee introductions
Icebreaker
Program overview
Q&A

Workshop

Welcome and Opening Presentation

9:00am - 9:30am CST Sunday, 16 November 2025 275

Livestreamed

Recorded

DescriptionThis session will explore how quantum computing is rapidly evolving from a specialized research domain into an integral part of the high-performance computing (HPC) landscape. We will examine the convergence of classical supercomputing and quantum architectures, highlighting systems such as Fugaku, Frontier, and IBM Quantum System Two, and discuss how hybrid orchestration across CPUs, GPUs, and QPUs is redefining scientific computing. The session will trace the engineering milestones on the path toward fault-tolerant quantum computing by 2029, emphasizing the role of open-source frameworks like Qiskit and new workload-management integrations that make quantum resources first-class citizens in HPC environments. By presenting a full-stack view, from hardware and middleware to compilers and algorithms, we’ll demonstrate how HPC-native quantum computing can accelerate breakthroughs in chemistry, optimization, materials science, and AI, ushering in a new era of quantum-centric supercomputing.

Workshop

Welcome to PAW-ATM

9:00am - 9:05am CST Sunday, 16 November 2025 230

Livestreamed

Recorded

Workshop

What can high-performance computing centers do with quantum computing?

4:00pm - 4:45pm CST Sunday, 16 November 2025 275

Livestreamed

Recorded

DescriptionUnderstanding how quantum and classical high-performance computing can work together is important to unlock the full potential of quantum computing. Together with a panel of experts, we discuss use cases for hybrid quantum-classical workflows, and how existing HPC centers can adopt the technology required to implement quantum-centric supercomputing at scale.

Flash Session

What It Takes To Install a Quantum Computer in an HPC Center

11:45am - 12:00pm CST Wednesday, 19 November 2025 Booth 2638 - Flash Session

Not Livestreamed

Not Recorded

XO/EX

DescriptionIncorporating quantum computers into high performance computing (HPC) environments (commonly referred to as HPC+QC integration) marks a pivotal step in advancing computational capabilities for scientific research. Here we report on the integration of a superconducting 20-qubit quantum computer into the HPC infrastructure at Leibniz Supercomputing Centre (LRZ), one of the first practical implementations of its kind. This yielded four key lessons: (1) quantum computers have stricter facility requirements than classical systems, yet their deployment in HPC environments is feasible when preceded by a rigorous site survey to ensure compliance; (2) quantum computers are inherently dynamic systems that require regular recalibration that is automatic and controllable by the HPC scheduler; (3) redundant power and cooling infrastructure is essential; and (4) effective hands-on onboarding should be provided for both quantum experts and new users. The identified conclusions provide a roadmap to guide future HPC center integrations.

Paper

What To Support When You’re Compressing: The State of Practice, Gaps, and Opportunities for Scientific Data Compression

1:30pm - 1:52pm CST Thursday, 20 November 2025 261-262-265-266

Algorithms

Applications

State of the Practice

Livestreamed

Recorded

DescriptionOver the last nearly 20 years, lossy compression has become an essential aspect of HPC applications' data pipelines, allowing them to overcome limitations in storage capacity and bandwidth and, in some cases, increase computational throughput and capacity. However, with the adoption of lossy compression comes the requirement to assess and control the impact lossy compression has on scientific outcomes.

In this work, we take a major step forward in describing the state of practice and characterizing workloads. We examine applications' needs and compressors' capabilities across nine different supercomputing application domains. We present 25 takeaways that provide best practices for applications, operational impacts for facilities achieving compressed data, and gaps in application needs not addressed by production compressors that point towards opportunities for future compression research.

Research and ACM SRC Posters

When Label Propagation Outperforms BFS in Breadth-First Graph Traversal

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionWe tackle the challenge of breadth-first traversal (BFT) on sparse graphs with a high number of connected components. We propose a novel distributed-memory parallel algorithm that uses the label propagation (LP) algorithm to perform BFT on all connected components of the graph simultaneously. In synthetic benchmarks with RMAT-like graphs, we show that our LP-based algorithm can be up to 77x faster compared to the parallel direction-optimized BFS in the Combinatorial BLAS library, while scaling up to 1.5k CPU cores.

Birds of a Feather

Where Could Europe Add Value? ETP4HPC (The European HPC Technology Platform) Showcases Collaboration Opportunities Involving Next-Generation European Technology

5:15pm - 6:45pm CST Wednesday, 19 November 2025 275

Emerging Hardware & Software Technologies

Livestreamed

Recorded

XO/EX

DescriptionThanks to the stream of funding provided by the EuroHPC Joint Undertaking, Europe is ready to build its next-generation, world-class HPC and AI systems.

Its project portfolio offers solutions that pave the way towards global leadership and collaboration opportunities in the following complementary areas: DARE and EPI projects – HPC/AI processor development, NET4EXA – high-speed interconnects, and SEANERGYS – dynamic, energy-optimized operation.

Project representatives will showcase their results in order to identify related efforts and technologies overseas (U.S., Asia, Japan, India, other regions). The outcomes will be documented in an ETP4HPC white paper on international collaboration.

Workshop

WHPC - Chapter Talk

10:40am - 11:00am CST Monday, 17 November 2025 276

Partially Livestreamed

Partially Recorded

DescriptionIn this session, one of the WHPC Global Organization Chapters will be showcasing its activities for the community.

Workshop

WHPC - Career Pathways Talks

3:30pm - 4:30pm CST Monday, 17 November 2025 276

Partially Livestreamed

Partially Recorded

Workshop

WHPC - Close Up

4:30pm - 4:35pm CST Monday, 17 November 2025 276

Partially Livestreamed

Partially Recorded

Workshop

WHPC - Distinguished Speaker

9:05am - 10:00am CST Monday, 17 November 2025 276

Partially Livestreamed

Partially Recorded

Workshop

WHPC - Networking Breakout

2:45pm - 3:00pm CST Monday, 17 November 2025 276

Partially Livestreamed

Partially Recorded

Workshop

WHPC - SC First-timers Session

4:35pm - 5:30pm CST Monday, 17 November 2025 276

Not Livestreamed

Not Recorded

Partially Livestreamed

Partially Recorded

Workshop

WHPC - Troika Consulting Session

11:00am - 12:30pm CST Monday, 17 November 2025 276

Partially Livestreamed

Partially Recorded

DescriptionIn this session, we will invite our attendees to engage in a dynamic activity where they will share visions and strategies for building community in groups of three people. Using the Troika Consulting technique, the attendees will be guided to provide feedback to others' practical or imaginative questions within building community and networking strategies. At the end of this session, we hope our attendees will get different perspectives on possible problems they face daily and create connections.

Workshop

WHPC - Welcome

9:00am - 9:05am CST Monday, 17 November 2025 276

Partially Livestreamed

Partially Recorded

Workshop

WHPC - WHPC Distinguished Volunteering Recognition Presented to Cristin Merritt

10:30am - 10:40am CST Monday, 17 November 2025 276

Partially Livestreamed

Partially Recorded

Doctoral Showcase

Interactive Research e-Poster

Why Read It All? Just Read What Matters: A Path to Faster Scientific Visualization

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

Not Livestreamed

Not Recorded

DescriptionAs HPC simulations generate ever-larger datasets, reducing the volume of data that must be loaded into compute node memory for analysis has become essential for unlocking insights efficiently. In-storage analysis achieves this by processing data directly at the storage servers, allowing them to return only compact results that match regions of interest instead of raw datasets that may be orders of magnitude larger—significantly reducing data footprint.

This e-poster showcases a novel compute-near-storage architecture based on pNFS that enables secure in-storage analysis of scientific data with industry-standard software, including Arrow, Parquet, Substrait, and DuckDB. Using a real-world asteroid impact dataset, we present a live demonstration of a VTK visualization pipeline modified to offload analysis to pNFS data servers, tracing the aftermath of the impact over time and rendering the results on the client as 3D visuals. We show substantial data reduction by pushing down analysis and transmitting only insight-relevant information.

Research and ACM SRC Posters

WiCAT: Reducing Congestion at Wireless Interfaces in Heterogeneous Architectures

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionHeterogeneous architectures integrating CPUs, GPUs, and memory controllers generate diverse traffic patterns that stress the on-chip network. Wireless networks-on-chip (WNoCs) provide fast, single-hop communication across distant nodes. However, their effectiveness is limited by congestion at wireless interfaces (WIs). In this work, we present WiCAT (Wireless Collate and Transfer), a lightweight two-stage framework to mitigate WI bottlenecks without modifying CPUs, GPUs, or memory controllers. WiCAT introduces Collate, a WI-level collation scheme that reduces redundant requests and corresponding reply traffic, and Transfer, a predictive medium access control protocol that dynamically allocates channel time based on both current buffer occupancy and anticipated traffic. Evaluation on Rodinia benchmarks shows that WiCAT reduces average delay by 17.8%, increases network throughput by 64%, and lowers energy consumption by 13.5%.

Art of HPC

Wildfire

8:00am - 6:00pm CST Sunday, 16 November 2025 Art of HPC - Plaza Lobby

Art of HPC

Not Livestreamed

Not Recorded

TUT

XO/EX

DescriptionThe Sculpting Vis Collaborative developed software we call Artifact-Based Rendering, used here to visualize data from the HiGrad/Firetec wildfire model at LANL. We depart from traditional visualization methods by using statistical sampling to produce multiple point sets that represent fire and smoke. An evolving point set is produced by continuously emitting particles from the fire hot-spot on the ground and allowing them to be carried by the wind interpolated through time. The instantaneous velocity of the wind is represented by path lines seeded at two levels on the up-wind face of the simulation space as well from the hot-spot on the ground. This results in several sets of points and of lines that exist in the same space, but which represent different aspects of the data. ABR enables us to use artist-made glyphs, lines and textures to differentiate between these different geometrical sets, helping the viewer to understand the complex structure of the data.

Art of HPC

Within the Marble

8:00am - 6:00pm CST Sunday, 16 November 2025 Art of HPC - Plaza Lobby

Art of HPC

Not Livestreamed

Not Recorded

TUT

XO/EX

DescriptionA look within the blue marble we call home. HPC simulation data from geophysics shows processes under the Earth's mantle that are used for fundamental research on our planet.

Workshop

WIUHPC Introduction

8:31am - 8:40am CST Friday, 21 November 2025 260-267

Livestreamed

Recorded

DescriptionInteractivity enables the exploitation of HPC in new and revolutionary ways, delivering many new and exciting opportunities for our community. Interactive HPC involves users being in the loop during job execution where a human is monitoring a job, steering the experiment, or visualizing results to make immediate decisions about the results to influence the current or subsequent interactive jobs. Likewise, urgent computing combines interactive computational modeling with time sensitive systems in the real world such as the near real time analysisdetection of unfolding disasters to informtake real-time decisionmakingactions. Supporting interactive and urgent workloads on HPC requires expertise in a wide range of areas and the solving of numerous technical and organizational challenges.

This workshop brings together stakeholders, researchers and practitioners from across interactive and urgent computing within the wider HPC community. We will share success stories, case studies and technologies to continue community building around leveraging interactive HPC as an important tool for scientific research, responding to disasters and addressing societal issues.

Research and ACM SRC Posters

WONDERS: Integrating WOW, PONDER, and SCALE for Enhanced Scheduling Performance

8:00am - 5:00pm CST Tuesday, 18 November 2025 Second Floor Atrium

Research & ACM SRC Posters

DescriptionRecent scientific workflow management systems, such as Nextflow, put a huge focus on portability of workflows. Portability encompasses replacing both the target infrastructure and the input dataset. The more portable systems become, the more the importance of automatic adaptation and optimization increases. Strategies to optimize the execution of scientific workflows are often evaluated in simulation, and only for the individual strategy. Accordingly, it is unclear how different strategies affect each other.

In this work, we fuse three strategies to optimize workflow execution. First, WOW: an approach that focuses on location-aware scheduling. Second, PONDER: an approach that predicts task memory consumption and sizes the tasks accordingly. And third, SCALE: an approach to predict task CPU and size it accordingly.

We test all three approaches together and investigate their synergies. Our results show that the whole is greater than the sum of its parts. We achieve makespan reductions of up to 67.4%.

Birds of a Feather

Workflows Community: Bridging Intelligent Workflows with Quantum and HPC for Scientific Discovery

5:15pm - 6:45pm CST Tuesday, 18 November 2025 274

Community Meetings

Livestreamed

Recorded

XO/EX

DescriptionThis BoF will convene the workflows community to discuss emerging directions in scientific workflow execution, including agentic workflows, integration of high performance and quantum computing workflows, and coordinated allocation and scheduling across experimental and computing facilities. A central focus will be on ensuring end-to-end resource availability when workflows depend on limited instrument time and distributed infrastructure. The session will also address the need for infrastructure and policy reforms to support intelligent, cross-facility execution. Through interactive discussions, participants will explore collaborative strategies to enable resilient, scalable, and adaptive workflows that meet the evolving demands of scientific discovery.

Flash Session

Workflows, Not Workloads: The Future of HPC, AI, and Big Data

3:30pm - 3:45pm CST Tuesday, 18 November 2025 Booth 2638 - Flash Session

Not Livestreamed

Not Recorded

XO/EX

DescriptionHigh performance computing (HPC), artificial intelligence (AI), and data analytics are converging in ways that are reshaping research, industry, and government operations. For decades, organizations optimized individual workloads—tuning a simulation, refining a model, or scaling a job. However, as complexity increases, this siloed approach creates bottlenecks, compliance risks, and escalating costs. The real breakthrough comes not from making single workloads faster, but from orchestrating workflows that span the entire value chain.

This flash session explores the transition from the “age of workloads” to the “age of workflows.” Drawing parallels to Henry Ford’s assembly line, we will highlight how workflow thinking accelerates insight, ensures reproducibility, and enables cost control across hybrid and multi-cloud environments. Topics include:

• Workflow orchestration in HPC/AI pipelines using tools such as NextFlow, Snakemake, WDL, Cromwell, and Airflow
• Hybrid flexibility: balancing on-premises HPC with cloud bursting for agility and scale
• Compliance and reproducibility: structured, auditable pipelines that reduce risk in regulated domains
• Knowledge preservation: encoding expert practices into workflows to mitigate the “Silver Tsunami” of retiring talent
• Case studies: accelerating drug discovery, optimizing energy production, and advancing automotive materials design

Attendees will leave with a clear understanding of how workflow optimization transforms HPC from a collection of isolated jobs into a strategic engine for innovation. The imperative is clear: organizations that shift to workflow-driven thinking will accelerate their time-to-market, reduce costs, and build a sustainable competitive advantage in the era of data-intensive science and AI.

Paper

Workload Intelligence: Workload-Aware IaaS Abstraction for Cloud Efficiency

3:30pm - 3:52pm CST Thursday, 20 November 2025 275

System Software and Cloud Computing

Livestreamed

Recorded

DescriptionToday, cloud workloads are largely opaque to the cloud platform. Typically, the only information the platform receives is the virtual machine (VM) type and possibly a decoration to the type (e.g., the VM is evictable). Similarly, workloads receive minimal information from the platform; generally, only telemetry from their VMs or occasional signals (e.g., just before a VM is evicted). The narrow interface between workloads and platforms has several drawbacks: (1) a surge in VM types and decorations in public cloud platforms complicates customer selection; (2) key workload characteristics (e.g., low availability requirements) are often unspecified, hindering platform customization for optimized resource usage and cost savings; and (3) workloads may be unaware of potential optimizations or lack sufficient time to react to platform events. To resolve these issues and improve cloud efficiency, we propose Workload Sage, a framework for enabling dynamic bi-directional communication between cloud workloads and cloud platform.

Workshop

Workshop Welcome and Overview

9:01am - 9:10am CST Sunday, 16 November 2025 241

Livestreamed

Recorded

DescriptionWorkshop organizers will provide an overview of the objectives and program as well as the international Trillion Parameter Consortium and opportunities to engage.

Workshop

Wrap up and closing remarks

12:25pm - 12:30pm CST Monday, 17 November 2025 241

Livestreamed

Recorded

Tutorial

Write Highly Parallel, Vendor-Neutral Applications Using C++ and SYCL

8:30am - 12:00pm CST Sunday, 16 November 2025 122

Livestreamed

Recorded

TUT

DescriptionSYCL is a programming model that lets developers support a wide variety of devices (CPUs, GPUs, and more) from a single code base. Given the growing heterogeneity of processor roadmaps in both HPC and AI, moving to an open standard, platform-independent model such as SYCL is essential for modern software developers. SYCL has the further advantage of supporting a single-source style of programming using completely standard C++. In this tutorial, we will introduce SYCL and provide programmers with a solid foundation they can build on to gain mastery of this language. The main benefit of using SYCL over other heterogeneous programming models is the single programming language approach, which enables one to target multiple devices using the same programming model, and therefore to have a cleaner, portable, and more readable code. This is a hands-on tutorial. The real learning will happen as students write code. The format will be short presentations followed by hands-on exercises. Hence, attendees will require their own laptop to perform the hands-on exercises.

Paper

X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms

1:30pm - 1:52pm CST Wednesday, 19 November 2025 261-262-265-266

BSP

HPC for Machine Learning

Performance Measurement, Modeling, & Tools

Livestreamed

Recorded

DescriptionEmerging expert-specialized Mixture-of-Experts (MoE) architectures, such as DeepSeek-MoE, deliver strong model quality through fine-grained expert segmentation and large top-k routing. However, their scalability is limited by substantial activation memory overhead and costly all-to-all communication. Furthermore, current MoE training systems—primarily optimized for NVIDIA GPUs—perform suboptimally on non-NVIDIA platforms, leaving significant computational potential untapped.

In this work, we present X-MoE, a novel MoE training system designed to deliver scalable training performance for next-generation MoE architectures. X-MoE achieves this via several novel techniques, including efficient padding-free MoE training with cross-platform kernels, redundancy-bypassing dispatch, and hybrid parallelism with sequence-sharded MoE blocks. Our evaluation on the Frontier supercomputer, powered by AMD MI250X GPUs, shows that X-MoE scales DeepSeek-style MoEs up to 545 billion parameters across 1,024 GPUs—10x larger than the largest trainable model with existing methods under the same hardware budget, while maintaining high training throughput.

Workshop

X-ray Ptychography at the Edge: Towards Real-Time Feedback for High-Speed Nanoimaging

10:55am - 11:20am CST Sunday, 16 November 2025 240

Livestreamed

Recorded

DescriptionX-ray ptychography is becoming an indispensable tool for nanoscale imaging, driving innovation in functional materials, electronics, life sciences, etc. To retrieve sample images, the technique relies on advanced mathematical algorithms, making it computationally intensive. Recent advances in data acquisition have greatly increased the data generation rate, making it challenging to perform reconstruction in a timely manner to support decision making during an experiment. Here, we demonstrate how efficient GPU-based iterative reconstruction algorithms, deployed at the edge, enable real-time feedback during high-speed continuous data acquisition, allowing for a more informed experiment execution and thus increasing the quality and efficiency of the measurements. These developments represent a steppingstone towards augmentation of computationally intensive experiments with data-driven decision making, paving the way for autonomous experiments performed at machine speeds.

Paper

XaaS Containers: Performance-Portable Representation with Source and IR Containers

3:30pm - 3:52pm CST Tuesday, 18 November 2025 275

State of the Practice

System Software and Cloud Computing

Livestreamed

Recorded

DescriptionHPC systems and cloud data centers are converging, and containers are becoming the default software deployment method. While containers simplify software management, they face significant performance challenges: they must sacrifice hardware-specific optimizations to achieve portability. Although HPC containers can use runtime hooks to access optimized libraries and devices, they are limited by ABI compatibility and cannot reverse the effects of early-stage compilation decisions. XaaS containers proposed a vision of performance-portable containers, and we present a practical realization with Source and Intermediate Representation (IR) containers. We delay performance-critical decisions until the target system specification is known. We analyze specialization mechanisms in HPC software and propose a new LLM-assisted method for their automatic discovery. By examining the compilation pipeline, we develop a methodology to build containers optimized for target architectures at deployment time. Our prototype demonstrates that new XaaS containers combine the convenience of containerization with the performance benefits of system-specialized builds.

Workshop

xGFabric: Coupling Sensor Networks and HPC Facilities with Private 5G Wireless Networks for Real-Time Digital Agriculture

10:30am - 10:55am CST Sunday, 16 November 2025 240

Livestreamed

Recorded

DescriptionAdvanced scientific applications require coupling distributed sensor networks with centralized high-performance computing facilities. Citrus Under Protective Screening (CUPS) exemplifies this need in digital agriculture, where citrus research facilities are instrumented with numerous sensors monitoring environmental conditions and detecting protective screening damage. CUPS demands access to computational fluid dynamics codes for modeling environmental conditions and guiding real-time interventions like water application or robotic repairs. These computing domains have contrasting properties: sensor networks provide low-performance, limited-capacity, unreliable data access, while high-performance facilities offer enormous computing power through high-latency batch processing. Private 5G networks present novel capabilities addressing this challenge by providing low latency, high throughput, and reliability necessary for near-real-time coupling of edge sensor networks with HPC simulations. This work presents xGFabric, an end-to-end system coupling sensor networks with HPC facilities through Private 5G networks. The prototype connects remote sensors via 5G network slicing to HPC systems, enabling real-time digital agriculture simulation.

Workshop

XLOOP 2025: The 7th Annual Workshop on Extreme-Scale Experiment-in-the-Loop Computing

9:00am - 9:01am CST Sunday, 16 November 2025 240

Livestreamed

Recorded

DescriptionAdvancement in computational power and high-speed networking is enabling a new model of scientific experiment, experiment-in-the-loop computing (EILC). In this model, simulation and/or learning modules are run as data is collected from observational and experimental sources. Presently, the amount and complexity of data generated by simulations and by observational and experimental sources, such as sensor networks and large-scale scientific facilities, continues to increase. Several research challenges exist, many of which are independent of the scientific application domain. New algorithms, including artificial intelligence and machine learning algorithms, to merge simulation ensembles and experimental data sets must be developed. Data transfer techniques and workflows must be constructed to control the ensembles and integrate simulated and observed data sets. The Workshop on Extreme-Scale Experiment-in-the-Loop Computing (XLOOP 2025) will be a unique opportunity to promote this interdisciplinary topic area. We invite papers, presentations, and participants from the physical and computer sciences.

Workshop

XLOOP Awards Ceremony & Conclusion

2:45pm - 3:00pm CST Sunday, 16 November 2025 240

Livestreamed

Recorded

Workshop

XLOOP Panel

2:00pm - 2:45pm CST Sunday, 16 November 2025 240

Livestreamed

Recorded

Birds of a Feather

Zero Trust in HPC Centers

12:15pm - 1:15pm CST Wednesday, 19 November 2025 261-262-265-266

Practitioners in HPC

Security & Privacy

Livestreamed

Recorded

XO/EX

DescriptionThis session discusses the critical challenge of integrating Zero Trust (ZT) security into traditional supercomputing environments. The ZT model, based on a least-privilege, per-request architecture, has profound implications for HPC centers, application developers, and end-user workflows. We will explore the fundamentals of ZT, the purpose of NIST SP 800-207, and relevant U.S. Federal mandates. We will discuss current implementation approaches and challenges at major HPC centers. Join this interactive discussion to share your experiences, questions, and solutions.

Paper

Zero-Value Code Specialization via Profile-Guided Control Data Flow Analysis

11:37am - 12:00pm CST Tuesday, 18 November 2025 263-264

Performance Measurement, Modeling, & Tools

Livestreamed

Recorded

DescriptionZero-value propagation is a common phenomenon in modern programs, where redundant operations caused by zero values can severely impact performance. Since zero values are often generated dynamically at runtime, eliminating such redundancies through static analysis alone is challenging. In this paper, we propose an efficient static control data flow analysis algorithm to identify redundancies resulting from zero-value propagation. Based on this algorithm, we design and implement ZeroSpec, a fully automated profile-guided code optimizer that detects zero values at runtime and specializes fast paths for them. To maximize performance gains, ZeroSpec also employs a fine-grained cost model that evaluates the optimization potential of individual zero-value instructions to guide the construction of targeted optimization regions. Evaluation on SPEC CPU2017, NPB and real-world applications demonstrates the effectiveness of ZeroSpec, achieving a maximum performance speedup of 1.31x.

Sessions

Workshop

11th International Workshop on Heterogeneous High-Performance Reconfigurable Computing (H2RC 2025)

8:30am - 12:00pm CST Friday, 21 November 2025 263-264

Livestreamed

Recorded

Workshop

12th SC Workshop on Best Practices for HPC Training and Education

9:00am - 5:30pm CST Monday, 17 November 2025 232

Livestreamed

Recorded

Workshop

12th Workshop on Accelerator Programming and Directives (WACCPD 2025)

9:00am - 5:30pm CST Monday, 17 November 2025 266

Livestreamed

Recorded

Workshop

14th International Workshop on Runtime and Operating Systems for Supercomputers (ROSS 2025)

2:00pm - 5:30pm CST Sunday, 16 November 2025 265

Livestreamed

Recorded

Workshop

16th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems (ScalAH'25)

9:00am - 5:30pm CST Sunday, 16 November 2025 231

Livestreamed

Recorded

Workshop

1st Annual Workshop on Large-Scale Quantum-Classical Computing

9:00am - 5:30pm CST Sunday, 16 November 2025 275

Livestreamed

Recorded

Workshop

1st International Symposium on Artificial Intelligence and Extreme-Scale Workflows

8:30am - 12:00pm CST Friday, 21 November 2025 274

Livestreamed

Recorded

Workshop

2025 International Workshop on Performance, Portability, and Productivity in HPC (P3HPC)

9:00am - 5:30pm CST Monday, 17 November 2025 231

Performance Evaluation, Scalability, & Portability

Livestreamed

Recorded

Workshop

3rd International Workshop on HPC Testing and Evaluation of Systems, Tools, and Software (HPCTESTS 2025)

8:30am - 12:00pm CST Friday, 21 November 2025 276

Livestreamed

Recorded

Workshop

4th International Workshop on Cyber Security in High Performance Computing (S-HPC)

2:00pm - 5:30pm CST Sunday, 16 November 2025 242

Livestreamed

Recorded

Workshop

5th International Symposium on Quantitative Co-Design of Supercomputers

9:00am - 12:30pm CST Monday, 17 November 2025 240

Livestreamed

Recorded

Workshop

6th Combined Workshop on Interactive and Urgent High Performance Computing

8:30am - 12:00pm CST Friday, 21 November 2025 260-267

Livestreamed

Recorded

Workshop

6th Workshop on Heterogeneity and Memory Systems (HMEM)

2:00pm - 5:30pm CST Monday, 17 November 2025 240

Livestreamed

Recorded

Workshop

7th International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC (CANOPIE-HPC)

9:00am - 5:30pm CST Monday, 17 November 2025 275

Livestreamed

Recorded

Workshop

7th International Workshop on Emerging Parallel Distributed Runtime Systems and Middleware

8:30am - 12:00pm CST Friday, 21 November 2025 261-262-265-266

Livestreamed

Recorded

Workshop

7th Workshop on Programming and Performance Visualization Tools (ProTools)

2:00pm - 5:30pm CST Monday, 17 November 2025 241

Livestreamed

Recorded

Workshop

9th International Workshop on Software Correctness for HPC Applications (Correctness '25)

9:00am - 5:30pm CST Monday, 17 November 2025 261

Livestreamed

Recorded

Awards and Award Talks

ACM and IEEE-CS Award Presentations

8:30am - 10:00am CST Wednesday, 19 November 2025 America's Ballroom Tu-Th

Livestreamed

Recorded

ACM Gordon Bell Climate Modeling Finalist

ACM Gordon Bell Finalist

Awards and Award Talks

ACM Gordon Bell Finalists Presentations

1:30pm - 3:00pm CST Tuesday, 18 November 2025 261-262-265-266

Applications

Livestreamed

Recorded

ACM Gordon Bell Finalist

Awards and Award Talks

ACM Gordon Bell Finalists Presentations

10:30am - 12:00pm CST Tuesday, 18 November 2025 261-262-265-266

Applications

Livestreamed

Recorded

ACM Gordon Bell Climate Modeling Finalist

Awards and Award Talks

ACM Gordon Bell Finalists Presentations

3:30pm - 5:00pm CST Tuesday, 18 November 2025 261-262-265-266

Applications

Livestreamed

Recorded

Community Engagement and Support

Advanced Computing Student Collaborative (ACSC@SC25)

3:30pm - 4:15pm CST Wednesday, 19 November 2025 Booth 3537 - SCinet Theater

Not Livestreamed

Not Recorded

TUT

XO/EX

Community Engagement and Support

Advanced Computing Student Collaborative (ACSC@SC25)

8:30am - 12:00pm CST Thursday, 20 November 2025 125

Not Livestreamed

Not Recorded

TUT

XO/EX

Workshop

AI4S: 6th Workshop on Artificial Intelligence and Machine Learning for Scientific Applications

9:00am - 5:30pm CST Monday, 17 November 2025 274

Livestreamed

Recorded

Paper

Algorithms: Matching System Capabilities

3:30pm - 5:00pm CST Thursday, 20 November 2025 260-267

Algorithms

Livestreamed

Recorded

Paper

Algorithms: Matrix Multiplication and GEMM Optimization

10:30am - 12:00pm CST Thursday, 20 November 2025 260-267

Algorithms

HPC for Machine Learning

Programming Frameworks

Livestreamed

Recorded

Paper

Algorithms: Other Matrix and Tensor Methods

1:30pm - 3:00pm CST Thursday, 20 November 2025 260-267

Algorithms

Livestreamed

Recorded

Paper

Algorithms: Sparse Matrix and Tensor Computation

3:30pm - 5:00pm CST Wednesday, 19 November 2025 260-267

Algorithms

Livestreamed

Recorded

Paper

Anomaly Detection, Failure Management, and Resilience 1

1:30pm - 3:00pm CST Wednesday, 19 November 2025 263-264

Algorithms

HPC for Machine Learning

Performance Measurement, Modeling, & Tools

State of the Practice

Livestreamed

Recorded

Paper

Anomaly Detection, Failure Management, and Resilience 2

3:30pm - 5:00pm CST Wednesday, 19 November 2025 263-264

Applications

Performance Measurement, Modeling, & Tools

State of the Practice

Livestreamed

Recorded

Invited Talk

Applications

8:40am - 10:00am CST Thursday, 20 November 2025 America's Ballroom Tu-Th

Livestreamed

Recorded

Paper

Applications: Atomistic Modeling

10:30am - 12:00pm CST Thursday, 20 November 2025 263-264

Applications

Livestreamed

Recorded

Paper

Applications: Biological Modeling

3:30pm - 5:00pm CST Thursday, 20 November 2025 263-264

Applications

Livestreamed

Recorded

Paper

Applications: Large-Scale Scientific Simulation

1:30pm - 3:00pm CST Thursday, 20 November 2025 263-264

Applications

GBC

Livestreamed

Recorded

Paper

Architectures and Networks: Hashing, Indexing, and Nearest Neighbor Search

10:30am - 12:00pm CST Wednesday, 19 November 2025 260-267

Applications

Architectures & Networks

Livestreamed

Recorded

Paper

Architectures and Networks: Networking

1:30pm - 3:00pm CST Wednesday, 19 November 2025 260-267

Architectures & Networks

Livestreamed

Recorded

Art of HPC

8:00am - 6:00pm CST Sunday, 16 November 2025 Art of HPC - Plaza Lobby

Art of HPC

Not Livestreamed

Not Recorded

TUT

XO/EX

Art of HPC

8:00am - 6:00pm CST Thursday, 20 November 2025 Art of HPC - Plaza Lobby

Art of HPC

Not Livestreamed

Not Recorded

TUT

XO/EX

Art of HPC

8:00am - 12:00pm CST Friday, 21 November 2025 Art of HPC - Plaza Lobby

Art of HPC

Not Livestreamed

Not Recorded

TUT

XO/EX

Art of HPC

8:00am - 7:00pm CST Tuesday, 18 November 2025 Art of HPC - Plaza Lobby

Art of HPC

Not Livestreamed

Not Recorded

TUT

XO/EX

Art of HPC

8:00am - 9:00pm CST Monday, 17 November 2025 Art of HPC - Plaza Lobby

Art of HPC

Not Livestreamed

Not Recorded

TUT

XO/EX

Art of HPC