The NAMD User's Guide describes how to run and use the various features of the molecular dynamics program NAMD. This guide includes the capabilities of the program, how to use these capabilities, the necessary input files and formats, and how to run the program both on uniprocessor machines and in parallel.
2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012
Gemini, the network for the new Cray XE/XK systems, features low latency, high bandwidth and strong scalability. Its hardware support for remote direct memory access enables efficient implementation of global address space programming languages. Although the user Generic Network Interface (uGNI) provides a low-level interface for Gemini with support for the message-passing programming model (MPI), it remains challenging to port alternative programming models with scalable performance.
18th International Parallel and Distributed Processing Symposium, 2004. Proceedings., 2004
We present a performance modeling and programming environment for petaflops computers and the Blue Gene machine. It consists of a parallel simulator, BigSim, for predicting performance of machines with a very large number of processors, and BigNetSim, an ongoing effort to incorporate a pluggable module of a detailed contention-based network model. It provides the ability to make performance predictions for machines such as BlueGene/L. We also explore the programming environments for several planned applications on the machines including Finite Element Method (FEM) simulation.
Workshop on Principles of Advanced and Distributed Simulation (PADS'05), 2005
Parallel discrete event simulation (PDES) of models with fine-grained computation remains a challenging problem. We explore the usage of POSE, our Parallel Object-oriented Simulation Environment, for application performance prediction on large parallel machines such ...
2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935), 2004
As high performance clusters continue to grow in size, the mean time between failures shrinks. Thus, the issues of fault tolerance and reliability are becoming one of the challenging factors for application scalability. The traditional disk-based method of dealing with faults is to checkpoint the state of the entire application periodically to reliable storage and restart from the most recent checkpoint. The recovery of the application from faults involves (often manually) restarting applications on all processors and having it read the data from disks on all processors. The restart can therefore take minutes after it has been initiated. Such a strategy requires that the failed processor can be replaced so that the number of processors at checkpoint-time and recovery-time are the same. We present FTC-Charm++, a fault-tolerant runtime based on a scheme for fast and scalable in-memory checkpoint and restart. At restart, when there are no extra processors, the program can continue to run on the remaining processors while minimizing the performance penalty due to losing processors. The method is useful for applications whose memory footprint is small at the checkpoint state, while a variation of this scheme, in-disk checkpoint/restart, can be applied to applications with large memory footprint. The scheme does not require any individual component to be fault-free. We have implemented this scheme for Charm++ and AMPI (an adaptive version of MPI). This paper describes the scheme and shows performance data on a cluster using 128 processors.
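The double in-memory checkpointing idea behind FTC-Charm++ can be pictured in a few lines: each worker keeps a copy of its own state and a copy of a buddy's, so a lost worker's data can be rebuilt from memory rather than read back from disk. The sketch below is illustrative only, assuming a simple ring buddy mapping; `Worker`, `buddyOf`, and `recover` are hypothetical names, not the FTC-Charm++ or Charm++ API.

```cpp
// Illustrative sketch of buddy-based in-memory checkpointing (not the
// FTC-Charm++ API). Each worker keeps its own checkpoint plus a copy of
// its buddy's, so a lost worker's state can be rebuilt from memory
// without touching the file system.
#include <cstdio>
#include <vector>

struct Worker {
    int id = 0;
    std::vector<double> state;            // live working data
    std::vector<double> ownCheckpoint;    // local copy of own state
    std::vector<double> buddyCheckpoint;  // copy of the buddy's state
};

int buddyOf(int id, int n) { return (id + 1) % n; }  // simple ring buddy map

void checkpointAll(std::vector<Worker>& ws) {
    const int n = static_cast<int>(ws.size());
    for (auto& w : ws) w.ownCheckpoint = w.state;     // local copy
    for (auto& w : ws)                                // "send" a copy to the buddy
        ws[buddyOf(w.id, n)].buddyCheckpoint = w.ownCheckpoint;
}

// Restore a failed worker's state from the copy held by its buddy;
// here the data is simply rebuilt in place (no spare processor needed).
void recover(std::vector<Worker>& ws, int failedId) {
    const int n = static_cast<int>(ws.size());
    Worker& buddy = ws[buddyOf(failedId, n)];
    ws[failedId].state = buddy.buddyCheckpoint;       // rebuilt purely from memory
    std::printf("worker %d restored from buddy %d\n", failedId, buddy.id);
}

int main() {
    std::vector<Worker> ws(4);
    for (int i = 0; i < 4; ++i) { ws[i].id = i; ws[i].state = {double(i), double(i) + 0.5}; }
    checkpointAll(ws);
    ws[2].state.clear();   // simulate losing worker 2
    recover(ws, 2);
    return 0;
}
```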
2006 International Conference on Parallel Processing Workshops (ICPPW'06), 2006
Many important parallel applications require multiple flows of control to run on a single processor. In this paper, we present a study of four flow-of-control mechanisms: processes, kernel threads, user-level threads and event-driven objects. Through experiments, we demonstrate the practical performance and limitations of these techniques on a variety of platforms. We also examine migration of these flows of control, with a focus on thread migration, which is critical for application-independent dynamic load balancing in parallel computing applications. Thread migration, however, is challenging due to the complexity of both user and system state involved. In this paper, we present several techniques to support migratable threads and compare the performance of these techniques.
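As a rough illustration of the fourth mechanism above, event-driven objects, the following self-contained C++ sketch runs several objects within one flow of control: a scheduler loop delivers messages to handler methods that run to completion and never block the processor. The `Message`, `EventObject`, and scheduler names are hypothetical stand-ins, not the Charm++ runtime interface.

```cpp
// Minimal sketch of the event-driven-object style: many objects share one
// processor, and a scheduler drives them by delivering messages to handler
// methods. All names here are illustrative, not a real runtime API.
#include <cstdio>
#include <queue>
#include <vector>

struct Message { int target; int value; };

struct EventObject {
    int id = 0;
    int sum = 0;
    // The handler runs to completion; the object never blocks the processor.
    void onMessage(const Message& m, std::queue<Message>& out) {
        sum += m.value;
        if (m.value > 0)                       // forward a smaller piece of work
            out.push({(id + 1) % 3, m.value - 1});
    }
};

int main() {
    std::vector<EventObject> objs(3);
    for (int i = 0; i < 3; ++i) objs[i].id = i;

    std::queue<Message> q;
    q.push({0, 3});                            // seed one message

    while (!q.empty()) {                       // the scheduler loop
        Message m = q.front(); q.pop();
        objs[m.target].onMessage(m, q);
    }
    for (const auto& o : objs) std::printf("object %d sum=%d\n", o.id, o.sum);
    return 0;
}
```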
Many parallel applications require a large volume of transient memory to hold data from communication, therefore demonstrating a pattern of communication-induced memory usage fluctuation. Even though these applications' persistent working data might fit in physical memory, the transient peak memory usage could still lead to disk swapping or even out-of-memory error. In this paper, we present a solution to the above problems by runtime support for controlling the communication-induced memory fluctuation. The idea consists of imposing runtime flow control for large data transfers and thus controlling the peak transient memory consumed by communication. We explore the idea with both send-based and fetch-based low-level communication primitives. We develop a runtime support based on the Charm++ integrated runtime environment. We test this runtime system with a set of real applications and show considerable performance improvements.
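A minimal sketch of the send-side flow-control idea, assuming a fixed budget for bytes in flight: large transfers that would exceed the budget are queued and released as earlier transfers complete. `FlowControlledSender` and the 64 MB cap are illustrative assumptions, not the Charm++ runtime's actual interface.

```cpp
// Sketch of send-side flow control for large transfers: cap the bytes in
// flight and delay further large sends until earlier ones complete.
#include <cstddef>
#include <cstdio>
#include <deque>

class FlowControlledSender {
    std::size_t inFlight_ = 0;
    std::size_t cap_;
    std::deque<std::size_t> pending_;      // sizes of delayed transfers
public:
    explicit FlowControlledSender(std::size_t capBytes) : cap_(capBytes) {}

    void send(std::size_t bytes) {
        if (inFlight_ + bytes > cap_) {    // would exceed the transient-memory budget
            pending_.push_back(bytes);
            std::printf("queued %zu B (in flight: %zu B)\n", bytes, inFlight_);
        } else {
            inFlight_ += bytes;            // transfer starts immediately
            std::printf("sent   %zu B (in flight: %zu B)\n", bytes, inFlight_);
        }
    }

    void onComplete(std::size_t bytes) {   // called when a transfer finishes
        inFlight_ -= bytes;
        while (!pending_.empty() && inFlight_ + pending_.front() <= cap_) {
            inFlight_ += pending_.front(); // release queued transfers that now fit
            std::printf("released queued %zu B\n", pending_.front());
            pending_.pop_front();
        }
    }
};

int main() {
    FlowControlledSender s(64u << 20);     // hypothetical 64 MB budget for in-flight data
    s.send(40u << 20);
    s.send(40u << 20);                     // exceeds the cap, so it is queued
    s.onComplete(40u << 20);               // first transfer done; the queue drains
    return 0;
}
```

A fetch-based variant would put the same budget on the receiver, which pulls large data only when it has room; the abstract explores both directions.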
Automatic, adaptive load balancing is essential for handling load imbalance that may occur during parallel finite element simulations involving mesh adaptivity, nonlinear material behavior and other localized effects. This paper demonstrates the successful application of a measurement-based dynamic load balancing concept to the finite element analysis of elasto-plastic wave propagation and dynamic fracture events. The simulations are performed with the aid of a parallel framework for unstructured meshes called ParFUM, which is based on Charm++ and Adaptive MPI (AMPI) and involves migratable user-level threads. The performance was analyzed using Projections, a performance analysis and post factum visualization tool. The bottlenecks to scalability are identified and eliminated using a variety of strategies resulting in performance gains ranging from moderate to highly significant.
Adaptive MPI is an implementation of the Message Passing Interface (MPI) standard. AMPI benefits MPI programs with features such as dynamic load balancing, virtualization, and checkpointing. AMPI runs each MPI process in a user-level thread, which causes problems when an MPI program has global variables. Manually removing the global variables in the program is tedious and error-prone. In this paper, we present a tool that automates this task with a source-to-source transformation that supports Fortran. We evaluate our tool on a real-world large-scale FLASH code and present preliminary results of running FLASH on AMPI. Our results demonstrate that the tool makes it easier to use AMPI.
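The privatization performed by such a tool can be pictured as moving each global into per-rank state that is passed explicitly, so that many user-level threads can safely share one process. The tool described above is a Fortran source-to-source transformation; the fragment below is only a C++ analogue of the idea, with hypothetical names, not the tool's actual output.

```cpp
// Illustration of the "privatize globals" idea, in C++ rather than the
// Fortran the tool actually handles. The commented-out version would be
// shared by every user-level thread in a process; the transformed version
// carries the former global in per-rank state passed explicitly.
#include <cstdio>

// --- before: a global, unsafe once several MPI ranks share one process ---
// int iterationCount = 0;
// void step() { ++iterationCount; }

// --- after: the global is packed into per-rank state ---
struct RankState {        // one instance per (virtualized) MPI rank
    int iterationCount = 0;
};

void step(RankState& st) { ++st.iterationCount; }

int main() {
    RankState rank0, rank1;        // two virtual ranks co-located safely
    step(rank0);
    step(rank0);
    step(rank1);
    std::printf("rank0=%d rank1=%d\n", rank0.iterationCount, rank1.iterationCount);
    return 0;
}
```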
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '06, 2006
Processor virtualization via migratable objects is a powerful technique that enables the runtime system to carry out intelligent adaptive optimizations like dynamic resource management. CHARM++ is an early language/system that supports migratable objects. This paper describes Adaptive MPI (or AMPI), an MPI implementation and extension, that supports processor virtualization. AMPI implements virtual MPI processes (VPs), several of which may be mapped to a single physical processor. AMPI includes a powerful runtime support system that takes advantage of the degree of freedom afforded by allowing it to assign VPs onto processors. With this runtime system, AMPI supports such features as automatic adaptive overlapping of communication and computation, automatic load balancing, flexibility of running on an arbitrary number of processors, and checkpoint/restart support. It also inherits communication optimization from the CHARM++ framework. This paper describes AMPI, illustrates its performance benefits through a series of benchmarks, and shows that AMPI is a portable and mature MPI implementation that offers various performance benefits to dynamic applications.
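Because AMPI implements the standard MPI interface, an ordinary MPI program such as the sketch below needs no source changes to run with many virtual processes per physical processor; the virtualization is requested at launch time (via charmrun's processor and virtual-process options, as described in the AMPI manual) rather than in the code. The program itself is plain MPI and is shown only to make that point.

```cpp
// A plain MPI program: nothing AMPI-specific appears in the source. Under
// AMPI each rank runs as a migratable user-level thread, so more ranks
// (virtual processes) than physical cores can be requested on the launch
// line; consult the AMPI manual for the exact charmrun options.
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // rank of this (possibly virtual) process
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // total ranks, not physical cores

    // A trivial reduction; on AMPI the runtime may overlap this communication
    // with other virtual processes' computation on the same core.
    int local = rank, total = 0;
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("sum of ranks 0..%d = %d\n", size - 1, total);

    MPI_Finalize();
    return 0;
}
```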
NAMD is a fully featured, production molecular dynamics program for high performance simulation of large biomolecular systems. We have previously, at SC2000, presented scaling results for simulations with cutoff electrostatics on up to 2048 processors of the ASCI Red machine, achieved with an object-based hybrid force and spatial decomposition scheme and an aggressive measurement-based predictive load balancing framework. We extend this work by demonstrating similar scaling on the much faster processors of the PSC Lemieux Alpha cluster, and for simulations employing efficient (order N log N) particle mesh Ewald full electrostatics. This unprecedented scalability in a biomolecular simulation code has been attained through latency tolerance, adaptation to multiprocessor nodes, and the direct use of the Quadrics Elan library in place of MPI by the Charm++/Converse parallel runtime system.
ChaNGa is an N-body cosmology simulation application implemented using Charm++. In this paper, we present the parallel design of ChaNGa and address many challenges arising due to the high dynamic ranges of clustered datasets. We propose optimizations based on adaptive techniques. We evaluate the performance of ChaNGa on highly clustered datasets: a snapshot of a 2 billion particle realization of a 25 Mpc volume, and a 52 million particle multi-resolution realization of a dwarf galaxy. For the 25 Mpc volume, we show strong scaling on up to 128 K cores of Blue Waters. We also demonstrate scaling up to 128 K cores of a multi-stepping run of the 2 billion particle simulation. While the scaling of the multi-stepping run is not as good as single stepping, the throughput at 128 K cores is greater by a factor of 2. We also demonstrate strong scaling on up to 512 K cores of Blue Waters for two large, uniform datasets with 12 and 24 billion particles.
NAMD is a portable parallel application for biomolecular simulations. NAMD pioneered the use of hybrid spatial and force decomposition, a technique used by most scalable programs for biomolecular simulations, including Blue Matter and Desmond which were described at Supercomputing 2006. This paper describes parallel techniques and optimizations developed to enhance NAMD's scalability, to exploit recent large parallel machines. NAMD is developed using Charm++ and benefits from its adaptive communication-computation overlap and dynamic load balancing, as demonstrated in this paper. We describe some recent optimizations including: pencil decomposition of the Particle Mesh Ewald method, reduction of memory footprint, and topology sensitive load balancing. Unlike most other MD programs, NAMD not only runs on a wide variety of platforms ranging from commodity clusters to supercomputers, but also scales to thousands of processors. We present results for up to 32,000 processors on machi...
International Journal of Parallel Programming, 2005
Excerpt: "... recipient of a 2002 Gordon Bell Award, is a production-quality parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems ..." (from "Simulation-Based Performance Prediction for Large Parallel Machines").
Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis - SC '11, 2011
A 100-million-atom biomolecular simulation with NAMD is one of the three benchmarks for the NSF-funded sustainable petascale machine. Simulating this large molecular system on a petascale machine presents great challenges, including handling I/O, large memory footprint and getting good strong-scaling results. In this paper, we present parallel I/O techniques to enable the simulation. A new SMP model is designed to efficiently utilize ubiquitous wide multicore clusters by extending the CHARM++ asynchronous message-driven runtime. We exploit node-aware techniques to optimize both the application and the underlying SMP runtime. Hierarchical load balancing is further exploited to scale NAMD to the full Jaguar PF Cray XT5 (224,076 cores) at Oak Ridge National Laboratory, both with and without PME full electrostatics, achieving 93% parallel efficiency (vs 6720 cores) at 9 ms per step for a simple cutoff calculation. Excellent scaling is also obtained on 65,536 cores of the Intrepid Blue Gene/P at Argonne National Laboratory.
Proceedings of the 8th Workshop on Parallel and Distributed Systems Testing, Analysis, and Debugging - PADTAD '10, 2010
With the advent of increasingly larger parallel machines, debugging is becoming more and more challenging. In particular, applications at this scale tend to behave non-deterministically, leading to race condition bugs. Furthermore, gaining access to these large machines for long debugging sessions is generally infeasible. In this paper, we present a 3-step algorithm to perform what we call "processor extraction": a ...
Proceedings of the 2010 TeraGrid Conference - TG '10, 2010
Excerpt: "... in Figure 9 (lines) using the same kNeighbor benchmark running on an 8-core multi-core desktop ... Intuitively, mapping communicating threads to closer cores in the memory hierarchy ..." (a table comparing kNeighbor timings with and without CPU affinity is omitted).
Papers by Gengbin Zheng