The NAMD User's Guide describes how to run and use the various features of the molecular dynamics program NAMD. This guide includes the capabilities of the program, how to use these capabilities, the necessary input files and formats, and how to run the program both on uniprocessor machines and in parallel.
2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012
Gemini, the network for the new Cray XE/XK systems, features low latency, high bandwidth and strong scalability. Its hardware support for remote direct memory access enables efficient implementation of global address space programming languages. Although the user Generic Network Interface (uGNI) provides a low-level interface for Gemini with support for the message-passing programming model (MPI), it remains challenging to port alternative programming models with scalable performance.
18th International Parallel and Distributed Processing Symposium, 2004. Proceedings., 2004
We present a performance modeling and programming environment for petaflops computers and the Blue Gene machine. It consists of a parallel simulator, BigSim, for predicting performance of machines with a very large number of processors, and BigNetSim, an ongoing effort to incorporate a pluggable module of a detailed contention-based network model. It provides the ability to make performance predictions for machines such as BlueGene/L. We also explore the programming environments for several planned applications on the machines including Finite Element Method (FEM) simulation.
Workshop on Principles of Advanced and Distributed Simulation (PADS'05), 2005
Parallel discrete event simulation (PDES) of models with fine-grained computation remains a challenging problem. We explore the usage of POSE, our Parallel Object-oriented Simulation Environment, for application performance prediction on large parallel machines such ...
2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935), 2004
As high performance clusters continue to grow in size, the mean time between failures shrinks. Thus, the issues of fault tolerance and reliability are becoming one of the challenging factors for application scalability. The traditional disk-based method of dealing with faults is to checkpoint the state of the entire application periodically to reliable storage and restart from the most recent checkpoint. The recovery of the application from faults involves (often manually) restarting applications on all processors and having it read the data from disks on all processors. The restart can therefore take minutes after it has been initiated. Such a strategy requires that the failed processor can be replaced so that the number of processors at checkpoint-time and recovery-time are the same. We present FTC-Charm++, a fault-tolerant runtime based on a scheme for fast and scalable in-memory checkpoint and restart. At restart, when there are no extra processors, the program can continue to run on the remaining processors while minimizing the performance penalty due to losing processors. The method is useful for applications whose memory footprint is small at the checkpoint state, while a variation of this scheme, in-disk checkpoint/restart, can be applied to applications with large memory footprint. The scheme does not require any individual component to be fault-free. We have implemented this scheme for Charm++ and AMPI (an adaptive version of MPI). This paper describes the scheme and shows performance data on a cluster using 128 processors.
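The double in-memory checkpointing idea behind FTC-Charm++ can be pictured in a few lines: each worker keeps a copy of its own state and a copy of a buddy's, so a lost worker's data can be rebuilt from memory rather than read back from disk. The sketch below is illustrative only, assuming a simple ring buddy mapping; `Worker`, `buddyOf`, and `recover` are hypothetical names, not the FTC-Charm++ or Charm++ API.

```cpp
// Illustrative sketch of buddy-based in-memory checkpointing (not the
// FTC-Charm++ API). Each worker keeps its own checkpoint plus a copy of
// its buddy's, so a lost worker's state can be rebuilt from memory
// without touching the file system.
#include <cstdio>
#include <vector>

struct Worker {
    int id = 0;
    std::vector<double> state;            // live working data
    std::vector<double> ownCheckpoint;    // local copy of own state
    std::vector<double> buddyCheckpoint;  // copy of the buddy's state
};

int buddyOf(int id, int n) { return (id + 1) % n; }  // simple ring buddy map

void checkpointAll(std::vector<Worker>& ws) {
    const int n = static_cast<int>(ws.size());
    for (auto& w : ws) w.ownCheckpoint = w.state;     // local copy
    for (auto& w : ws)                                // "send" a copy to the buddy
        ws[buddyOf(w.id, n)].buddyCheckpoint = w.ownCheckpoint;
}

// Restore a failed worker's state from the copy held by its buddy;
// here the data is simply rebuilt in place (no spare processor needed).
void recover(std::vector<Worker>& ws, int failedId) {
    const int n = static_cast<int>(ws.size());
    Worker& buddy = ws[buddyOf(failedId, n)];
    ws[failedId].state = buddy.buddyCheckpoint;       // rebuilt purely from memory
    std::printf("worker %d restored from buddy %d\n", failedId, buddy.id);
}

int main() {
    std::vector<Worker> ws(4);
    for (int i = 0; i < 4; ++i) { ws[i].id = i; ws[i].state = {double(i), double(i) + 0.5}; }
    checkpointAll(ws);
    ws[2].state.clear();   // simulate losing worker 2
    recover(ws, 2);
    return 0;
}
```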
2006 International Conference on Parallel Processing Workshops (ICPPW'06), 2006
Many important parallel applications require multiple flows of control to run on a single processor. In this paper, we present a study of four flow-of-control mechanisms: processes, kernel threads, user-level threads and event-driven objects. Through experiments, we demonstrate the practical performance and limitations of these techniques on a variety of platforms. We also examine migration of these flows of control, with a focus on thread migration, which is critical for application-independent dynamic load balancing in parallel computing applications. Thread migration, however, is challenging due to the complexity of both user and system state involved. In this paper, we present several techniques to support migratable threads and compare the performance of these techniques.
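As a rough illustration of the fourth mechanism above, event-driven objects, the following self-contained C++ sketch runs several objects within one flow of control: a scheduler loop delivers messages to handler methods that run to completion and never block the processor. The `Message`, `EventObject`, and scheduler names are hypothetical stand-ins, not the Charm++ runtime interface.

```cpp
// Minimal sketch of the event-driven-object style: many objects share one
// processor, and a scheduler drives them by delivering messages to handler
// methods. All names here are illustrative, not a real runtime API.
#include <cstdio>
#include <queue>
#include <vector>

struct Message { int target; int value; };

struct EventObject {
    int id = 0;
    int sum = 0;
    // The handler runs to completion; the object never blocks the processor.
    void onMessage(const Message& m, std::queue<Message>& out) {
        sum += m.value;
        if (m.value > 0)                       // forward a smaller piece of work
            out.push({(id + 1) % 3, m.value - 1});
    }
};

int main() {
    std::vector<EventObject> objs(3);
    for (int i = 0; i < 3; ++i) objs[i].id = i;

    std::queue<Message> q;
    q.push({0, 3});                            // seed one message

    while (!q.empty()) {                       // the scheduler loop
        Message m = q.front(); q.pop();
        objs[m.target].onMessage(m, q);
    }
    for (const auto& o : objs) std::printf("object %d sum=%d\n", o.id, o.sum);
    return 0;
}
```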
Many parallel applications require a large volume of transient memory to hold data from communication, therefore demonstrating a pattern of communication-induced memory usage fluctuation. Even though these applications' persistent working data might fit in physical memory, the transient peak memory usage could still lead to disk swapping or even out-of-memory error. In this paper, we present a solution to the above problems by runtime support for controlling the communication-induced memory fluctuation. The idea consists of imposing runtime flow control for large data transfers and thus controlling the peak transient memory consumed by communication. We explore the idea with both send-based and fetch-based low-level communication primitives. We develop a runtime support based on the Charm++ integrated runtime environment. We test this runtime system with a set of real applications and show considerable performance improvements.
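A minimal sketch of the send-side flow-control idea, assuming a fixed budget for bytes in flight: large transfers that would exceed the budget are queued and released as earlier transfers complete. `FlowControlledSender` and the 64 MB cap are illustrative assumptions, not the Charm++ runtime's actual interface.

```cpp
// Sketch of send-side flow control for large transfers: cap the bytes in
// flight and delay further large sends until earlier ones complete.
#include <cstddef>
#include <cstdio>
#include <deque>

class FlowControlledSender {
    std::size_t inFlight_ = 0;
    std::size_t cap_;
    std::deque<std::size_t> pending_;      // sizes of delayed transfers
public:
    explicit FlowControlledSender(std::size_t capBytes) : cap_(capBytes) {}

    void send(std::size_t bytes) {
        if (inFlight_ + bytes > cap_) {    // would exceed the transient-memory budget
            pending_.push_back(bytes);
            std::printf("queued %zu B (in flight: %zu B)\n", bytes, inFlight_);
        } else {
            inFlight_ += bytes;            // transfer starts immediately
            std::printf("sent   %zu B (in flight: %zu B)\n", bytes, inFlight_);
        }
    }

    void onComplete(std::size_t bytes) {   // called when a transfer finishes
        inFlight_ -= bytes;
        while (!pending_.empty() && inFlight_ + pending_.front() <= cap_) {
            inFlight_ += pending_.front(); // release queued transfers that now fit
            std::printf("released queued %zu B\n", pending_.front());
            pending_.pop_front();
        }
    }
};

int main() {
    FlowControlledSender s(64u << 20);     // hypothetical 64 MB budget for in-flight data
    s.send(40u << 20);
    s.send(40u << 20);                     // exceeds the cap, so it is queued
    s.onComplete(40u << 20);               // first transfer done; the queue drains
    return 0;
}
```

A fetch-based variant would put the same budget on the receiver, which pulls large data only when it has room; the abstract explores both directions.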
Automatic, adaptive load balancing is essential for handling load imbalance that may occur during parallel finite element simulations involving mesh adaptivity, nonlinear material behavior and other localized effects. This paper demonstrates the successful application of a measurement-based dynamic load balancing concept to the finite element analysis of elasto-plastic wave propagation and dynamic fracture events. The simulations are performed with the aid of a parallel framework for unstructured meshes called ParFUM, which is based on Charm++ and Adaptive MPI (AMPI) and involves migratable user-level threads. The performance was analyzed using Projections, a performance analysis and post factum visualization tool. The bottlenecks to scalability are identified and eliminated using a variety of strategies resulting in performance gains ranging from moderate to highly significant.
Adaptive MPI is an implementation of the Message Passing Interface (MPI) standard. AMPI benefits MPI programs with features such as dynamic load balancing, virtualization, and checkpointing. AMPI runs each MPI process in a user-level thread, which causes problems when an MPI program has global variables. Manually removing the global variables in the program is tedious and error-prone. In this paper, we present a tool that automates this task with a source-to-source transformation that supports Fortran. We evaluate our tool on a real-world large-scale FLASH code and present preliminary results of running FLASH on AMPI. Our results demonstrate that the tool makes it easier to use AMPI.
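The privatization performed by such a tool can be pictured as moving each global into per-rank state that is passed explicitly, so that many user-level threads can safely share one process. The tool described above is a Fortran source-to-source transformation; the fragment below is only a C++ analogue of the idea, with hypothetical names, not the tool's actual output.

```cpp
// Illustration of the "privatize globals" idea, in C++ rather than the
// Fortran the tool actually handles. The commented-out version would be
// shared by every user-level thread in a process; the transformed version
// carries the former global in per-rank state passed explicitly.
#include <cstdio>

// --- before: a global, unsafe once several MPI ranks share one process ---
// int iterationCount = 0;
// void step() { ++iterationCount; }

// --- after: the global is packed into per-rank state ---
struct RankState {        // one instance per (virtualized) MPI rank
    int iterationCount = 0;
};

void step(RankState& st) { ++st.iterationCount; }

int main() {
    RankState rank0, rank1;        // two virtual ranks co-located safely
    step(rank0);
    step(rank0);
    step(rank1);
    std::printf("rank0=%d rank1=%d\n", rank0.iterationCount, rank1.iterationCount);
    return 0;
}
```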
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '06, 2006
Processor virtualization via migratable objects is a powerful technique that enables the runtime system to carry out intelligent adaptive optimizations like dynamic resource management. CHARM++ is an early language/system that supports migratable objects. This paper describes Adaptive MPI (or AMPI), an MPI implementation and extension, that supports processor virtualization. AMPI implements virtual MPI processes (VPs), several of which may be mapped to a single physical processor. AMPI includes a powerful runtime support system that takes advantage of the degree of freedom afforded by allowing it to assign VPs onto processors. With this runtime system, AMPI supports such features as automatic adaptive overlapping of communication and computation, automatic load balancing, flexibility of running on an arbitrary number of processors, and checkpoint/restart support. It also inherits communication optimization from the CHARM++ framework. This paper describes AMPI, illustrates its performance benefits through a series of benchmarks, and shows that AMPI is a portable and mature MPI implementation that offers various performance benefits to dynamic applications.
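Because AMPI implements the standard MPI interface, an ordinary MPI program such as the sketch below needs no source changes to run with many virtual processes per physical processor; the virtualization is requested at launch time (via charmrun's processor and virtual-process options, as described in the AMPI manual) rather than in the code. The program itself is plain MPI and is shown only to make that point.

```cpp
// A plain MPI program: nothing AMPI-specific appears in the source. Under
// AMPI each rank runs as a migratable user-level thread, so more ranks
// (virtual processes) than physical cores can be requested on the launch
// line; consult the AMPI manual for the exact charmrun options.
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // rank of this (possibly virtual) process
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // total ranks, not physical cores

    // A trivial reduction; on AMPI the runtime may overlap this communication
    // with other virtual processes' computation on the same core.
    int local = rank, total = 0;
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("sum of ranks 0..%d = %d\n", size - 1, total);

    MPI_Finalize();
    return 0;
}
```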
NAMD is a fully featured, production molecular dynamics program for high performance simulation of large biomolecular systems. We have previously, at SC2000, presented scaling results for simulations with cutoff electrostatics on up to 2048 processors of the ASCI Red machine, achieved with an object-based hybrid force and spatial decomposition scheme and an aggressive measurement-based predictive load balancing framework. We extend this work by demonstrating similar scaling on the much faster processors of the PSC Lemieux Alpha cluster, and for simulations employing efficient (order N log N) particle mesh Ewald full electrostatics. This unprecedented scalability in a biomolecular simulation code has been attained through latency tolerance, adaptation to multiprocessor nodes, and the direct use of the Quadrics Elan library in place of MPI by the Charm++/Converse parallel runtime system.
ChaNGa is an N-body cosmology simulation application implemented using Charm++. In this paper, we present the parallel design of ChaNGa and address many challenges arising due to the high dynamic ranges of clustered datasets. We propose optimizations based on adaptive techniques. We evaluate the performance of ChaNGa on highly clustered datasets: a snapshot of a 2 billion particle realization of a 25 Mpc volume, and a 52 million particle multi-resolution realization of a dwarf galaxy. For the 25 Mpc volume, we show strong scaling on up to 128 K cores of Blue Waters. We also demonstrate scaling up to 128 K cores of a multi-stepping run of the 2 billion particle simulation. While the scaling of the multi-stepping run is not as good as single stepping, the throughput at 128 K cores is greater by a factor of 2. We also demonstrate strong scaling on up to 512 K cores of Blue Waters for two large, uniform datasets with 12 and 24 billion particles.
NAMD is a portable parallel application for biomolecular simulations. NAMD pioneered the use of hybrid spatial and force decomposition, a technique used by most scalable programs for biomolecular simulations, including Blue Matter and Desmond which were described at Supercomputing 2006. This paper describes parallel techniques and optimizations developed to enhance NAMD's scalability, to exploit recent large parallel machines. NAMD is developed using Charm++ and benefits from its adaptive communication-computation overlap and dynamic load balancing, as demonstrated in this paper. We describe some recent optimizations including: pencil decomposition of the Particle Mesh Ewald method, reduction of memory footprint, and topology sensitive load balancing. Unlike most other MD programs, NAMD not only runs on a wide variety of platforms ranging from commodity clusters to supercomputers, but also scales to thousands of processors. We present results for up to 32,000 processors on machi...
International Journal of Parallel Programming, 2005
Excerpt: "... recipient of a 2002 Gordon Bell Award, is a production-quality parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems ..." (from "Simulation-Based Performance Prediction for Large Parallel Machines").
Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis - SC '11, 2011
A 100-million-atom biomolecular simulation with NAMD is one of the three benchmarks for the NSF-funded sustainable petascale machine. Simulating this large molecular system on a petascale machine presents great challenges, including handling I/O, large memory footprint and getting good strong-scaling results. In this paper, we present parallel I/O techniques to enable the simulation. A new SMP model is designed to efficiently utilize ubiquitous wide multicore clusters by extending the CHARM++ asynchronous message-driven runtime. We exploit node-aware techniques to optimize both the application and the underlying SMP runtime. Hierarchical load balancing is further exploited to scale NAMD to the full Jaguar PF Cray XT5 (224,076 cores) at Oak Ridge National Laboratory, both with and without PME full electrostatics, achieving 93% parallel efficiency (vs 6720 cores) at 9 ms per step for a simple cutoff calculation. Excellent scaling is also obtained on 65,536 cores of the Intrepid Blue Gene/P at Argonne National Laboratory.
Proceedings of the 8th Workshop on Parallel and Distributed Systems Testing, Analysis, and Debugging - PADTAD '10, 2010
With the advent of increasingly larger parallel machines, debugging is becoming more and more challenging. In particular, applications at this scale tend to behave non-deterministically, leading to race condition bugs. Furthermore, gaining access to these large machines for long debugging sessions is generally infeasible. In this paper, we present a 3-step algorithm to perform what we call "processor extraction": a ...
Proceedings of the 2010 TeraGrid Conference - TG '10, 2010
Excerpt: "... in Figure 9 (lines) using the same kNeighbor benchmark running on an 8-core multi-core desktop ... Intuitively, mapping communicating threads to closer cores in the memory hierarchy ..." (a table comparing kNeighbor timings with and without CPU affinity is omitted).
Papers by Gengbin Zheng