2009, Proceedings of the 18th International Meshing Roundtable
Mesh generation is a critical component for many (bio-)engineering applications. However, parallel mesh generation codes, which are essential for these applications to take full advantage of high-end computing platforms, belong to the broader class of adaptive and irregular problems, and are among the most complex, challenging, and labor-intensive to develop and maintain. As a result, parallel mesh generation is one of the last applications to be installed on new parallel architectures. In this paper we present a way to remedy this problem for new highly-scalable architectures. We present a multi-layered tetrahedral/triangular mesh generation approach capable of delivering and sustaining close to 10^18 concurrent work units. We achieve this by leveraging concurrency at different granularity levels using a hybrid algorithm, and by carefully matching these levels to the hierarchy of the hardware architecture. This paper makes two contributions: (1) a new evolutionary path for developing multi-layered parallel mesh generation codes capable of increasing the concurrency of the state-of-the-art parallel mesh generation methods by at least 10 orders of magnitude and (2) a new abstraction for multi-layered runtime systems that target parallel mesh generation codes, to efficiently orchestrate intra- and inter-layer data movement and load balancing for current and emerging multi-layered architectures with deep memory and network hierarchies.
Proceedings of the 19th Annual International Conference on Supercomputing (ICS '05), 2005
Given the importance of parallel mesh generation in large-scale scientific applications and the proliferation of multilevel SMT-based architectures, it is imperative to obtain insight on the interaction between meshing algorithms and these systems. We focus on Parallel Constrained Delaunay Mesh (PCDM) generation. We exploit coarse-grain parallelism at the subdomain level and fine-grain at the element level. This multigrain data parallel approach targets clusters built from low-end, commercially available SMTs. Our experimental evaluation shows that current SMTs are not capable of executing fine-grain parallelism in PCDM. However, experiments on a simulated SMT indicate that with modest hardware support it is possible to exploit fine-grain parallelism opportunities. The exploitation of fine-grain parallelism results in higher performance than a pure MPI implementation and closes the gap between the performance of PCDM and the state-of-the-art sequential mesher on a single physical processor. Our findings extend to other adaptive and irregular multigrain, parallel algorithms.
Journal of Parallel and …, 2009
This article focuses on the optimization of PCDM, a parallel, two-dimensional (2D) Delaunay mesh generation application, and its interaction with parallel architectures based on simultaneous multithreading (SMT) processors. We first present the step-by-step effect of a series of optimizations on performance. These optimizations improve the performance of PCDM by up to a factor of six. They target issues that very often limit the performance of scientific computing codes. We then evaluate the interaction of PCDM with a real SMT-based SMP system, using both high-level metrics, such as execution time, and low-level information from hardware performance counters.
2015
In this poster we present our preliminary results on the integration of multiple parallel Delaunay mesh generation methods into a coherent hierarchical framework. The goal of this project is to study our telescopic approach and to develop Delaunay-based methods to explore concurrency at all hardware layers using abstractions at (a) the medium-grain level for many cores within a single chip and (b) the coarse-grain level, i.e., the subdomain level, using proper error-metric- and application-specific continuous decomposition methods.
Journal of Parallel and …, 2009
Given the proliferation of layered, multicore- and SMT-based architectures, it is imperative to deploy and evaluate important, multi-level, scientific computing codes, such as meshing algorithms, on these systems. We focus on Parallel Constrained Delaunay Mesh (PCDM) generation. We exploit coarse-grain parallelism at the subdomain level, medium-grain at the cavity level, and fine-grain at the element level. This multi-grain data parallel approach targets clusters built from commercially available SMTs and multicore processors. The exploitation of the coarser degree of granularity facilitates scalability both in terms of execution time and problem size on loosely-coupled clusters. The exploitation of medium-grain parallelism allows performance improvement at the single node level. Our experimental evaluation shows that the first generation of SMT cores is not capable of taking advantage of fine-grain parallelism in PCDM. Many of our experimental findings with PCDM extend to other adaptive and irregular multigrain parallel algorithms as well.
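To make the three grains concrete, here is a minimal, self-contained Python sketch (the names, toy geometry, and thread-pool framing are illustrative assumptions, not the PCDM code): the per-triangle incircle test is the fine grain, computing a full cavity for a candidate point is the medium grain, and the whole block would run once per subdomain (e.g., one MPI rank each) as the coarse grain.

```python
from concurrent.futures import ThreadPoolExecutor

def in_circumcircle(tri, p, pts):
    """Fine grain: one geometric predicate per element."""
    (ax, ay), (bx, by), (cx, cy) = (pts[i] for i in tri)
    px, py = p
    rows = [(dx, dy, dx * dx + dy * dy)
            for dx, dy in ((ax - px, ay - py), (bx - px, by - py), (cx - px, cy - py))]
    det = (rows[0][0] * (rows[1][1] * rows[2][2] - rows[1][2] * rows[2][1])
         - rows[0][1] * (rows[1][0] * rows[2][2] - rows[1][2] * rows[2][0])
         + rows[0][2] * (rows[1][0] * rows[2][1] - rows[1][1] * rows[2][0]))
    orient = (bx - ax) * (cy - ay) - (by - ay) * (cx - ax)  # sign fix for CW triangles
    return det * orient > 0

def cavity(point, tris, pts):
    """Medium grain: the set of triangles whose circumcircle contains `point`."""
    return [t for t in tris if in_circumcircle(t, point, pts)]

# Toy subdomain: a unit square split into two triangles.
pts = [(0, 0), (1, 0), (1, 1), (0, 1)]
tris = [(0, 1, 2), (0, 2, 3)]
candidates = [(0.5, 0.4), (0.3, 0.7), (0.8, 0.8)]

with ThreadPoolExecutor() as pool:  # medium grain, inside one subdomain
    cavities = list(pool.map(lambda p: cavity(p, tris, pts), candidates))
print(cavities)  # coarse grain: this whole block runs once per subdomain
```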
53rd AIAA Aerospace Sciences Meeting, 2015
Despite great advancements in the parallelization of numerical simulation codes over the last 20 years, it is still common to perform grid generation in serial. Generating large-scale grids in serial often requires using special "grid generation" compute machines that can have more than ten times the memory of average machines. While some parallel mesh generation techniques have been proposed, generating very large meshes for LES or aeroacoustic simulations is still a challenging problem. An automated method for the parallel generation of very large-scale off-body hierarchical meshes is presented here. This work enables large-scale parallel generation of off-body meshes by using a novel combination of parallel grid generation techniques and a hybrid "top down" and "bottom up" octree method. Meshes are generated using hardware commonly found in parallel compute clusters. The capability to generate very large meshes is demonstrated by the generation of off-body meshes surrounding complex aerospace geometries. Results include a one-billion-cell mesh around a Predator Unmanned Aerial Vehicle geometry, generated on 64 processors in under 45 minutes.
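As a rough illustration of the "top down" half of the approach, the following Python sketch recursively splits cells whose bounding sphere crosses the body surface until they reach a target edge length; the signed-distance representation of the geometry and all names are assumptions for the sketch, not the paper's implementation.

```python
import math

def refine_top_down(center, half, sdf, target_h, leaves):
    """Split any cell whose bounding sphere crosses the surface (sdf == 0)
    until cells reach the target edge length; collect the rest as leaves."""
    if abs(sdf(center)) > half * math.sqrt(3) or 2 * half <= target_h:
        leaves.append((center, half))  # away from the body, or fine enough
        return
    h = half / 2
    for sx in (-h, h):
        for sy in (-h, h):
            for sz in (-h, h):
                child = (center[0] + sx, center[1] + sy, center[2] + sz)
                refine_top_down(child, h, sdf, target_h, leaves)

# Example body: a unit sphere at the origin, meshed inside a [-2, 2]^3 box.
sphere = lambda p: math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) - 1.0
leaves = []
refine_top_down((0.0, 0.0, 0.0), 2.0, sphere, 0.25, leaves)
print(len(leaves), "leaf cells")
```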
Proceedings of the 16th International Meshing Roundtable, 2008
This paper describes a distributed-memory, embarrassingly parallel hexahedral mesh generator, pCAMAL (parallel CUBIT Adaptive Mesh Algorithm Library). pCAMAL utilizes the sweeping method following a serial step of geometry decomposition conducted in the CUBIT geometry preparation and mesh generation tool. The utility of pCAMAL in generating large meshes is illustrated, and linear speed-up under load-balanced conditions is demonstrated.
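The sweeping method itself can be illustrated with a toy Python sketch (assumed names and a straight sweep path, unlike CUBIT's general source surfaces): a structured quad source layer is extruded through a number of levels, each quad and its copy on the next level forming one hexahedron.

```python
def sweep_hexes(nx, ny, nz):
    """Extrude an nx-by-ny quad layer through nz levels: every quad at level k,
    paired with its copy at level k+1, yields one 8-node hexahedron."""
    def nid(i, j, k):  # node id in an (nx+1)(ny+1)(nz+1) structured grid
        return (k * (ny + 1) + j) * (nx + 1) + i
    nodes = [(i, j, k) for k in range(nz + 1)
                       for j in range(ny + 1)
                       for i in range(nx + 1)]
    layer = (nx + 1) * (ny + 1)  # node-id offset between consecutive levels
    hexes = []
    for k in range(nz):
        for j in range(ny):
            for i in range(nx):
                a, b = nid(i, j, k), nid(i + 1, j, k)
                c, d = nid(i + 1, j + 1, k), nid(i, j + 1, k)
                hexes.append((a, b, c, d,                      # quad on level k
                              a + layer, b + layer, c + layer, d + layer))
    return nodes, hexes

nodes, hexes = sweep_hexes(4, 3, 5)
print(len(nodes), "nodes,", len(hexes), "hexes")  # 120 nodes, 60 hexes
```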
2015
In this paper, we describe an array-based hierarchical mesh generation capability through uniform refinement of unstructured meshes for efficient solution of PDEs using finite element methods and multigrid solvers. A multi-degree, multi-dimensional, and multi-level framework is designed to generate the nested hierarchies from an initial mesh that can be used for a number of purposes, ranging from multi-level methods to generating large meshes. The capability is developed under the parallel mesh framework "Mesh Oriented dAtaBase", a.k.a. MOAB [16]. We describe the underlying data structures and algorithms to generate such hierarchies and present numerical results for computational efficiency and mesh quality. We also present results to demonstrate the applicability of the developed capability to a multigrid finite-element solver.
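A minimal sketch of one uniform 2D refinement level, assuming simple vertex/triangle lists rather than MOAB's array-based storage: each triangle is split into four children through deduplicated edge midpoints, and stacking such levels yields the nested hierarchy a multigrid solver needs.

```python
def refine_uniform(verts, tris):
    """One uniform refinement level: each triangle is split into four children
    through its edge midpoints; midpoints are deduplicated so neighbors share
    them and the successive meshes stay nested."""
    verts = list(verts)
    midpoint = {}
    def mid(i, j):
        key = (min(i, j), max(i, j))
        if key not in midpoint:
            (x1, y1), (x2, y2) = verts[i], verts[j]
            verts.append(((x1 + x2) / 2, (y1 + y2) / 2))
            midpoint[key] = len(verts) - 1
        return midpoint[key]
    fine = []
    for a, b, c in tris:
        ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)
        fine += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    return verts, fine

# Nested hierarchy for multigrid: levels[0] is the initial mesh.
levels = [([(0, 0), (1, 0), (0, 1)], [(0, 1, 2)])]
for _ in range(3):
    levels.append(refine_uniform(*levels[-1]))
print([len(t) for _, t in levels])  # [1, 4, 16, 64] triangles
```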
In the march towards exascale, supercomputer architectures are undergoing a significant change. Limited by power consumption and heat dissipation, future supercomputers are likely to be built around a lower-power many-core model. This shift in supercomputer design will require sweeping code changes in order to take advantage of the highly-parallel architectures. Evolving or rewriting legacy applications to perform well on these machines is a significant challenge.
2011
Scientists commonly turn to supercomputers or Clusters of Workstations with hundreds (even thousands) of nodes to generate meshes for large-scale simulations. Parallel mesh generation software is then used to decompose the original mesh generation problem into smaller sub-problems that can be solved (meshed) in parallel. The size of the final mesh is limited by the amount of aggregate memory of the parallel machine. Also, requesting many compute nodes on a shared computing resource may result in a long wait in the queue, far surpassing the time it takes to solve the problem. These two problems (i.e., insufficient memory when computing on a small number of nodes, and long waiting times when using many nodes from a shared computing resource) can be addressed by using out-of-core algorithms. These are algorithms that keep most of the dataset out-of-core (i.e., outside of memory, on disk) and load only a portion in-core (i.e., into memory) at a time. We explored two approaches to out-of-core computation.
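A minimal sketch of the out-of-core pattern described here, with a stand-in mesher and an assumed on-disk layout: all subdomains live on disk, and exactly one is brought in-core at a time.

```python
import os
import pickle
import tempfile

def mesh_subdomain(sub):
    """Stand-in for a real sequential mesher."""
    return {"id": sub["id"], "elements": sub["budget"]}

workdir = tempfile.mkdtemp()

# Spill every subdomain to disk up front; none is held in memory.
paths = []
for i in range(8):
    path = os.path.join(workdir, f"subdomain_{i}.pkl")
    with open(path, "wb") as f:
        pickle.dump({"id": i, "budget": 1_000_000}, f)
    paths.append(path)

# Bring exactly one subdomain in-core at a time: load, mesh, write, evict.
for path in paths:
    with open(path, "rb") as f:
        sub = pickle.load(f)                  # in-core
    with open(path + ".mesh", "wb") as f:
        pickle.dump(mesh_subdomain(sub), f)   # result straight back to disk
    del sub                                   # out-of-core again
```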
International Journal for Numerical …, 2003
We present the results of an evaluation study on the re-structuring of a latency-bound mesh generation algorithm into a latency-tolerant parallel kernel. We use concurrency at a fine-grain level to tolerate long, variable, and unpredictable latencies of remote data gather operations required for parallel guaranteed quality Delaunay triangulations. Our performance data from a 16-node SP2 and a 32-node Cluster of Sparc Workstations suggest that more than 90% of the latency from remote data gather operations can be masked effectively at the cost of increasing communication overhead between 2 and 20% of the total run time. Despite the increase in the communication overhead, the latency-tolerant mesh generation kernel we present in this paper can generate tetrahedral meshes for parallel field solvers eight to nine times faster than the traditional approach.
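The masking idea can be sketched in a few lines of Python (thread-pool framing and names are assumptions; the paper's kernel uses message passing): the remote gather is issued early, and local fine-grain work proceeds until its result is actually needed.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def remote_gather(subdomain_id):
    """Stand-in for a long, variable-latency remote data gather."""
    time.sleep(0.1)
    return f"boundary cavity data for subdomain {subdomain_id}"

def local_refinement_step():
    """Stand-in for fine-grain work that needs no remote data."""
    time.sleep(0.01)

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(remote_gather, 42)  # issue the gather early...
    steps = 0
    while not pending.done():
        local_refinement_step()               # ...and hide its latency
        steps += 1
    data = pending.result()
print(f"masked the gather behind {steps} local steps; got: {data}")
```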
Journal of Experimental Algorithmics, 2011
We present two cost-effective and high-performance out-of-core parallel mesh generation algorithms and their implementation on Clusters of Workstations (CoWs). The total wall-clock time, including wait-in-queue delays, for the out-of-core methods on a small cluster (16 processors) is three times shorter than the total wall-clock time for the in-core generation of a mesh of the same size (about a billion elements) using 121 processors. Our best out-of-core method, for mesh sizes that fit completely in the core of the CoWs, is about 5% slower than its in-core parallel counterpart. This is a modest performance penalty for savings of many hours in response time. Both the in-core and out-of-core methods use the best publicly available off-the-shelf sequential in-core Delaunay mesh generator.
2005 IEEE Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, 2005
In this paper we present two approaches for parallel out-of-core mesh generation. The first approach is based on a traditional prioritized page replacement algorithm, using a prioritized version of the accepted LRU replacement scheme proposed by Salmon et al. for n-body calculations. The second approach is based on the percolation model proposed for the HTMT petaflops design. We evaluate both approaches using the parallel constrained Delaunay mesh generation method. Our preliminary data suggest that for problem sizes of up to half a billion elements the traditional approach is very effective. However, for larger problem sizes (on the order of billions of elements) the traditional approach becomes prohibitively expensive, while our preliminary data suggest that the non-traditional percolation approach is a good alternative.
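A minimal sketch of a prioritized LRU replacement policy of the kind described (illustrative, not the paper's implementation): eviction removes the least-recently-used page of the lowest priority class, so high-priority pages survive longer.

```python
from collections import OrderedDict

class PrioritizedLRU:
    """On eviction, drop the least-recently-used entry of the lowest priority
    class, so high-priority pages (e.g. active subdomains) survive longer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # key -> (priority, value); order == recency

    def get(self, key):
        priority, value = self.pages.pop(key)
        self.pages[key] = (priority, value)  # move to most-recently-used end
        return value

    def put(self, key, value, priority=0):
        if key in self.pages:
            self.pages.pop(key)
        elif len(self.pages) >= self.capacity:
            # min() scans in recency order, so priority ties fall on the LRU.
            victim = min(self.pages, key=lambda k: self.pages[k][0])
            self.pages.pop(victim)
        self.pages[key] = (priority, value)

cache = PrioritizedLRU(capacity=2)
cache.put("page-a", "...", priority=1)  # high priority
cache.put("page-b", "...", priority=0)
cache.put("page-c", "...", priority=0)  # evicts page-b, not page-a
print(list(cache.pages))                # ['page-a', 'page-c']
```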
2006
Parallel supercomputing has traditionally focused on the inner kernel of scientific simulations: the solver. The front and back ends of the simulation pipeline - problem description and interpretation of the output - have taken a back seat to the solver when it comes to attention paid to scalability and performance, and are often relegated to offline, sequential computation. As the largest simulations move beyond the realm of the terascale and into the petascale, this decomposition in tasks and platforms becomes increasingly untenable. We propose an end-to-end approach in which all simulation components - meshing, partitioning, solver, and visualization - are tightly coupled and execute in parallel with shared data structures and no intermediate I/O. We present our implementation of this new approach in the context of octree-based finite element simulation of earthquake ground motion. Performance evaluation on up to 2048 processors demonstrates the ability of the end-to-end approach to overcome the scalability bottlenecks of the traditional approach.
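The contrast with the traditional pipeline can be sketched as follows (toy stage functions, assumed names): instead of serializing each stage's output to disk, every stage consumes its predecessor's in-memory data structure.

```python
# Traditional pipeline, with intermediate files between the stages:
#   mesh() -> mesh.file -> partition() -> parts.file -> solve() -> ...
# End-to-end approach: shared in-memory structures, no intermediate I/O
# (the real system couples these stages across processors).

def mesh(geometry):       return {"elements": ["e0", "e1", "e2"], "geom": geometry}
def partition(m, nparts): return [m["elements"][i::nparts] for i in range(nparts)]
def solve(parts):         return {"u": [0.0] * sum(len(p) for p in parts)}
def visualize(solution):  return f"rendered {len(solution['u'])} dofs"

# Each stage consumes its predecessor's data structure directly.
print(visualize(solve(partition(mesh("octree domain"), nparts=2))))
```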
Lecture Notes in Computational Science and Engineering
Parallel mesh generation is a relatively new research area between the boundaries of two scientific computing disciplines: computational geometry and parallel computing. In this chapter we present a survey of parallel unstructured mesh generation methods. Parallel mesh generation methods decompose the original mesh generation problem into smaller subproblems which are meshed in parallel. We organize the parallel mesh generation methods in terms of two basic attributes: (1) the sequential technique used for meshing the individual subproblems and (2) the degree of coupling between the subproblems. This survey shows that, without compromising the stability of parallel mesh generation methods, it is possible to develop parallel meshing software using off-the-shelf sequential meshing codes. However, more research is required for the efficient use of the state-of-the-art codes, which can scale from emerging chip multiprocessors (CMPs) to clusters built from CMPs.
Procedia Engineering, 2015
In this paper, we present a scalable three-dimensional hybrid parallel Delaunay image-to-mesh conversion algorithm (PDR.PODM) for distributed shared memory architectures. PDR.PODM is able to explore parallelism early in the mesh generation process because of the aggressive speculative approach employed by the Parallel Optimistic Delaunay Mesh generation algorithm (PODM). In addition, it decreases the communication overhead and improves data locality by making use of a data partitioning scheme offered by the Parallel Delaunay Refinement algorithm (PDR). PDR.PODM utilizes an octree structure to decompose the initial mesh and to distribute the bad elements to different octree leaves (subregions). A set of independent subregions is selected and refined in parallel without any synchronization among them. In each subregion, a group of threads is assigned to insert or delete multiple points based on the refinement rules offered by PODM. We tested PDR.PODM on Blacklight, a distributed shared memory (DSM) machine at the Pittsburgh Supercomputing Center, and observed a weak-scaling speedup of 163.8 and above for up to 256 cores, as opposed to PODM, whose weak-scaling speedup is only 44.7 on 256 cores. The end result is that we can generate 18 million elements per second, as opposed to 14 million per second in our earlier work. To the best of our knowledge, PDR.PODM exhibits the best scalability among parallel guaranteed-quality Delaunay mesh generation algorithms running on DSM supercomputers.
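The selection of independent subregions can be illustrated with a small Python sketch (assumed leaf coordinates at a single octree level, not PDR.PODM's actual structures): a greedy pass picks leaves no two of which touch, so each can be refined without synchronization.

```python
def independent_leaves(leaves):
    """Greedy maximal set of leaves no two of which touch (26-connectivity),
    so each can be refined with no synchronization against the others."""
    chosen, blocked = [], set()
    for cell in leaves:
        if cell in blocked:
            continue
        chosen.append(cell)
        i, j, k = cell
        blocked.update((i + di, j + dj, k + dk)
                       for di in (-1, 0, 1)
                       for dj in (-1, 0, 1)
                       for dk in (-1, 0, 1))
    return chosen

# Leaves holding bad elements, as same-level octree coordinates.
bad = [(0, 0, 0), (0, 0, 1), (2, 0, 0), (5, 5, 5)]
print(independent_leaves(bad))  # [(0, 0, 0), (2, 0, 0), (5, 5, 5)]
```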
… on Numerical Grid …, 2007
Scalable and locality-aware multiprocessor memory allocators are critical for harnessing the potential of emerging multithreaded and multicore architectures. This paper evaluates two state-of-the-art generic multithreaded allocators, designed for both scalability and locality, against custom allocators written to optimize the multithreaded implementation of parallel mesh generation algorithms. We use three algorithms with different communication/synchronization requirements. The implementations of all three algorithms are heavily dependent on dynamically allocated pointer-based data structures, and all three use optimized internal memory allocators based on application-specific knowledge. The memory allocators in our study are implemented and evaluated on two real multiprocessors with a multi-SMT (quad Hyperthreaded Intel) and a multi-CMP/SMT (dual IBM Power5) organization. Our results indicate that properly engineered generic memory allocators can come close to, or sometimes exceed (in sequential allocation), the performance of custom multithreaded allocators. These results suggest that in the near future we should be able to develop generic multithreaded allocators that can adapt to application characteristics and increase productivity without compromising performance.
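A custom allocator of the kind compared here typically recycles elements through a free list; below is a minimal Python sketch of that pattern (illustrative only; the evaluated allocators are native-code allocators that also optimize for locality).

```python
class ElementPool:
    """Recycle dead triangles through a free list instead of returning them
    to the general-purpose heap; a cavity retriangulation deletes and creates
    elements constantly, so the free list absorbs most allocation traffic."""

    def __init__(self):
        self._free = []

    def alloc(self, v0, v1, v2):
        if self._free:
            elem = self._free.pop()       # hot path: reuse, no fresh allocation
            elem[0], elem[1], elem[2] = v0, v1, v2
            return elem
        return [v0, v1, v2]               # cold path: allocate normally

    def release(self, elem):
        self._free.append(elem)           # deletions feed the free list

pool = ElementPool()
t = pool.alloc(0, 1, 2)
pool.release(t)                           # deleted during a cavity update
assert pool.alloc(3, 4, 5) is t           # same storage, recycled
```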
Advances in Engineering Software, 2013
This work describes a technique for generating two-dimensional triangular meshes using distributed memory parallel computers, based on a master/slaves model. This technique uses a coarse quadtree to decompose the domain and a serial advancing front technique to generate the mesh in each subdomain concurrently. In order to advance the front into a neighboring subdomain, each subdomain is shifted in a Cartesian direction, and the same advancing front approach is performed on the shifted subdomain. This shift-and-remesh procedure is repeatedly applied until no more mesh can be generated, shifting the subdomains in different directions each turn. A finer quadtree is also employed in this work to help estimate the processing load associated with each subdomain. This load estimation technique produces results that accurately represent the number of elements to be generated in each subdomain, leading to proper runtime prediction and to a well-balanced algorithm. The meshes generated with the parallel technique have the same quality as those generated serially, within acceptable limits. Although the presented approach is two-dimensional, the idea can easily be extended to three dimensions.
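The balancing step this estimate enables can be sketched as a greedy longest-processing-time assignment (illustrative names; the paper's estimator uses fine-quadtree data per subdomain as a proxy for element count):

```python
import heapq

def balance(estimated_loads, n_slaves):
    """Longest-processing-time greedy: deal subdomains, heaviest first, to the
    currently least-loaded slave (a sketch of the balancing step only)."""
    heap = [(0, s) for s in range(n_slaves)]   # (assigned load, slave id)
    heapq.heapify(heap)
    assignment = {s: [] for s in range(n_slaves)}
    for sub, load in sorted(estimated_loads.items(), key=lambda kv: -kv[1]):
        total, s = heapq.heappop(heap)
        assignment[s].append(sub)
        heapq.heappush(heap, (total + load, s))
    return assignment

# Predicted element counts per subdomain drive the assignment.
print(balance({"A": 900, "B": 450, "C": 430, "D": 80}, n_slaves=2))
# {0: ['A'], 1: ['B', 'C', 'D']}  (loads 900 vs 960)
```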
SIAM Workshop on Combinatorial Scientific Computing, 2004
Meshes of high quality are an important ingredient for many applications in scientific computing. An accurate discretization of the problem geometry with elements of good aspect ratio is required by many numerical methods. In the finite element method, for example, interpolation error is related to the largest element angle in the mesh [1]. There is a critical need for algorithms that can generate meshes of provably high quality. For large-scale problems that require frequent remeshing (such as problems with evolving geometry), these algorithms must run in parallel on distributed memory machines. Whereas in recent years great strides have been made in parallel solvers, automatic parallel mesh generation for arbitrary domains remains an unsolved problem. Delaunay Refinement has proven useful for generating meshes of good aspect ratio. Provably good working algorithms that generate meshes for arbitrary domains exist in two dimensions. Efficient sequential implementations are available [2,3]. In three dimensions the problem is more challenging. Recent theoretical results [4] suggest algorithms to solve the general three dimensional meshing problem, but all sequential implementations available today can only cope with input that respects large angle bounds.
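The quality criterion driving Delaunay refinement is usually the circumradius-to-shortest-edge ratio; below is a minimal sketch using the standard formula, not tied to any of the cited implementations.

```python
import math

def radius_edge_ratio(a, b, c):
    """Circumradius divided by shortest edge; Delaunay refinement flags a
    triangle as 'bad' when this exceeds a bound (sqrt(2) corresponds to a
    minimum angle of about 20.7 degrees)."""
    la, lb, lc = math.dist(b, c), math.dist(c, a), math.dist(a, b)
    area = abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])) / 2
    circumradius = la * lb * lc / (4 * area)
    return circumradius / min(la, lb, lc)

print(radius_edge_ratio((0, 0), (1, 0), (0.5, math.sqrt(3) / 2)))  # equilateral: ~0.577
print(radius_edge_ratio((0, 0), (1, 0), (0.5, 0.05)))              # sliver: far above sqrt(2)
```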