2005, Proceedings of the 19th annual international conference on Supercomputing - ICS '05
Given the importance of parallel mesh generation in large-scale scientific applications and the proliferation of multilevel SMT-based architectures, it is imperative to obtain insight on the interaction between meshing algorithms and these systems. We focus on Parallel Constrained Delaunay Mesh (PCDM) generation. We exploit coarse-grain parallelism at the subdomain level and fine-grain parallelism at the element level. This multigrain data-parallel approach targets clusters built from low-end, commercially available SMTs. Our experimental evaluation shows that current SMTs are not capable of executing fine-grain parallelism in PCDM. However, experiments on a simulated SMT indicate that, with modest hardware support, it is possible to exploit fine-grain parallelism opportunities. The exploitation of fine-grain parallelism results in higher performance than a pure MPI implementation and closes the gap between the performance of PCDM and the state-of-the-art sequential mesher on a single physical processor. Our findings extend to other adaptive and irregular multigrain, parallel algorithms.
Journal of Parallel and …, 2009
Given the proliferation of layered, multicore- and SMT-based architectures, it is imperative to deploy and evaluate important, multi-level, scientific computing codes, such as meshing algorithms, on these systems. We focus on Parallel Constrained Delaunay Mesh (PCDM) generation. We exploit coarse-grain parallelism at the subdomain level, medium-grain parallelism at the cavity level, and fine-grain parallelism at the element level. This multi-grain data-parallel approach targets clusters built from commercially available SMTs and multicore processors. The exploitation of the coarser degree of granularity facilitates scalability both in terms of execution time and problem size on loosely coupled clusters. The exploitation of medium-grain parallelism allows performance improvement at the single-node level. Our experimental evaluation shows that the first generation of SMT cores is not capable of taking advantage of fine-grain parallelism in PCDM. Many of our experimental findings with PCDM extend to other adaptive and irregular multigrain parallel algorithms as well.
Proceedings of the 18th International Meshing Roundtable, 2009
Mesh generation is a critical component for many (bio-)engineering applications. However, parallel mesh generation codes, which are essential for these applications to take the fullest advantage of the high-end computing platforms, belong to the broader class of adaptive and irregular problems, and are among the most complex, challenging, and labor-intensive to develop and maintain. As a result, parallel mesh generation is one of the last applications to be installed on new parallel architectures. In this paper we present a way to remedy this problem for new highly-scalable architectures. We present a multi-layered tetrahedral/triangular mesh generation approach capable of delivering and sustaining close to 10^18 concurrent work units. We achieve this by leveraging concurrency at different granularity levels using a hybrid algorithm, and by carefully matching these levels to the hierarchy of the hardware architecture. This paper makes two contributions: (1) a new evolutionary path for developing multi-layered parallel mesh generation codes capable of increasing the concurrency of the state-of-the-art parallel mesh generation methods by at least 10 orders of magnitude and (2) a new abstraction for multilayered runtime systems that target parallel mesh generation codes, to efficiently orchestrate intra- and inter-layer data movement and load balancing for current and emerging multi-layered architectures with deep memory and network hierarchies.
2005 IEEE Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, 2005
In this paper we present two approaches for parallel out-of-core mesh generation. The first approach is based on a traditional prioritized page-replacement algorithm: a prioritized version of the well-accepted LRU replacement scheme proposed by Salmon et al. for n-body calculations. The second approach is based on the percolation model proposed for the HTMT petaflops design. We evaluate both approaches using the parallel constrained Delaunay mesh generation method. Our preliminary data suggest that for problem sizes up to half a billion elements the traditional approach is very effective. However, for larger problem sizes (on the order of billions of elements) the traditional approach becomes prohibitively expensive, but it appears from our preliminary data that the non-traditional percolation approach is a good alternative.
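The prioritized LRU policy described above can be sketched in a few lines of Python. This is a minimal illustration only: the class name, the `(priority, last_use)` encoding, and the tie-breaking rule are assumptions for exposition, not the paper's implementation.

```python
class PrioritizedLRU:
    """Sketch of a prioritized LRU page cache: the victim is the resident
    page with the smallest (priority, last_use) pair, so low-priority
    pages are evicted first and ties fall back to plain LRU order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.clock = 0
        self.pages = {}  # page_id -> (priority, last_use)

    def access(self, page_id, priority):
        """Touch a page, evicting one entry first if the cache is full."""
        self.clock += 1
        if page_id not in self.pages and len(self.pages) >= self.capacity:
            # Victim: smallest (priority, last_use) among resident pages.
            victim = min(self.pages, key=lambda p: self.pages[p])
            del self.pages[victim]
        self.pages[page_id] = (priority, self.clock)
```

With a capacity of two, touching pages `a` (priority 1), `b` (priority 2), then `c` (priority 2) evicts `a`, the lowest-priority resident page, even though `a` is not the least recently used by timestamp alone.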
Journal of Experimental Algorithmics, 2011
We present two cost-effective and high-performance out-of-core parallel mesh generation algorithms and their implementation on Clusters of Workstations (CoWs). The total wall-clock time, including wait-in-queue delays, for the out-of-core methods on a small cluster (16 processors) is three times shorter than the total wall-clock time for the in-core generation of a mesh of the same size (about a billion elements) using 121 processors. Our best out-of-core method, for mesh sizes that fit completely in the core of the CoWs, is about 5% slower than its in-core parallel counterpart. This is a modest performance penalty for savings of many hours in response time. Both the in-core and out-of-core methods use the best publicly available off-the-shelf sequential in-core Delaunay mesh generator.
Proceedings of the 15th International Meshing Roundtable, 2006
The contribution of the current paper is threefold. First, we generalize the existing sequential point placement strategies for guaranteed quality Delaunay refinement: instead of a specific position for a new point, we derive a selection disk inside the circumdisk of a poor quality triangle. We prove that any point placement algorithm that inserts a point inside the selection disk of a poor quality triangle will terminate and produce a size-optimal mesh. Second, we extend our theoretical foundation for parallel Delaunay refinement. Our new parallel algorithm can be used in conjunction with any sequential point placement strategy that chooses a point within the selection disk. Third, we implemented our algorithm in C++ for shared memory architectures and present the experimental results. Our data show that even on workstations with a few cores, which are now in common use, our implementation is significantly faster than the best sequential counterpart.
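A minimal Python sketch of the selection-disk idea: compute the circumcircle of a poor-quality triangle and test whether a candidate Steiner point lies in a disk inside it. The circumcenter formula is standard; modelling the selection disk as a concentric disk of radius `delta` times the circumradius is a simplifying assumption here, whereas the paper derives the disk from the quality bound.

```python
import math

def circumcircle(a, b, c):
    """Circumcenter and circumradius of triangle abc (2-D points)."""
    ax, ay = a; bx, by = b; cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax*ax + ay*ay) * (by - cy) + (bx*bx + by*by) * (cy - ay)
          + (cx*cx + cy*cy) * (ay - by)) / d
    uy = ((ax*ax + ay*ay) * (cx - bx) + (bx*bx + by*by) * (ax - cx)
          + (cx*cx + cy*cy) * (bx - ax)) / d
    r = math.hypot(ax - ux, ay - uy)
    return (ux, uy), r

def in_selection_disk(p, tri, delta=0.5):
    """True if p lies in a concentric disk of radius delta * circumradius.
    The concentric disk is an illustrative stand-in for the paper's
    quality-derived selection disk."""
    (ux, uy), r = circumcircle(*tri)
    return math.hypot(p[0] - ux, p[1] - uy) <= delta * r
```

For the right triangle (0,0), (2,0), (0,2) the circumcenter is (1,1) with radius sqrt(2); the circumcenter itself is always inside the selection disk, while a vertex is not.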
2006
Parallel supercomputing has traditionally focused on the inner kernel of scientific simulations: the solver. The front and back ends of the simulation pipeline - problem description and interpretation of the output - have taken a back seat to the solver when it comes to attention paid to scalability and performance, and are often relegated to offline, sequential computation. As the largest simulations move beyond the realm of the terascale and into the petascale, this decomposition of tasks and platforms becomes increasingly untenable. We propose an end-to-end approach in which all simulation components - meshing, partitioning, solver, and visualization - are tightly coupled and execute in parallel with shared data structures and no intermediate I/O. We present our implementation of this new approach in the context of octree-based finite element simulation of earthquake ground motion. Performance evaluation on up to 2048 processors demonstrates the ability of the end-to-end approach to overcome the scalability bottlenecks of the traditional approach.
Advances in Engineering Software, 2013
This work describes a technique for generating two-dimensional triangular meshes using distributed-memory parallel computers, based on a master/slaves model. This technique uses a coarse quadtree to decompose the domain and a serial advancing-front technique to generate the mesh in each subdomain concurrently. In order to advance the front into a neighboring subdomain, each subdomain is shifted in a Cartesian direction, and the same advancing-front approach is performed on the shifted subdomain. This shift-and-remesh procedure is applied repeatedly until no more mesh can be generated, shifting the subdomains in different directions each turn. A finer quadtree is also employed in this work to help estimate the processing load associated with each subdomain. This load-estimation technique produces results that accurately represent the number of elements to be generated in each subdomain, leading to proper runtime prediction and to a well-balanced algorithm. The meshes generated with the parallel technique have the same quality as those generated serially, within acceptable limits. Although the presented approach is two-dimensional, the idea can be easily extended to three dimensions.
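The finer-quadtree load estimation can be illustrated with a small sketch: recursively subdivide a cell until it matches a local sizing function and use the leaf count as a proxy for the number of elements a subdomain will produce. The sizing function, the four-way split, and the stopping threshold here are assumptions for illustration, not the paper's estimator.

```python
def count_leaves(x, y, size, sizing, min_size=1e-3):
    """Recursively subdivide a square cell (lower-left corner x, y) until
    it is no larger than the local sizing function evaluated at its
    center; the leaf count approximates the local meshing workload."""
    if size <= max(sizing(x + size / 2, y + size / 2), min_size):
        return 1
    half = size / 2
    return sum(count_leaves(x + dx * half, y + dy * half, half, sizing, min_size)
               for dx in (0, 1) for dy in (0, 1))
```

With a uniform target size of 0.5 on the unit square this yields four leaves, i.e., one level of refinement; a coarser target than the cell itself yields a single leaf.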
2008
Generating finite-element meshes is a serious bottleneck for large parallel simulations. When mesh generation is limited to serial machines and element counts approach a billion, this bottleneck becomes a roadblock. pamgen is a parallel mesh generation library that allows on-the-fly scalable generation of hexahedral and quadrilateral finite element meshes for several simple geometries. It has been used to generate more than 1.1 billion elements on 17,576 processors. pamgen generates an unstructured finite element mesh on each processor at the start of a simulation. The mesh is specified by commands passed to the library as a C-language string. The resulting mesh geometry, topology, and communication information can then be queried through an API. pamgen allows specification of boundary condition application regions using sidesets (element faces) and nodesets (collections of nodes). It supports several simple geometry types. It has multiple alternatives for mesh grading. It has several alternatives for the initial domain decomposition. pamgen makes it easy to change details of the finite element mesh and is very useful for performance studies and scoping calculations.
Parallel Computing, 2016
In this paper, we propose a three-dimensional two-level Locality-Aware Parallel Delaunay image-to-mesh conversion algorithm (LAPD). The algorithm exploits two levels of parallelism at different granularities: coarse-grain parallelism at the region level (which is mapped to a node with multiple cores) and medium-grain parallelism at the cavity level (which is mapped to a single core). We employ a data-locality-aware mesh refinement process to reduce the latency caused by remote memory accesses. We evaluated LAPD on Blacklight, a cache-coherent NUMA distributed shared memory (DSM) machine in the Pittsburgh Supercomputing Center, and observed a weak scaling efficiency of almost 70% on roughly 200 cores, compared to only 30% for the previous algorithm, the Parallel Optimistic Mesh Generation algorithm (PODM). To the best of our knowledge, LAPD exhibits the best scalability among parallel Delaunay mesh generation algorithms running on NUMA DSM supercomputers.
SIAM Workshop on Combinatorial Scientific Computing, 2004
Meshes of high quality are an important ingredient for many applications in scientific computing. An accurate discretization of the problem geometry with elements of good aspect ratio is required by many numerical methods. In the finite element method, for example, interpolation error is related to the largest element angle in the mesh [1]. There is a critical need for algorithms that can generate meshes of provably high quality. For large-scale problems that require frequent remeshing (such as problems with evolving geometry), these algorithms must run in parallel on distributed memory machines. Whereas in recent years great strides have been made in parallel solvers, automatic parallel mesh generation for arbitrary domains remains an unsolved problem. Delaunay Refinement has proven useful for generating meshes of good aspect ratio. Provably good working algorithms that generate meshes for arbitrary domains exist in two dimensions. Efficient sequential implementations are available [2,3]. In three dimensions the problem is more challenging. Recent theoretical results [4] suggest algorithms to solve the general three dimensional meshing problem, but all sequential implementations available today can only cope with input that respects large angle bounds.
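The "good aspect ratio" criterion that drives Delaunay refinement can be made concrete with a short sketch: the circumradius-to-shortest-edge ratio rho of a triangle equals 1/(2 sin(theta_min)), so bounding rho from above bounds the smallest angle from below. This is the textbook quality measure, not code taken from the cited implementations.

```python
import math

def radius_edge_ratio(a, b, c):
    """Circumradius-to-shortest-edge ratio rho of triangle abc.
    rho = 1 / (2 sin(theta_min)), so a small rho means good quality;
    refinement targets triangles with rho above a chosen bound."""
    la = math.dist(b, c)
    lb = math.dist(c, a)
    lc = math.dist(a, b)
    # Circumradius R = (product of edge lengths) / (4 * area).
    area = 0.5 * abs((b[0] - a[0]) * (c[1] - a[1])
                     - (c[0] - a[0]) * (b[1] - a[1]))
    R = la * lb * lc / (4.0 * area)
    return R / min(la, lb, lc)
```

An equilateral triangle attains the minimum possible ratio, 1/sqrt(3), while a flattened "sliver" triangle scores much higher and would be flagged for refinement.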
2015
In this poster we present our preliminary results on the integration of multiple parallel Delaunay mesh generation methods into a coherent hierarchical framework. The goal of this project is to study our telescopic approach and to develop Delaunay-based methods that explore concurrency at all hardware layers, using abstractions at (a) the medium-grain level for the many cores within a single chip and (b) the coarse-grain level, i.e., the sub-domain level, using proper error-metric- and application-specific continuous decomposition methods.
Journal of Parallel and …, 2009
This article focuses on the optimization of PCDM, a parallel, two-dimensional (2D) Delaunay mesh generation application, and its interaction with parallel architectures based on simultaneous multithreading (SMT) processors. We first present the step-by-step effect of a series of optimizations on performance. These optimizations improve the performance of PCDM by up to a factor of six. They target issues that very often limit the performance of scientific computing codes. We then evaluate the interaction of PCDM with a real SMT-based SMP system, using both high-level metrics, such as execution time, and low-level information from hardware performance counters.
International Journal for Numerical …, 2003
We present the results of an evaluation study on the re-structuring of a latency-bound mesh generation algorithm into a latency-tolerant parallel kernel. We use concurrency at a fine-grain level to tolerate long, variable, and unpredictable latencies of remote data-gather operations required for parallel guaranteed-quality Delaunay triangulations. Our performance data from a 16-node SP2 and a 32-node Cluster of Sparc Workstations suggest that more than 90% of the latency from remote data-gather operations can be masked effectively, at the cost of increasing communication overhead to between 2 and 20% of the total run time. Despite the increase in communication overhead, the latency-tolerant mesh generation kernel we present in this paper can generate tetrahedral meshes for parallel field solvers eight to nine times faster than the traditional approach.
2011
Scientists commonly turn to supercomputers or Clusters of Workstations with hundreds (even thousands) of nodes to generate meshes for large-scale simulations. Parallel mesh generation software is then used to decompose the original mesh generation problem into smaller sub-problems that can be solved (meshed) in parallel. The size of the final mesh is limited by the amount of aggregate memory of the parallel machine. Also, requesting many compute nodes on a shared computing resource may result in a long wait, far surpassing the time it takes to solve the problem. These two problems (i.e., insufficient memory when computing on a small number of nodes, and long waiting times when using many nodes from a shared computing resource) can be addressed by using out-of-core algorithms. These are algorithms that keep most of the dataset out-of-core (i.e., outside of memory, on disk) and load only a portion in-core (i.e., into memory) at a time. We explored two approaches to out-of-core comp...
2015
In this paper, we describe an array-based hierarchical mesh generation capability through uniform refinement of unstructured meshes for efficient solution of PDEs using finite element methods and multigrid solvers. A multi-degree, multi-dimensional and multi-level framework is designed to generate the nested hierarchies from an initial mesh, which can be used for a number of purposes, from multi-level methods to generating large meshes. The capability is developed under the parallel mesh framework "Mesh Oriented dAtaBase", a.k.a. MOAB [16]. We describe the underlying data structures and algorithms used to generate such hierarchies and present numerical results for computational efficiency and mesh quality. We also present results to demonstrate the applicability of the developed capability to a multigrid finite-element solver.
2013
The ever-growing demand for higher accuracy in scientific simulations based on the discretization of equations given on physical domains is typically coupled with an increase in the number of mesh elements. Conventional mesh generation tools struggle to keep up with the increased workload, as they do not scale with increasingly available parallel hardware such as multi-core CPUs. We present a parallel mesh generation approach for multi-core and distributed computing environments based on our generic meshing library ViennaMesh and on the advancing-front mesh generation algorithm. Our approach is discussed in detail and performance results are shown.
SIAM Journal on Scientific Computing, 2006
We present a theoretical framework for developing parallel guaranteed-quality Delaunay mesh generation software that allows us to use commercial off-the-shelf sequential Delaunay meshers for two-dimensional geometries. In this paper, we describe our approach for constructing uniform meshes, in other words, meshes in which all elements have approximately the same size. Our uniform distributed- and shared-memory implementations are based on a simple (block) coarse-grained mesh decomposition. Our method requires only local communication, which is bulk and structured, as opposed to the fine and unpredictable communication of the other existing practical parallel guaranteed-quality mesh generation and refinement techniques. Our experimental data show that on a cluster of more than 100 workstations we can generate about 0.9 billion elements in less than 5 minutes in the absence of workload imbalances. Preliminary results for this paper were presented in [6]. Our work in progress includes extending the presented approach, which can efficiently generate only uniform meshes, to nonuniform graded meshes.
Lecture Notes in Computational Science and Engineering
Parallel mesh generation is a relatively new research area at the boundary between two scientific computing disciplines: computational geometry and parallel computing. In this chapter we present a survey of parallel unstructured mesh generation methods. Parallel mesh generation methods decompose the original mesh generation problem into smaller subproblems which are meshed in parallel. We organize the parallel mesh generation methods in terms of two basic attributes: (1) the sequential technique used for meshing the individual subproblems and (2) the degree of coupling between the subproblems. This survey shows that, without compromising the stability of parallel mesh generation methods, it is possible to develop parallel meshing software using off-the-shelf sequential meshing codes. However, more research is required for the efficient use of state-of-the-art codes that can scale from emerging chip multiprocessors (CMPs) to clusters built from CMPs.
Large-scale simulation is moving towards sustained teraflop rates. This simulation power enables the most cutting-edge large-scale resources to solve large and challenging problems in science and engineering. For finite element simulations, a significant challenge is how to generate meshes with billions of nodes and elements and how to deliver such meshes to the processors of such large-scale systems. In this work, we discuss strategies ranging from parametric mesh generators, suitable for simple geometries, to general approaches using octree-based meshes for immersed-boundary geometries.