In this paper we propose a fast broadcast scheme called segmented broadcast, and its performance is analyzed and compared to known schemes. Experiments are performed on IBM SP2 and Cray T3E under an MPI-1 environment to verify its validity. The results show that the speedup for long messages reaches two-fold or more as the number of processors grows. The scheme can also be applied to broadcast under MPI-2 or similar message-passing environments without any modification. The concept is being extended to other communication functions such as scatter, gather, and scan.
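The abstract does not spell out the scheme, so the following is only a rough cost sketch of why splitting a long message into segments can help. It assumes a linear pipeline of processors and a simple startup-plus-per-byte cost model; the function name and the parameters `alpha` and `beta` are hypothetical and are not taken from the paper.

```python
def chain_broadcast_time(p, msg_bytes, segments, alpha, beta):
    """Estimated time to pass a message down a chain of p processors.

    Assumptions (illustrative only): the message is cut into `segments`
    equal pieces, each hop costs alpha + beta * bytes, and a processor
    forwards a piece as soon as it has received it (pipelining).
    """
    if p < 2:
        return 0.0
    seg_bytes = msg_bytes / segments
    # The first piece needs p - 1 hops to reach the last processor;
    # each remaining piece arrives one step behind the previous one.
    steps = (p - 1) + (segments - 1)
    return steps * (alpha + beta * seg_bytes)


# Unsegmented versus ten segments, with toy cost parameters.
whole = chain_broadcast_time(p=5, msg_bytes=100, segments=1, alpha=1, beta=1)
split = chain_broadcast_time(p=5, msg_bytes=100, segments=10, alpha=1, beta=1)
```

Under this model the segmented transfer is cheaper for long messages because the per-byte cost is paid on small pieces that overlap in time, while too many segments eventually lose to the accumulated startup cost `alpha`.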
Proceedings of the 24th ACM symposium on Parallelism in algorithms and architectures - SPAA '12, 2012
Many-core chips with more than 1000 cores are expected by the end of the decade. To overcome scalability issues related to cache coherence at such a scale, one of the main research directions is to leverage the message-passing programming model. The Intel Single-Chip Cloud Computer (SCC) is a prototype of a message-passing many-core chip. It offers the ability to move data between on-chip Message Passing Buffers (MPB) using Remote Memory Access (RMA). Performance of message-passing applications is directly affected by efficiency of collective operations, such as broadcast. In this paper, we study how to make use of the MPBs to implement an efficient broadcast algorithm for the SCC. We propose OC-Bcast (On-Chip Broadcast), a pipelined k-ary tree algorithm tailored to exploit the parallelism provided by on-chip RMA. Using a LogP-based model, we present an analytical evaluation that compares our algorithm with the state-of-the-art broadcast algorithms implemented for the SCC. As predicted by the model, experimental results show that OC-Bcast attains almost three times better throughput, and improves latency by at least 27%. Furthermore, the analytical evaluation highlights the benefits of our approach: OC-Bcast takes direct advantage of RMA, unlike the other considered broadcast algorithms, which are based on a higher-level send/receive interface. This leads us to the conclusion that RMA-based collective operations are needed to take full advantage of hardware features of future message-passing many-core architectures.
2015 44th International Conference on Parallel Processing Workshops, 2015
The efficiency and scalability of MPI collective operations, in particular the broadcast operation, play an integral part in high performance computing applications. MPICH, one of the widely used contemporary MPI software stacks, implements the broadcast operation on top of point-to-point operations. Depending on parameters such as message size and process count, the library chooses among different algorithms, for instance binomial dissemination, recursive-doubling exchange, or ring all-to-all broadcast (allgather). However, the existing broadcast design in the latest release of MPICH does not provide good performance for large messages (lmsg) or for medium messages with non-power-of-two process counts (mmsg-npof2), because of the suboptimal inner ring allgather algorithm. In this paper, based on the native broadcast design in MPICH, we propose a tuned broadcast approach designed with bandwidth saving in mind for the lmsg and mmsg-npof2 cases. The native and tuned broadcast designs are compared for different data sizes and program sizes on a Cray XC40 cluster. The results show that the tuned broadcast design improves performance by 2% to 54% for lmsg and mmsg-npof2 in user-level testing.
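As a point of reference for the binomial dissemination mentioned above, here is a minimal sketch of one common formulation of a binomial-tree broadcast schedule. It is not taken from MPICH's source; the function name and the round ordering are illustrative assumptions.

```python
def binomial_broadcast_schedule(p, root=0):
    """Return a list of rounds; each round is a list of (sender, receiver) ranks.

    Ranks are renumbered relative to the root. In each round, every
    relative rank that already holds the message sends it to the rank
    `have` positions above it, so the set of holders doubles per round.
    """
    rounds = []
    have = 1  # number of relative ranks currently holding the message
    while have < p:
        pairs = []
        for rel in range(have):
            dst = rel + have
            if dst < p:
                pairs.append(((rel + root) % p, (dst + root) % p))
        rounds.append(pairs)
        have *= 2
    return rounds


schedule = binomial_broadcast_schedule(6)
```

For `p` processes this takes ⌈log2 p⌉ rounds, which suits short messages; for long messages every round resends the whole payload, which is why libraries switch to bandwidth-oriented schemes such as scatter followed by allgather.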
Theoretical Computer Science, 2015
Gurion University of the Negev, Israel
Discrete Applied Mathematics, 1994
We investigate the problem of broadcasting multiple messages in a message-passing system that supports simultaneous send and receive. The system consists of n processors, one of which has m messages to broadcast to the other n − 1 processors. The processors communicate in rounds. In each round, a processor can simultaneously send a message to one processor and receive a message from another processor. The goal is to broadcast the m messages among the n processors in the minimal number of communication rounds. The lower bound on the number of rounds required is (m − 1) + ⌈log n⌉. We present an algorithm for this problem that requires at most m + ⌈log n⌉ communication rounds, for any values of m and n.
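The bounds quoted above, read as (m − 1) + ⌈log n⌉ and m + ⌈log n⌉, can be tabulated with a small helper. The lower bound follows because the source sends one message per round, so the last of the m messages leaves no earlier than round m, and the number of processors holding any one message can at most double per round. The sketch assumes the logarithm is base 2; the function names are hypothetical.

```python
def ceil_log2(n):
    """Smallest r with 2**r >= n, for n >= 1, using integer arithmetic."""
    return (n - 1).bit_length()


def lower_bound_rounds(m, n):
    """Lower bound (m - 1) + ceil(log2 n) on rounds to broadcast m messages."""
    return (m - 1) + ceil_log2(n)


def algorithm_upper_bound_rounds(m, n):
    """Upper bound m + ceil(log2 n) stated for the algorithm in the abstract."""
    return m + ceil_log2(n)


# Gap between the two bounds for a few (m, n) pairs.
gaps = [algorithm_upper_bound_rounds(m, n) - lower_bound_rounds(m, n)
        for m, n in [(1, 2), (3, 8), (10, 1000)]]
```

The gap is one round for every m and n, so the algorithm is within a single round of optimal in this model.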
Networks, 1997
Broadcasting is a widely used operation in many message-passing systems. Most existing broadcasting algorithms, however, do not address several emerging trends in distributed-memory parallel computers and high-speed communication networks. These trends include (i) treating the system as a fully connected collection of processors, (ii) packetizing large data into sequences of messages, and (iii) tolerating communication latencies. This paper explores the broadcasting problem in the postal model, which addresses these issues. We provide two algorithms for broadcasting m messages in a message-passing system with n processors and communication latency λ. A lower bound on the time for this problem is (m − 1) + f_λ(n), where f_λ(n) is the optimal time for broadcasting one message. We present Algorithm PARTITION, which takes at most 2m + f_λ(n) + O(λ) time, and Algorithm D-D-TREES, which takes at most m + 2f_λ(n) + O(λ) time.
IEEE Transactions on Parallel and Distributed Systems, 1997
We present efficient algorithms for two all-to-all communication operations in message-passing systems: index (or all-to-all personalized communication) and concatenation (or all-to-all broadcast). We assume a model of a fully connected message-passing system, in which the performance of any point-to-point communication is independent of the sender-receiver pair. We also assume that each processor has k ≥ 1 ports, through which it can send and receive k messages in every communication round. The complexity measures we use are independent of the particular system topology and are based on the communication start-up time and on the communication bandwidth.
IEEE Transactions on Parallel and Distributed Systems, 1995
We consider the problem where broadcast requests are generated at random time instants at each node of a multiprocessor network. In particular, in our model packets arrive at each node of a network according to a Poisson process, and each packet has to be broadcast to all the other nodes. We propose an on-line decentralized routing scheme to execute the broadcasts in this dynamic environment. A related, although static, communication task is the partial multinode broadcast task, where M < N arbitrary nodes of an N-processor network broadcast a packet to all the other nodes. The results that we obtain for the dynamic broadcasting scheme apply to any topology, regular or not, for which partial multinode broadcast algorithms with certain properties can be found. For the dynamic scheme we find an upper bound on the average delay required to serve a broadcast request, and we evaluate its stability region. As an application we give a near-optimal partial multinode broadcast algorithm for the hypercube network. The stability region of the corresponding hypercube dynamic scheme tends to the maximum possible as the number of nodes of the hypercube tends to infinity. Furthermore, for any fixed load in the stability region, the average delay is of the order of the diameter of the hypercube.
1992
We present efficient algorithms for broadcasting multiple messages. We assume n processors, one of which contains m packets that it must broadcast to each of the remaining n − 1 processors. The processors communicate in rounds. In one round each processor is able to send one packet to any other processor and receive one packet from any other processor. We give a broadcasting algorithm which requires m + log n + 3 log log n + 15 rounds. In addition, we show a simple lower bound of m + ⌈log n⌉ − 1 rounds for broadcasting in this model.
Proceedings 1998 International Conference on Parallel and Distributed Systems (Cat. No.98TB100250), 1998
The dynamic broadcast problem is the communication problem where source packets to be broadcast to all the other nodes are generated at each node of a parallel computer according to a certain random process, such as a Poisson process. The lower bound on the average reception delay required by any oblivious dynamic broadcast algorithm in a d-dimensional hypercube is Ω(d + 1/(1 − ρ)) when packets are generated according to a Poisson process, where ρ is the load factor. The best previous algorithms for hypercubes only achieve Ω(d/(1 − ρ)) average reception delay. In this paper, we propose dynamic broadcast algorithms that require optimal O(d + 1/(1 − ρ)) average reception delay in d-dimensional hypercubes and n_1 × n_2 × ⋯ × n_d tori with n_i = O(1). We apply the proposed broadcast scheme to a variety of other network topologies for efficient dynamic broadcast and present several methods for assigning priority classes to packets.
2012 International Conference on High Performance Computing & Simulation (HPCS), 2012
The delay of instruction broadcast has a significant impact on the performance of Single Instruction Multiple Data (SIMD) architectures. This is especially true for massively parallel processing Systems-on-Chip (mppSoC), where the processing stage and the setup of the communication mechanism need several clock periods. Subnetting is the strategy used to partition a single physical network into several smaller logical sub-networks (subnets). This technique gives better control over the domain of broadcast instructions and the data traffic between network nodes. Furthermore, it allows synchronous communications to be separated from asynchronous processing, which maintains reliable communications and rapid processing across parallel processors. This paper describes the design of a communication model called broadcast with mask. This model is dedicated to mppSoC architectures with a huge number of processor elements because it maintains performance even when the number of processors increases. Simulation results and an FPGA implementation validate our approach.
Microprocessors and Microsystems, 2002
A general-purpose multicomputer network must be able to efficiently handle broadcast communication because it is required by many parallel applications. Previous studies have focused mainly on the design of efficient broadcast algorithms. As a result, there has been hardly any work that assesses the performance of existing networks to handle this kind of communication. In an effort to fill this gap, this paper examines the relative performance merits of the low-dimensional k-ary n-cube and the hypermesh in the presence of broadcast traffic. The former network has been one of the widely popular networks in current practical multicomputers, while the latter has recently been proposed, and shown to exhibit attractive topological properties. The results reveal that owing to its hypergraph topology, the hypermesh represents a potential candidate as a high-performance network for future multicomputers as it provides better support for broadcast communication than the k-ary n-cube.
Networks, 1995
Broadcasting refers to the process of dissemination of a set of messages originating from one node to all other nodes in a communication network. We assume that, at any given time, a node can transmit a message along at most one incident link and simultaneously receive a message along at most one incident link. We first present an algorithm for determining the amount of time needed to broadcast k messages in an arbitrary tree. Second, we show that, for every n, there exists a graph with n nodes whose k-message broadcast time matches the trivial lower bound ⌈log n⌉ + k − 1 by designing a broadcast scheme for complete graphs. We call those graphs minimal broadcast graphs. Finally, we construct an n-node minimal broadcast graph with fewer than (⌈log n⌉ + 1) · 2^(⌈log n⌉ − 1) edges.
HAL (Le Centre pour la Communication Scientifique Directe), 2016
2002
Data broadcasting as a means of efficient data dissemination is a key technology facilitating ubiquitous computing. For this reason, broadcast scheduling algorithms have received a lot of attention. However, all existing algorithms make the core assumption that the data items to be broadcast are immediately available in the transmitter's queue, ignoring the key role that the disk subsystem and the cache management play in the overall broadcast system performance. With this paper we contribute a comprehensive system's perspective towards the development of high performance broadcast systems, taking into account how broadcast scheduling, disk scheduling, and cache management algorithms affect the overall performance. We contribute novel techniques that ensure an efficient interplay between broadcast, cache management, and disk scheduling. We study comprehensively the performance of the broadcast server, as it consists of the broadcast scheduling, the disk scheduling, the cache management algorithms, and the transmitter. Our results show that the contributed algorithms yield considerably higher performance. Furthermore, one of our algorithms is shown to enjoy considerably higher performance, under all values of the problem and system parameters. A key contribution is the result that broadcast scheduling algorithms have only a small effect on the overall system performance, which necessitates the definition of different focal points for efforts towards high performance data broadcasting.
IEEE Transactions on Computers, 2000
The paper addresses ways in which one can use "broadcast communication" in distributed algorithms and the relevant issues of design and complexity. We present an algorithm for merging k sorted lists of n/k elements using k processors and prove its worst case complexity to be 2n, regardless of the number of processors, while neglecting the cost arising from possible conflicts on the broadcast channel. We also show that this algorithm is optimal under single-channel broadcast communication. In a variation of the algorithm, we show that by using an extra local memory of O(k) the number of broadcasts is reduced to n. When the algorithm is used for sorting n elements with k processors, where each processor sorts its own list first and the lists are then merged, it has a complexity of O((n/k) log(n/k) + n), and is thus asymptotically optimal for large n. We also discuss the cost incurred by the channel access scheme and prove that resolving conflicts whenever k processors are involved introduces a cost factor of at least log k.
Many scientific applications running on distributed memory systems consume a substantial fraction of their total execution time exchanging data between processes. Thus, improving the performance of communication routines can significantly speed up overall application execution. In this context, we investigate the usage of a high-speed compression algorithm for floating-point data in the transmission of long messages, in particular broadcasts. We incorporated this algorithm in the broadcast primitive of an MPI library and evaluate the performance on different types of messages for up to 4096 processing cores. The results show that compression can significantly accelerate broadcasts.
2004
Modern high performance applications require efficient and scalable collective communication operations. Currently, most collective operations are implemented based on point-to-point operations. In this paper, we propose to use hardware multicast in InfiniBand to design fast and scalable broadcast operations in MPI. InfiniBand supports multicast with Unreliable Datagram (UD) transport service. This makes it hard to be directly used by an upper layer such as MPI. To bridge the semantic gap between MPI_Bcast and InfiniBand hardware multicast, we have designed and implemented a substrate on top of InfiniBand which provides functionalities such as reliability, in-order delivery and large message handling. By using a sliding-window based design, we improve MPI_Bcast latency by removing most of the overhead in the substrate out of the communication critical path. By using optimizations such as a new co-root based scheme and lazy ACK, we can further balance and reduce the overhead. We have also addressed many detailed design issues such as buffer management, efficient handling of out-of-order and duplicate messages, timeout and retransmission, flow control and RDMA based ACK communication.
Recent Advances in Parallel Virtual Machine and …, 2004
Improving communication performance is an important issue in cluster systems. This paper investigates the possibility of accelerating group communication at the level of message passing libraries. A new algorithm for implementing the broadcast communication primitive will be introduced. It enhances the performance of fully-switched cluster systems by using message decomposition and asynchronous communication. The new algorithm shows the dynamism and the portability of the software solutions, while it has a constant asymptotic time complexity achieved only with hardware support before. Test measurements show that the algorithm really has a constant time complexity, and in certain cases it can outperform the widely used binary tree approach by 100 percent. The presented algorithm can be used to increase the performance of broadcasting, and can also indirectly speed up various group communication primitives used in standard message passing libraries.
2005 International Conference on Parallel Processing Workshops (ICPPW'05), 2005
Broadcast Communication is among the most primitive collective capabilities of any message passing network. Broadcast algorithms for the mesh have been widely reported in the literature. However, most existing algorithms have been studied within limited conditions, such as light traffic load and fixed network sizes. In other words, most of these algorithms have not been studied at different Quality of Service (QoS) levels. In contrast, this study examines the broadcast operation, taking into account the scalability, parallelism, a wide range of traffic loads through the propagation of broadcast messages. To the best of our knowledge, this study is the first to consider the issue of broadcast latency at both the network and node levels across different traffic loads. Results are shown from a comparative analysis confirming that the coded-path based broadcast algorithms exhibit superior performance characteristics over some existing algorithms.
Journal of Systems Architecture, 2005
Maximising the performance of parallel systems requires matching message-passing algorithms and application characteristics with a suitable underlying interconnection network. Broadcast algorithms for wormhole-switched meshes have been widely reported in the literature. However, most of these algorithms handle broadcast in a sequential manner and do not scale well with the network size. As a consequence, many parallel applications cannot be efficiently supported using existing techniques. Motivated by these observations, this paper presents a new efficient broadcast algorithm for the mesh, called the Plane-Based (PB) algorithm. The main feature of this approach is its ability to perform broadcast operation with a high degree of scalability and parallelism. Furthermore, performance is insensitive to the network size, i.e., only three message-passing steps are required to implement a broadcast operation irrespective of the network size. Results from a comparative analysis demonstrate that the PB algorithm exhibits superior performance characteristics over those of the well-known Recursive Doubling and Extending Dominating Node algorithms.