Scaling alltoall collective on multi-core systems

Rahul Kumar; Amith Mamidala; D. K. Panda

Scaling alltoall collective on multi-core systems

Rahul Kumar

2008, 2008 IEEE International Symposium on Parallel and Distributed Processing

Sign up for access to the world's latest research

checkGet notified about relevant papers

checkSave papers to use in your research

checkJoin the discussion with peers

checkTrack your impact

Abstract

MPI Alltoall is one of the most communication intense collective operation used in many parallel applications. Recently, the supercomputing arena has witnessed phenomenal growth of commodity clusters built using InfiniBand and multi-core systems. In this context, it is important to optimize this operation for these emerging clusters to allow for good application scaling. However, optimizing MPI Alltoall on these emerging systems is not a trivial task. InfiniBand architecture allows for varying implementations of the network protocol stack. For example, the protocol can be totally on-loaded to a host processing core or it can be off-loaded onto the NIC or can use any combination of the two. Understanding the characteristics of these different implementations is critical in optimizing a communication intense operation such as MPI Alltoall. In this paper, we systematically study these different architectures and propose new schemes for MPI Alltoall tailored to these architectures. Specifically, we demonstrate that we cannot use one common scheme which performs optimally on each of these varying architectures. For example, on-loaded implementations can exploit multiple cores to achieve better network utilization, and in offload interfaces aggregation can be used to avoid congestion on multi-core systems. We employ shared memory aggregation techniques in these schemes and elucidate the impact of these schemes on multi-core systems. The proposed design achieves a reduction in MPI Alltoall time by 55% for 512Byte messages and speeds up the CPMD application by 33%.

Amith R Mamidala

2006

MPI_Allgather is an important collective operation which is used in applications such as matrix multiplication and in basic linear algebra operations. With the next generation systems going multi-core, the clusters deployed would enable a high process count per node. The traditional implementations of Allgather use two separate channels, namely network channel for communication across the nodes and shared memory channel for intra-node communication. An important drawback of this approach is the lack of sharing of communication buffers across these channels. This results in extra copying of data within a node yielding sub-optimal performance. This is true especially for a collective involving large number of processes with a high process density per node. In the approach proposed in the paper, we propose a solution which eliminates the extra copy costs by sharing the communication buffers for both intra and inter node communication. Further, we optimize the performance by allowing overlap of network operations with intra-node shared memory copies. On a 32, 2-way node cluster, we observe an improvement upto a factor of two for MPI_Allgather compared to the original implementation. Also, we observe overlap benefits upto 43% for 32x2 process configuration.

Log In

Scaling alltoall collective on multi-core systems

Sign up for access to the world's latest research

Abstract

Related papers

Related topics