Shared Memory-Aware Latency-Sensitive Message Aggregation for Fine-Grained Communication

Kavitha Chandrasekar and Laxmikant Kale
Dept. of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA

arXiv:2411.03533v1 [cs.DC] 5 Nov 2024

Copyright © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract—Message aggregation is often used with the goal of reducing communication cost in HPC applications. The difference in order of magnitude between the overhead of sending a message and the cost per byte transferred motivates message aggregation for several irregular, fine-grained messaging applications such as graph algorithms and parallel discrete event simulation (PDES). While message aggregation is frequently used in the "MPI-everywhere" model, to coalesce messages between processes mapped to cores, such aggregation across threads in a process, say in MPI+X models or Charm++ SMP (Shared Memory Parallelism) mode, is often avoided. Within-process coalescing is likely to require synchronization across threads and can lead to performance issues from contention. As a result, however, SMP-unaware aggregation mechanisms may not fully utilize the aggregation opportunities available to applications in SMP mode. Additionally, while the benefit of message aggregation is often analyzed in terms of reducing overhead, specifically the per-message cost, we also analyze different schemes that can aid in reducing message latency, i.e., the time from when a message is sent to the time when it is received. Message latency can affect several applications, such as PDES with speculative execution, where reducing message latency could result in fewer rollbacks. To address these challenges, in our work we demonstrate the effectiveness of shared memory-aware message aggregation schemes for a range of proxy applications with respect to messaging overhead and latency.

Index Terms—Message aggregation, SMP, Charm++, runtime

I. INTRODUCTION

Message aggregation libraries aim to provide an interface for applications to consolidate messages, in order to reduce the communication cost associated with processing many messages. Previous aggregation schemes such as TRAM [1] in Charm++ and the streaming library Active Pebbles [2] use topology-aware routing schemes, but these are less beneficial for modern topologies such as fat-trees.

Modern machines have many tens of cores per physical node (also called "host"). Applications tend to use them in two modes. In one mode there is a separate process bound to each core used by the application; this is called MPI-everywhere for MPI applications, and non-SMP mode for Charm++ applications. The other mode uses multiple processes per node, but each process owns multiple cores and facilitates shared memory parallelism (SMP) across the cores within a process. For MPI applications, parallelism within a process and across cores is managed by one of the thread-parallel libraries such as OpenMP or Kokkos. Charm++ does not require a separate programming model for within-process communication, but utilizes the shared memory for optimizing inter-object communication and cooperation. It is this SMP mode that is the focus of our work. This SMP scenario further amplifies the need for message aggregation.

For the MPI-everywhere mode, systems like YGM [3] have recently looked into implementing node-aware schemes for message aggregation. YGM's mechanism for coalescing messages at the node level uses a phase of node-local exchange, which relies on the underlying MPI implementation's method of achieving cheaper within-node communication (e.g., xpmem or cma). But the context we aim at is one where worker threads divide the application's work among cores, yet need to exchange many short messages with other workers in a shared memory setting.

The SMP mode, in Charm++ as well as MPI, presents several potential advantages to applications. Worker threads can share and divide work more efficiently, for better load balance (using dynamic schedules in OpenMP, or stealable tasks and object migration in Charm++). Large read-only data structures can be shared among workers without making multiple copies. Even dynamic data structures, such as software particle (or tree) caches in astronomy codes, can save duplicate external (across-node) communications. These orthogonal benefits may make SMP mode attractive independent of message-aggregation considerations.

The questions we explore in this paper are: Given that an application wants to operate in SMP mode, what options are there for organizing message aggregation, and how do they compare with each other? Does SMP-mode aggregation by itself provide benefits over non-SMP aggregation that make SMP mode attractive even for applications that don't particularly require it? Finally, what metrics are important for such comparisons?

The main metric that has motivated research on message aggregation has been overhead. The cost of sending a message over the network can be computed using the alpha-beta communication model as α + Nβ, where N is the number of bytes. Here α is the latency per message and β (the inverse of bandwidth) is the cost per byte transferred.
To illustrate these costs, we measure the time taken to send a message (round-trip time/2) between two physical nodes on the Delta supercomputer using a ping-pong benchmark. As we can see in Figure 1, the time for small message sizes is dominated by the latency cost (α) and is on the order of microseconds. The cost of sending an additional byte of data (β), on the other hand, is about 0.1 nanosecond (indicating a bandwidth of about 12 GB/s). The primary motivating factor for message aggregation is this discrepancy in the order of magnitude of the α and β values: while α is in microseconds, β is less than a nanosecond per byte, motivating fine-grained messaging applications to aggregate many small messages into fewer large messages.

Fig. 1. Ping-pong between two physical nodes (Delta supercomputer): time (RTT/2, in µs) vs. message size from 1 B to 2 MB. The time to send a message does not change for small byte counts, suggesting it is dominated by the latency α of microseconds.
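To make this gap concrete, consider a back-of-the-envelope calculation using illustrative values consistent with the measurements above (the specific numbers are assumptions for illustration, not measurements from this paper: α ≈ 2 µs per message, β ≈ 0.1 ns per byte, 8-byte items):

    1000 items sent as individual messages: 1000 × (α + 8β) ≈ 1000 × 2.0008 µs ≈ 2 ms.
    1000 items coalesced into one 8000-byte message: α + 8000β ≈ 2 µs + 0.8 µs ≈ 2.8 µs.

Under these assumptions, aggregation reduces the total send cost by roughly three orders of magnitude; the rest of the paper is about obtaining this saving without paying too much for it in latency.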
For clarity in the usage of terms, we will use "items" to refer to the short messages the application wishes to send, and reserve "messages" for the aggregated data that the aggregation library sends via the underlying communication mechanism. The use cases for message aggregation range from all-to-all communication in MPI, where every rank wishes to send a relatively small number of items to every other rank, to streaming scenarios encountered in graph algorithms or in the well-known random-access benchmark from the HPC Challenge suite, where each worker continually generates a stream of items to others. While in the former case one can anticipate an end point, where the application signifies that the contribution from a worker has ended, in the latter case, although such an end point may exist at some time, one must design the library for a continuous flow of information. The distinction between the two cases is only quantitative, but may require different aggregation techniques. In contrast to both of these, consider a parallel discrete event simulation with an optimistic (or speculative) protocol: here, items are events with a payload and a time-stamp associated with them. Destinations (logical processes or agents) that receive items in time-stamp order are efficient, but out-of-order delivery leads to cascading rollbacks with high overhead.

Observing the behavior of many such applications, and by virtue of the co-design of our new aggregation library with a graph algorithm, we inferred that, aside from its effect on communication cost, a key metric influenced by message aggregation is latency. Different categories of applications are affected differently by latency. We define latency here as the time from when an item is generated by the application to when it is delivered to the application on another processor. While aggregation reduces the cost of sending items, it increases the latency associated with each item: as items are buffered, latency is higher than without any aggregation. In extreme cases, for applications that are highly latency sensitive, message aggregation may have to be avoided. But in practice, the large α cost of short messages means that we have to deal with the dual goals of reducing latency while simultaneously reducing overhead.

This paper presents multiple schemes for aggregation in the context of SMP applications (where the cores on a node are organized into multiple processes, with multiple cores assigned to each), and compares them on the dual metrics of overhead and latency, for multiple applications represented by benchmarks. The schemes vary based on the level at which the aggregation is done: core/worker, or process, on both the sending and the receiving side. We analyze the benefits of SMP-aware message aggregation with the different schemes, for different types of applications. The contributions of this paper are as follows.
• We analyze message aggregation that is SMP-aware, with one buffer per destination node (or process), or one buffer per source node, or both.
• We analyze the above schemes and how buffer sizes affect latency and overhead.
• We describe a series of features engendered by application co-design, which make our library (called TramLib) versatile and responsive to the varying needs of applications.
We characterize the latency and overhead of the different schemes considered as follows. We use two common benchmarks, namely histogramming and index-gather, which allow isolated measurements of overhead and latency. We also demonstrate the usefulness of our library on irregular applications, including a graph application and PDES, via benchmarks representative of those.

II. BACKGROUND

The previous implementation of aggregation in Charm++ [1] focused on topology-aware routing; it aggregated items at the PE (core) level and routed them to the destination PE using a multi-dimensional routing scheme. In contrast, the new SMP-aware aggregation algorithm that we discuss in this paper, embodied in a library called TramLib, aims to reduce the latency of the messages being aggregated.

Our work is developed in the context of the Charm++ parallel programming system, and uses its features, including message-driven execution. Charm++ divides work into units called chares. Overdecomposition, a key feature of Charm++, allows multiple chares to be mapped to a processing element (PE), typically a core. Allowing multiple chares on a PE enables overlap of computation and communication. It also allows the Charm++ runtime to migrate chares to provide functionality such as load balancing, shrink/expand, fault tolerance, and others.
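To make the chare and message-driven execution model concrete, the following is a minimal sketch of a Charm++ program with a 1D chare array whose elements receive data through an entry method. It is illustrative only: the module and method names are hypothetical, and this is not TramLib code.

    // itemdemo.ci -- Charm++ interface file (hypothetical names)
    // mainmodule itemdemo {
    //   mainchare Main { entry Main(CkArgMsg* m); };
    //   array [1D] Receiver {
    //     entry Receiver();
    //     entry void recvItem(int value);   // invoked when a message for this element arrives
    //   };
    // };

    // itemdemo.C -- corresponding C++ code
    #include "itemdemo.decl.h"

    class Main : public CBase_Main {
     public:
      Main(CkArgMsg* m) {
        delete m;
        // One array element per PE here; overdecomposition would create more elements than PEs.
        CProxy_Receiver receivers = CProxy_Receiver::ckNew(CkNumPes());
        receivers[0].recvItem(42);   // asynchronous send; the runtime delivers it to element 0
      }
    };

    class Receiver : public CBase_Receiver {
     public:
      Receiver() {}
      Receiver(CkMigrateMessage*) {}   // allows the runtime to migrate elements
      void recvItem(int value) {
        CkPrintf("Element %d on PE %d received %d\n", thisIndex, CkMyPe(), value);
        CkExit();
      }
    };

    #include "itemdemo.def.h"

Each call on a proxy, such as receivers[0].recvItem(42), becomes a message; TramLib's job, described in the next section, is to coalesce many such fine-grained sends.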
Shared memory parallelism with a large number of cores presents a new set of challenges for message aggregation. Existing systems like Active Pebbles do not support SMP, to avoid dealing with the overhead caused by atomics, locks, and contention within a node. Some systems like Conveyors [4] assume that communication within a (multithreaded) process will happen using shared memory, and focus on messages sent from process to process. YGM [5] allows for coalescing of messages from source cores in a node or to destination cores in a node. It does so by performing node-local sends and receives, which rely on the efficient underlying implementation of local sends/receives instead of directly utilizing shared memory features. The older TRAM system in Charm++ was also extended to support SMP and multiple cores, by assigning the cores in a physical node as an extra dimension in its multi-dimensional routing scheme. However, grouping each core into a hyperplane does not help reduce internode messages. This was evidenced by the suboptimal performance in many SMP benchmarks and applications. Other works rely on POSIX shared memory [6] for communicating between processes on the same physical node.

III. AGGREGATION SCHEMES

A. SMP woes for fine-grained computations

In spite of the potential benefits of SMP mode, initial experiments in using message aggregation in SMP mode demonstrated a huge performance gap: SMP mode was more than 10 times slower than non-SMP mode in the simple histogramming benchmark of the Bale suite [7].

To analyze this issue, we wrote a simple benchmark called PingAck, which runs on two physical nodes. Each worker PE on the first physical node sends 1000 messages of a given size to the corresponding PE or core on the second physical node. Each PE on the second node sends an ack to PE-0 after receiving all its messages. We measure the total time as the time from the start of the sends on PE-0 to the time the ack is received. Figure 2 shows the communication pattern between chares on PEs in the PingAck benchmark. With this benchmark, we expect that each PE on node-0 simultaneously sending 1000 messages to PEs on node-1 creates a heavy load for the underlying communication processing, but only for sending on node-0 and receiving on node-1, allowing for better analysis.

Fig. 2. PingAck benchmark: communication pattern between chares on the PEs of two physical nodes.

Next, we compare the performance of the PingAck benchmark in SMP and non-SMP mode in Figure 3, run on 2 physical nodes of the Delta supercomputer at NCSA. Multiple trials, including initial warm-up ones, are used to generate error bars. In the figure, the first and second bars show the non-SMP version with 64 PEs and the SMP version with 64 PEs. The SMP version with 64 PEs is about 5× slower than the non-SMP version!

Fig. 3. PingAck benchmark: SMP (with different process counts) vs. non-SMP on 2 physical nodes.

To understand this (and the remaining bars), note that Charm++ uses one dedicated core within each process for communication. This design has worked well in the broader context of "normal" applications, which do significant computation for each word of communication. However, as confirmed with additional experiments using PingAck, for fine-grained applications, if the amount of work per word of communication is less than 167 nanoseconds, the communication thread (commThread) itself becomes a serializing bottleneck. As we use more processes per node, each with its own dedicated commThread, performance significantly improves (as seen in the figure). The number of messages sent by each PE on node 0 is adjusted with each configuration, so that the total number of messages from node 0 to node 1 remains the same. This alone was not sufficient: operating system daemons and activities such as gpgpu callbacks affect 1 core in each process, and in fine-grained applications that core needs to be set aside using explicit core-mapping primitives. With these two mechanisms, it becomes possible to overcome most of the communication-related disadvantages of SMP mode for fine-grained applications.

Recent work by Zambre et al. [8], [9] has identified hindrances in the use of multiple NICs and concurrency in communication as an additional issue. Using many processes per node also partially mitigates those concerns. Although we have identified additional optimizations for enhancing the communication performance of SMP mode, the above suffice for our current purpose.

B. Basic aggregation schemes in the SMP context

The basic scheme is one in which the aggregation is done at the level of a worker (i.e., a PE in Charm++ terminology, corresponding to a Pthread bound to a core) at both the source and the destination, shown in Figure 4. Concretely, if there are w workers, each worker maintains w−1 buffers for items going to every other worker. A buffer is sent to the destination as a message when it is full or when the application asks TramLib to flush accumulated items. This scheme, denoted WW, does not leverage the shared memory across workers.

Fig. 4. WW: the source worker maintains a buffer per destination worker.

In the second scheme, each worker aggregates messages by destination process (maintaining a buffer for each of the p processes), shown in Figure 5. Since the items need to be delivered to individual workers, it is necessary to sort (or group) items within the message at some stage. We considered two variants for this purpose. In the first variant, the sorting (denoted by s) or grouping of items is performed at the destination process (Figure 5), denoted WPs. In the other variant, shown in Figure 6, the sorting or grouping is performed at the source worker, denoted WsP.

Fig. 5. WPs: the source worker maintains a buffer per destination process; grouping by destination worker is done at the destination process.

Fig. 6. WsP: the source worker maintains a buffer per destination process; grouping by destination worker is done at the source worker.

In the next scheme, denoted PP, we aggregate messages on each source process, by destination process. That is, on each process there is only one buffer for each target process, to which all the workers within the process contribute. This coalescing in the source process is achieved using atomics. This is illustrated in Figure 7.

Fig. 7. PP: the source process maintains a buffer per destination process.

The same grouping techniques can be extended one level up, to the physical node, if it houses multiple processes. Also, in the past, multi-dimensional routing across nodes had been considered, where messages going from one physical node to another go via one or more intermediary nodes, which act like hubs in a routing network and help reduce overhead further. Although practically useful, we omit the discussion of these higher-level aggregation schemes from this paper, as they are orthogonal to the SMP considerations that are its focus.

In all of the schemes, a few basic optimizations are applied: messages sent by flush operations are resized, so that empty portions of the aggregation buffer are not sent. Buffers can be flushed, optionally, when the processor is idle, when triggered by the application, or by a timeout. Charm++'s expedited methods are used to prioritize TramLib messages over ordinary application messages.

The aggregation capability is implemented as a Charm++ library. At initialization time, the user code passes a pointer to the Charm++ object and the function to which data needs to be delivered. When items are inserted, TramLib checks whether the data item count for the destination has reached the buffer size. If so, the buffer of aggregated data items is sent to the receiver chare in TramLib. The receiver then returns the data to the application. There are a few variations in this implementation depending on the type of scheme.
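The following is a minimal sketch of the buffering logic that all the schemes share; it is illustrative only and does not reproduce the TramLib API. The network send is abstracted as a callback, and a destination's buffer is emitted either when it fills up or when the application flushes.

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Illustrative sketch of per-destination aggregation (not the actual TramLib API).
    // Items destined for the same target are appended to that target's buffer; a full
    // buffer, or an application-triggered flush, produces one aggregated "message".
    template <typename Item>
    class AggregationBuffers {
     public:
      using SendFn = std::function<void(int dest, std::vector<Item>&& message)>;

      AggregationBuffers(int numDests, std::size_t bufferSize, SendFn send)
          : buffers_(numDests), bufferSize_(bufferSize), send_(std::move(send)) {}

      // Insert one item for a destination; send the buffer when it reaches bufferSize.
      void insert(int dest, const Item& item) {
        auto& buf = buffers_[dest];
        buf.push_back(item);
        if (buf.size() >= bufferSize_) {
          send_(dest, std::move(buf));
          buf = {};
          buf.reserve(bufferSize_);
        }
      }

      // Flush all partially filled buffers (e.g., at the end of a phase, on idle, or on a timeout).
      void flush() {
        for (int dest = 0; dest < static_cast<int>(buffers_.size()); ++dest) {
          if (!buffers_[dest].empty()) {
            send_(dest, std::move(buffers_[dest]));
            buffers_[dest] = {};
          }
        }
      }

     private:
      std::vector<std::vector<Item>> buffers_;
      std::size_t bufferSize_;
      SendFn send_;
    };

In WW each worker would own one such object with a buffer per destination worker; in WPs and WsP the buffers are per destination process; and in PP a single object per process is shared by all of its workers, which is why the PP insert path needs atomics (or some other synchronization).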

C. Analysis

Below, we analyze the following costs associated with node-aware TramLib:
1) Memory overhead
2) Number of messages sent
3) Message send cost
4) Processing delays at the receiver

Memory overhead: We use g to denote the number of items in a TRAM buffer and m to denote the size of each item, similar to the notation used in the previous work on TRAM [1]. With node-aware TramLib, we do not use the topological multi-dimensional routing of the previous TRAM. So, given the total number of processes N, and assuming t workers per process (i.e., Pthreads bound to disjoint cores), the memory overheads related to buffer allocation for the different schemes are:
• WW: aggregating messages on each source PE per destination PE: g ∗ m ∗ N ∗ t per core, or g ∗ m ∗ N ∗ t² per process
• WPs, WsP: aggregating messages on each source PE per destination process: g ∗ m ∗ N per core, or g ∗ m ∗ N ∗ t per process
• PP: aggregating messages at each source process (instead of per PE) per destination process: g ∗ m ∗ N per source process

However, the overall impact of these alternatives is complicated by the fact that the optimal buffer size g would be different for each of the schemes, and also because, for some applications, the latency of individual messages may be important.
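For a sense of scale, assume (illustrative values only, roughly matching the per-node configuration used in Section IV) g = 1024 items, m = 8 bytes per item, N = 8 processes, and t = 8 workers per process:

    WW:       g ∗ m ∗ N ∗ t² = 1024 · 8 · 8 · 64 ≈ 4 MB of buffer space per process
    WPs, WsP: g ∗ m ∗ N ∗ t  = 1024 · 8 · 8 · 8  ≈ 512 KB per process
    PP:       g ∗ m ∗ N      = 1024 · 8 · 8      ≈ 64 KB per process

The per-worker schemes thus pay for their freedom from synchronization with a memory footprint that grows with t or t².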
Number of messages sent: For a buffer size g, if z items are sent from each source PE, the number of messages sent for the different schemes is:
• WW: the number of messages sent per source PE has a lower bound of z/g and an upper bound of z/g + N ∗ t
• WPs, WsP: the number of messages sent per source PE has a lower bound of z/g and an upper bound of z/g + N
• PP: with source-process aggregation, the number of messages sent per source process has a lower bound of z/g and an upper bound of z/g + N

This assumes no intermittent flushing is triggered by the application, except once at the end. The upper bound is reached when partially filled buffers have to be flushed at the end of the communication phase. It can be seen that for long streaming applications, where z >> g, the second term in the upper bound is vanishingly small, and so the schemes are almost identical in the number of messages sent. But for short streams of items, such as in a short all-to-all operation, aggregating by the destination process (i.e., using one buffer for all items going to different destination PEs belonging to the same destination process) has the least number of messages.
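As an illustration (assumed values in line with the weak-scaling runs in Section IV: z = 1,000,000 items per source PE, g = 1024, N = 64 processes, t = 8 workers per process): z/g ≈ 977 messages, while the end-of-phase flush adds at most N ∗ t = 512 extra messages per source PE for WW, but only N = 64 for WPs/WsP (and 64 per source process for PP). With a short stream of, say, z = 64K items, z/g is only 64, so WW's flush term (up to 512) dominates the message count, while WPs/WsP add at most 64.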
Message send cost: Assuming the alpha-beta model, sending z items, each as its own message of b bytes, costs z ∗ (α + β ∗ b). With coalescing, given a buffer size of g items, the communication cost for z items is (z/g) ∗ (α + β ∗ b ∗ g), or (z/g) ∗ α + β ∗ b ∗ z, essentially reducing the α component of the message cost by a factor of g. This has implications for latency, however: the average message latency now increases with the buffer fill time. Assuming a buffer fill rate of r items per unit time, the latency of an item in the buffer can increase by up to g/r. On the other hand, in some cases with many tiny messages, aggregation can also improve message latency, since the sender PE is blocked less for future messages, improving overall latencies. Smaller buffer sizes and node-aware schemes can further improve latencies for latency-sensitive applications.

Processing delays: When coalescing messages, there are a few delays, such as the higher cost per byte and buffer fill delays. Another delay is the overhead O that is added once per aggregated message sent or received. On the sending processor, the overhead comes from contention when we maintain common buffers per physical node. At the destination node or process, the receiving PE sorts or groups messages by PE and communicates the coalesced message locally to each PE. The sorting or grouping of items by worker, for t workers per process, results in an additional delay of O(g + t) to output a grouped coalesced message of the same size as the buffer g. Local sends are typically fast in shared-memory systems like Charm++.

D. Benchmarks and Applications

All applications are written in Charm++ and use the API of the TramLib aggregation library, implemented as a Charm++ library module.

Histogram: The histogram benchmark is based on the one from the Bale benchmark suite [7]. It uses a histogram table distributed across all PEs, and all PEs send a given number of updates to the distributed histogram. Each PE invokes the flush call at the end of all updates to flush any remaining updates in the TramLib buffers. Since no dependent communication items are created as a result of a histogram update, latency is not directly relevant to this benchmark. Instead, it can be used to measure overhead in isolation.

Index-gather (IG): This example is also based on the Bale benchmark suite. Each PE sends a given number of requests to other PEs, which send a response upon receiving the request. This benchmark helps us measure the latency of a message, since we can measure the time between a request sent and the response received on a given PE. While latency can be measured in other benchmarks as well, measuring the time between sends and receives on different nodes can include clock skew error. Hence we use IG for analyzing the latency of the various aggregation schemes.

SSSP: The single-source shortest path graph algorithm used in this work has vertices distributed across chares, with one chare per PE. The algorithm performs speculative execution, computing distances based on the currently available distances of the neighbors of vertices to the source vertex. If a PE receives an updated distance for a vertex that is smaller than the previously known distance, the value is updated. The updated value is sent to the neighbor vertices of this vertex, which then may update their distances similarly (a simplified serial sketch of this update rule is given at the end of this subsection). The neighbor vertices may be on remote or local PEs. A threshold value is used in this benchmark to help prioritize updates with smaller distances, in order to minimize wasted updates. This is because updates with larger distances are likely to become wasted updates, since they are likely to be overwritten by smaller values later. The program terminates when all updates have been consumed. This benchmark is latency sensitive, since higher message latency is likely to result in more wasted updates being created.

PDES - optimistic: This refers to the optimistic concurrency-control engine in parallel discrete event simulation (PDES). Similar to SSSP, this benchmark helps us analyze rollbacks resulting from higher message latency, comparing PP with the other schemes. PHOLD is a well-known synthetic benchmark used in PDES. We implemented a synthetic PHOLD with a place-holder simulation engine to analyze how latency affects rollbacks. In this engine, we do not perform real rollbacks; instead, we only keep track of out-of-order messages received.
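The following is a minimal, serial sketch of the speculative relaxation rule described above for SSSP. It is illustrative only (the real benchmark distributes vertices across chares, receives updates as messages, and uses a distance threshold for prioritization); it shows how accepting updates out of order creates wasted updates, the quantity reported in Section IV.

    #include <climits>
    #include <cstdio>
    #include <queue>
    #include <vector>

    // Illustrative serial sketch of speculative SSSP relaxation (not the benchmark code).
    // An "update" carries a candidate distance for a vertex; an update that arrives after
    // a smaller distance is already known is wasted work (the analogue of a rollback).
    struct Edge { int to; int weight; };
    struct Update { int vertex; int dist; };

    int main() {
      std::vector<std::vector<Edge>> graph = {   // tiny example graph
          {{1, 4}, {2, 1}},  // vertex 0
          {{3, 1}},          // vertex 1
          {{1, 2}, {3, 5}},  // vertex 2
          {}                 // vertex 3
      };
      std::vector<int> dist(graph.size(), INT_MAX);
      std::queue<Update> pending;   // stands in for messages delivered to PEs
      pending.push({0, 0});         // source vertex
      int wasted = 0;

      while (!pending.empty()) {
        Update u = pending.front();
        pending.pop();
        if (u.dist >= dist[u.vertex]) {  // a smaller distance is already known
          ++wasted;                      // this update is wasted work
          continue;
        }
        dist[u.vertex] = u.dist;         // accept the speculative update
        for (const Edge& e : graph[u.vertex])
          pending.push({e.to, u.dist + e.weight});  // forward new candidates to neighbors
      }
      std::printf("dist(3) = %d, wasted updates = %d\n", dist[3], wasted);
      return 0;
    }

On this 4-vertex example the program prints dist(3) = 4 with one wasted update; delaying the delivery of the smaller-distance updates (as aggregation does) would increase that count, which is why the benchmark is latency sensitive.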
IV. EVALUATION

A. System and benchmarks

Our experiments are performed on the Delta supercomputer at NCSA. We use the SMP build of Charm++ for our runs, with OFI as the network layer. Each Delta node is an AMD EPYC 776 dual-socket system with 128 cores per node, 1 hardware thread per core, a clock rate of 2.45 GHz, and 256 GB of RAM. We use 8 processes per physical node, to allow for a power-of-2 number of processes, since we need 1 core per process for its communication thread. The remaining cores are left unused. Hence, on each Delta node we have 64 worker cores distributed across 8 processes (with 8 workers, i.e., ppn 8, each) and 8 communication-thread cores.

Our initial experiment addresses the communication-thread bottleneck in SMP mode, detailed in Section III-A. For our purposes, we use multiple processes per node to mitigate communication-thread performance issues. Varying the number of worker threads per process for the histogram benchmark, in Figure 8, we observe that setting 8 workers (ppn 8) per SMP process in the WPs scheme performs on par with the non-SMP implementation for the histogram benchmark. For the remaining experiments, we use a ppn of 8 on Delta, with 8 processes per node.

Fig. 8. Histogram, 1M updates/PE: total time (seconds) vs. number of physical nodes (2 to 16) for WPs with ppn 4, 8, 16, and 32, and for non-SMP.

To assess the effectiveness of our SMP-aware aggregation schemes, in Figure 9 we measure the overall execution time of the different schemes with increasing node counts for the histogram application with 1 million updates per PE (or worker). This is an example of weak scaling, as the amount of work per PE (1 million updates) remains the same. We observe that the WPs scheme scales well across all 64 nodes. PP and WsP also scale to 64 nodes, but PP has overheads from atomics, and WsP, which performs sorting or grouping before sends, shows poorer scaling than WPs. WW, on the other hand, stops scaling after 16 nodes. This is likely because the buffer size of 1024 is not filled up per worker from 32 nodes onwards: for 32 nodes, with 8 processes of 8 workers each, the total number of destinations per source worker is 2048, which means that for an update count of 1 million, each destination worker would receive only about 500 values. Therefore, the sends are dominated by flush costs, resulting in 2048 messages per source worker, compared to a flush for PP, say, requiring only 32 messages.

Fig. 9. Histogram, 1M updates/PE: total time (seconds) vs. number of physical nodes (2 to 64) for WW, WPs, PP, WsP, and non-SMP.

We briefly analyze the effect of buffer size in Figure 10 and find that the node-level aggregation schemes perform better with increasing buffer sizes for histogramming. WW requires flushes at a 4K buffer size for 1M updates per PE, hence its overall time gets worse as the buffer size increases beyond 2K.

Fig. 10. Histogram, 1M updates/PE: total time (seconds) for buffer sizes 512 to 4096, on 8-node runs, for WW, WPs, and PP.
We can think of applications as ranging from, at one end, streaming applications to, at the other end, latency-sensitive applications that have fewer updates and require frequent flushes. To mimic applications with frequent flushes, we analyze the performance of the histogramming benchmark with only 128K updates generated per PE. This is shown in Figure 11.

For very small update counts, e.g., 128K updates/PE, the per-destination-PE buffers (as in WW) do not fill up from 8 nodes onwards, hence the execution time is dominated by flush calls. Since a flush empties all destination buffers, the total number of messages in WW is higher than for WPs and PP, where the number of buffers is reduced by a factor of the number of PEs per process. In Figure 11, the difference in overall time between the schemes is significant, with WPs and PP performing significantly better. PP likely incurs the overhead of atomics, hence its overall execution time does not improve relative to WPs. We have tuned the buffer size depending on application performance and latency in this set of experiments, with a buffer size of 512 for WW and 1024 for the other schemes.

Fig. 11. Histogram, 128K updates/PE: total time (seconds) vs. number of physical nodes (2 to 16) for WW (512 buffer), WPs (1K buffer), PP (1K buffer), and WsP (1K buffer); WW incurs higher costs from flushes at 8 nodes and above.

As mentioned before, latency is a key metric in many applications. It can be affected by the buffer size, the per-message overhead, or the aggregation scheme used. We measure latency using the IG benchmark, with a buffer size of 1024 for all the aggregation schemes. Figure 12 shows that, in terms of latency, PP < WPs < WW. For overall cost, however, for IG in Figure 13, the overhead of sorting in WPs and the overhead of atomics in PP seem to affect the overall execution time at 16 nodes. However, for IG we are mainly concerned with using the benchmark to measure the latency of the different schemes, which we will later relate to the other applications.

Fig. 12. Index-gather, 8M requests/PE: latency (seconds) vs. number of physical nodes (2 to 16) for WW, WPs, and PP.

Fig. 13. Index-gather, 8M requests/PE: total time (seconds) vs. number of physical nodes (2 to 16) for WW, WPs, and PP.

For latency-sensitive applications such as SSSP, with smaller problem sizes, more PEs also mean higher chances of wasted updates while PEs wait on updates from other PEs. We show how the different schemes affect wasted updates for a small problem size in Figures 14 and 15. In terms of wasted updates, we observe that PP < WPs < WW.

For SSSP with large problem sizes, which scale well across multiple nodes, we do not see a significant difference in wasted updates across the different schemes, as seen in Figures 16 and 17. However, WPs performs considerably better than WW, which is likely due to WW's frequent flush calls and memory footprint. We plan to explore this further using PAPI performance counters and other analysis techniques.

Finally, for the PHOLD synthetic benchmark, we run our experiments with a higher ppn of 32 to see the benefits of message consolidation in node-aware schemes compared to node-unaware aggregation schemes. The PHOLD benchmark's rejected updates are more than 5% fewer for the node-aware PP scheme, as shown in Figure 18. For PHOLD, WW's execution time was much higher (over 5×) than that of the other schemes. This could be due to inefficient and frequent flush operations called from the synthetic benchmark, which we are still investigating. The greater-than-5% reduction in wasted updates for PP can be significant for PDES simulations, where rollbacks are expensive.

Fig. 18. PHOLD synthetic benchmark: wasted updates (in millions) vs. number of processes (2 and 4, i.e., logical nodes) for WW, WPs, and PP.
Fig. 14. SSSP, small problem size (8M vertices): total time (seconds) vs. number of processes (8, 16, 32; i.e., logical nodes) for WW, WPs, and PP.

Fig. 15. SSSP, small problem size (8M vertices): wasted updates (normalized) vs. number of processes (8, 16, 32) for WW, WPs, and PP.

Fig. 16. SSSP, large input size (62M vertices): total time (seconds) vs. number of physical nodes (1 to 8) for WW and WPs.

Fig. 17. SSSP, large input size (62M vertices): wasted updates (normalized) vs. number of physical nodes (1 to 8) for WW and WPs.

V. RELATED WORK

Several works have shown the benefits of message aggregation. Early work was in the context of the implementation of the all-to-all collective by Thakur and Choudhary [10]. Sameer Kumar et al. [11] studied the performance of all-to-all messaging with routing and aggregation in 2-D mesh, 3-D grid, and hypercube virtual topologies. In contrast to all-to-all, where all data is handed over to the library at once at the beginning, streaming applications need to support a continuous stream of short messages. TRAM [1] in Charm++ supports streaming as well as all-to-all, performing routing through an N-dimensional virtual topology and dynamically aggregating messages to reduce the message count. The all-to-all performance reported in the TRAM paper shows up to an order-of-magnitude improvement from routing and aggregation. Additionally, in a run on 64 nodes, increasing the buffer size up to a certain size, here 8K bytes, performs best, while the best dimensionality for the virtual topology can be seen to depend on the data size contributed for each destination. The minimum latency of an item remains the same, but the number of messages is reduced by the size of the buffer. Active Pebbles [2] is a parallel programming model for data-driven applications. It aggregates messages meant for a destination process (on a destination node), performs local reductions if needed, and routes the messages through a virtual topology. Active Pebbles, however, does not support a shared memory programming model, in order to avoid within-node contention.

Shared memory parallelism with a large number of cores presents a new set of challenges for message aggregation. Existing systems like Active Pebbles do not support SMP, to avoid dealing with atomics, locks, and contention within a node. Some systems like Conveyors [4] assume that communication within a (multithreaded) process will happen using shared memory, and focus on messages sent from process to process. YGM [5] allows for coalescing of messages from source processes (mapped to cores in the MPI-everywhere model) in a node, or to destination processes (mapped to cores) in a destination node. It does so by performing node-local sends and receives, which rely on the efficient underlying implementation of local sends/receives instead of directly utilizing SMP constructs. In TRAM, to support SMP cores, assigning physical nodes as a dimension would still require routing/aggregation within a node and would add an extra dimension, leading to sub-optimal performance. Other works rely on POSIX shared memory [6] for communicating between processes on the same physical node. Given the SMP advantages outlined earlier (Section I), we aim to overcome these challenges and support node-aware message aggregation within the SMP context.

VI. CONCLUSION AND FUTURE WORK

We have discussed the design of various message aggregation schemes suited for today's large supercomputer nodes. We have demonstrated the effectiveness of SMP-aware aggregation schemes that utilize process-local sends and atomics to coalesce messages in a shared memory setting. We analyzed the schemes using two key metrics, namely overhead and latency. While message aggregation aims to lower overall execution time by reducing communication cost, aggregation increases the latency of the items being aggregated. For latency-sensitive applications, we have shown the benefits resulting from lower-latency aggregation schemes. In the future, we plan to support prioritization of items, which should help latency- or cost-sensitive applications such as SSSP and PDES even more directly. We also plan on further analyzing the benefits of node-aware aggregation across processes within a physical node, for instance using XPMEM for cross-process shared memory benefits, in addition to within-process message aggregation.
REFERENCES

[1] L. Wesolowski, R. Venkataraman, A. Gupta, J.-S. Yeom, K. Bisset, Y. Sun, P. Jetley, T. R. Quinn, and L. V. Kale, "TRAM: Optimizing fine-grained communication with topological routing and aggregation of messages," in Proceedings of the International Conference on Parallel Processing (ICPP '14), Minneapolis, MN, September 2014.
[2] J. J. Willcock, T. Hoefler, N. G. Edmonds, and A. Lumsdaine, "Active Pebbles: Parallel programming for data-driven applications," in Proceedings of the International Conference on Supercomputing (ICS '11), New York, NY, USA: Association for Computing Machinery, 2011, pp. 235–244. [Online]. Available: https://doi.org/10.1145/1995896.1995934
[3] B. Priest, T. Steil, G. Sanders, and R. Pearce, "You've got mail (YGM): Building missing asynchronous communication primitives," in 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2019, pp. 221–230.
[4] F. M. Maley and J. G. DeVinney, "Conveyors for streaming many-to-many communication," in 2019 IEEE/ACM 9th Workshop on Irregular Applications: Architectures and Algorithms (IA3), 2019, pp. 1–8.
[5] T. Steil, T. Reza, B. Priest, and R. Pearce, "Embracing irregular parallelism in HPC with YGM," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '23), New York, NY, USA: Association for Computing Machinery, 2023. [Online]. Available: https://doi.org/10.1145/3581784.3607103
[6] T. Hoefler, J. Dinan, D. Buntinas, P. Balaji, B. Barrett, R. Brightwell, W. Gropp, V. Kale, and R. Thakur, "MPI + MPI: A new hybrid approach to parallel programming with MPI plus shared memory," Computing, vol. 95, no. 12, pp. 1121–1136, Dec. 2013. [Online]. Available: https://doi.org/10.1007/s00607-013-0324-2
[7] F. M. Maley and J. G. DeVinney, "Bale benchmark suite," https://github.com/jdevinney/bale, [Online; accessed 10-Aug-2024].
[8] R. Zambre, A. Chandramowlishwaran, and P. Balaji, "Scalable communication endpoints for MPI+threads applications," in 2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS), 2018, pp. 803–812.
[9] R. Zambre and A. Chandramowlishwaran, "Lessons learned on MPI+threads communication," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '22), IEEE Press, 2022.
[10] R. Thakur and A. Choudhary, "All-to-all communication on meshes with wormhole routing," in Proceedings of the 8th International Parallel Processing Symposium, 1994, pp. 561–565.
[11] L. V. Kale, S. Kumar, and K. Vardarajan, "A framework for collective personalized communication," Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign, Tech. Rep. 02-10, 2002 (communicated to IPDPS 2003).
