Shared Memory-Aware Latency-Sensitive Message Aggregation For Fine-Grained Communication
Abstract—Message aggregation is often used with a goal to reduce communication cost in HPC applications. The difference in the order of the overhead of sending a message and the cost per byte transferred motivates the need for message aggregation for several irregular fine-grained messaging applications, like graph algorithms and parallel discrete event simulation (PDES). While message aggregation is frequently utilized in the "MPI-everywhere" model, to coalesce messages between processes mapped to cores, such aggregation across threads in a process, say in MPI+X models or Charm++ SMP (Shared Memory Parallelism) mode, is often avoided. Within-process coalescing is likely to require synchronization across threads and lead to performance issues from contention. However, as a result, SMP-unaware aggregation mechanisms may not fully utilize the aggregation opportunities available to applications in SMP mode. Additionally, while the benefit of message aggregation is often analyzed in terms of reducing the overhead, specifically the per-message cost, we also analyze different schemes that can aid in reducing the message latency, i.e., the time from when a message is sent to the time when it is received. Message latency can affect several applications, such as PDES with speculative execution, where reducing message latency could result in fewer rollbacks. To address these challenges, in our work we demonstrate the effectiveness of shared memory-aware message aggregation schemes for a range of proxy applications, with respect to messaging overhead and latency.

Index Terms—Message aggregation, SMP, Charm++, runtime

I. INTRODUCTION

Message aggregation libraries aim to provide an interface for applications to consolidate messages, in order to reduce the communication cost associated with processing many messages. Previous schemes for aggregation, like TRAM [1] in Charm++ and the streaming library Active Pebbles [2], use topology-aware routing schemes, but these are less beneficial for modern topologies like fat-trees.

Modern machines have many tens of cores per physical node (also called "host"). Applications tend to use them in two modes: in one mode, there is a separate process bound to each core used by the application; this is called MPI-everywhere for MPI applications, and non-SMP mode in Charm++ applications. The other mode uses multiple processes per node, but each process owns multiple cores and facilitates shared memory parallelism (SMP) across the cores within a process. For MPI applications, parallelism within a process and across cores is managed by one of the thread-parallel libraries such as OpenMP or Kokkos. Charm++ does not require a separate programming model for within-process communication, but utilizes the shared memory for optimizing inter-object communication and cooperation. It is this SMP mode that is the focus of our work. This SMP scenario amplifies the need for message aggregation further.

For the MPI-everywhere mode, systems like YGM [3] have recently looked into implementing node-aware schemes for message aggregation. YGM's mechanism for coalescing messages at the node level uses a phase of node-local exchange, which relies on the underlying MPI implementation's method of achieving cheaper within-node communication (e.g., xpmem or CMA). But the context we aim at is one where worker threads divide the application's work among cores, yet need to exchange many short messages with other workers, in a shared memory setting.

The SMP mode, in Charm++ as well as MPI, presents several potential advantages to applications. Worker threads can share and divide work more efficiently, for better load balance (using dynamic schedules in OpenMP, or stealable tasks and object migration in Charm++). Large read-only data structures can be shared among workers without making multiple copies. Even dynamic data structures, such as software particle (or tree) caches in astronomy codes, can save duplicate external (across-node) communications. These orthogonal benefits may make SMP mode attractive independent of message-aggregation considerations.

The questions we explore in this paper are: Given that an application wants to operate in SMP mode, what options are there for organizing message aggregation, and how do they compare with each other? Does SMP-mode aggregation by itself provide benefits over non-SMP aggregation that make SMP mode attractive even for applications that don't particularly require it? Finally, what metrics are important for such comparisons?

The main metric that has motivated research on message aggregation has been the overhead. The cost associated with sending a message over the network can be computed using the alpha-beta communication model as α + N β, where N is the number of bytes. Here α is the latency per message and β (the inverse of bandwidth) is the cost per byte transferred.
To illustrate the costs, we measure the time taken to send a message (roundtrip time/2) between two physical nodes on the Delta supercomputer using a ping-pong benchmark. As we can see in Figure 1, the time for small message sizes is dominated by the latency cost (α) and is on the order of microseconds. The cost of sending an additional byte of data (β), on the other hand, is about 0.1 nanosecond (indicating a bandwidth of about 12 GB/s). The primary motivating factor for message aggregation is this discrepancy in the order of α and β values. While α is in microseconds, β is less than a nanosecond per byte, motivating the need for fine-grained messaging applications to aggregate many small messages into fewer large messages.

[Fig. 1: ping-pong time (RTT/2, in µs) between two Delta nodes as a function of message size.]
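For illustration, a minimal back-of-the-envelope sketch (assuming α ≈ 2 µs and β ≈ 0.1 ns/byte, rounded from the measurements above; the constants and item size are only examples, not measurements from this work) shows how strongly α dominates for small messages:

```cpp
#include <cstdio>

int main() {
  // Rounded, assumed constants based on the ping-pong measurement in Figure 1.
  const double alpha = 2e-6;   // per-message cost, seconds
  const double beta  = 0.1e-9; // per-byte cost, seconds

  // Cost of one 8-byte message vs. one message carrying 1024 such items.
  double tiny_msg = alpha + beta * 8;        // dominated entirely by alpha
  double big_msg  = alpha + beta * 8 * 1024; // alpha still dominates the added bytes
  std::printf("8 B message: %.3f us, 8 KiB message: %.3f us\n",
              tiny_msg * 1e6, big_msg * 1e6);
  std::printf("1024 separate 8 B messages: %.3f us\n", 1024 * tiny_msg * 1e6);
  return 0;
}
```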
Observing the behavior of many such applications, and by virtue of the co-design of our new aggregation library with a graph algorithm, we inferred that, aside from the effects of message aggregation on communication cost, a key metric influenced by message aggregation is latency. Different application categories are affected differently by latency. We define latency here as the time from when an item is generated by the application to when it is delivered to the application on another processor. While aggregation reduces the cost of sending items, it increases the latency associated with each item. As items are buffered, latency is higher for messages than without any aggregation. In extreme cases, for applications that are highly latency sensitive, message aggregation may have to be avoided. But in practice, the large α cost of short messages means that we have to deal with the dual goals of reducing latency while simultaneously reducing the overhead.

The paper presents multiple schemes for aggregation in the context of SMP applications (where cores on a node
(or group) items within the message at some stage. We

[Fig. 5 diagram: source and destination processes, each with multiple workers, exchanging <item, dest_w> entries.]
Fig. 5. WPs: Source worker maintains a buffer per destination process

The same grouping techniques can be extended one level up, to the physical node, if it houses multiple processes. Also, in the past, multi-dimensional routing across nodes had been considered, where messages going from one physical node to another go via one or more intermediary nodes, which act like

C. Analysis

Below we analyze the following costs associated with node-aware TramLib:
1) Memory overhead
2) Number of messages sent
3) Message send cost
4) Processing delays at receiver

Memory overhead: We use g to denote the number of items in a TRAM buffer and m to denote the size of each item, similar to the notation used in the previous work on TRAM [1]. With node-aware TramLib, we do not use the topological multi-dimensional routing of previous TRAM. So, given the total number of processes N, and assuming t workers per process (i.e., Pthreads bound to disjoint cores), the memory overheads related to buffer allocation for the different schemes are:
• WW: Aggregating messages on each source PE per destination PE = g ∗ m ∗ N ∗ t per core, or g ∗ m ∗ N ∗ t² per process
• WPs, WsP: Aggregating messages on each source PE per destination node = g ∗ m ∗ N per core, or g ∗ m ∗ N ∗ t per process
• PP: Aggregating messages at each source process (instead of PE) per destination process = g ∗ m ∗ N per source process
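A minimal sketch of these buffer footprints (the numeric values of g, m, N, and t below are arbitrary examples, not measurements from this work):

```cpp
#include <cstdio>

int main() {
  const long g = 1024;  // items per buffer
  const long m = 16;    // bytes per item
  const long N = 64;    // total number of processes
  const long t = 8;     // workers (PEs) per process

  long ww_per_process  = g * m * N * t * t;  // WW: one buffer per destination PE on each source PE
  long wps_per_process = g * m * N * t;      // WPs/WsP: one buffer per destination process on each source PE
  long pp_per_process  = g * m * N;          // PP: one shared buffer per destination process
  std::printf("WW: %ld  WPs/WsP: %ld  PP: %ld bytes per process\n",
              ww_per_process, wps_per_process, pp_per_process);
  return 0;
}
```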
However, the overall impact of these alternatives is complicated by the fact that the optimal buffer size g would be different for each of the schemes, and also because, for some applications, the latency of individual messages may be important.
Number of messages sent: For a buffer size g, if z items are sent from each source PE, the number of messages sent for the different schemes is:
• WW: The number of messages sent per source PE will have a lower bound of z/g and an upper bound of z/g + N ∗ t
• WPs, WsP: The number of messages sent per source PE will have a lower bound of z/g and an upper bound of z/g + N
• PP: With source-process aggregation, the number of messages sent per source process will have a lower bound of z/g and an upper bound of z/g + N

This assumes that no intermittent flushing is triggered by the application, except once at the end. The upper bound is reached when partially filled buffers have to be flushed at the end of the communication phase. It can be seen that for long streaming applications, where z >> g, the second term in the upper bound is vanishingly small, and so the schemes are almost identical in the number of messages sent. But for short streams of items, such as in a short all-to-all operation, aggregating by the destination process (i.e., using one buffer for all items going to different destination PEs belonging to the same destination process) results in the fewest messages.
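The bounds for a single sender can be captured in a few lines (a sketch only; D denotes the number of destination buffers the sender maintains, i.e., N ∗ t for WW and N for the other schemes, and the numeric values are examples):

```cpp
#include <cstdio>

// Lower/upper bounds on messages sent by one sender that streams z items
// through buffers of g items each, with D destination buffers flushed once at the end.
void bounds(const char* scheme, long z, long g, long D) {
  long lower = z / g;      // every message leaves with a full buffer
  long upper = z / g + D;  // plus at most one partially filled flush per destination
  std::printf("%s: [%ld, %ld] messages\n", scheme, lower, upper);
}

int main() {
  const long z = 1'000'000, g = 1024, N = 64, t = 8;
  bounds("WW (per source PE)",      z, g, N * t);
  bounds("WPs/WsP (per source PE)", z, g, N);
  bounds("PP (per source process)", z, g, N);
  return 0;
}
```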
Message send cost: Assuming the alpha-beta model, sending z items as one message each would take z ∗ (α + β ∗ b) in communication cost, assuming b bytes per item are sent. With coalescing, given a buffer size g, the communication cost for z items is (z/g) ∗ (α + β ∗ b ∗ g), or (z/g) ∗ α + β ∗ b ∗ z, essentially reducing the α component of the message cost by a factor of g. This has implications for latency, however: the average message latency now increases based on the buffer fill rate. Assuming a buffer fill rate of r, the latency of an item in the buffer can increase by up to g/r. On the other hand, in some cases with many tiny messages, aggregating messages can also improve message latency, since the sender PE is blocked less for future messages, hence improving overall latencies. Smaller buffer sizes and node-aware schemes can further improve latencies for latency-sensitive applications.
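The dual effect on overhead and latency can be sketched as follows (illustrative constants only; r is the buffer fill rate in items per second, and g/r bounds the extra time an item may wait in its buffer):

```cpp
#include <cstdio>

int main() {
  const double alpha = 2e-6;   // per-message cost (s), assumed
  const double beta  = 0.1e-9; // per-byte cost (s), assumed
  const double b = 8;          // bytes per item
  const double z = 1e6;        // items sent by one worker
  const double r = 1e6;        // buffer fill rate, items per second (assumed)

  const double buffer_sizes[] = {64, 1024, 16384};
  for (double g : buffer_sizes) {
    double overhead    = (z / g) * alpha + beta * b * z; // total send cost for z items
    double added_delay = g / r;                          // worst-case wait for the buffer to fill
    std::printf("g=%6.0f  overhead=%.4f s  added latency up to %.6f s\n",
                g, overhead, added_delay);
  }
  return 0;
}
```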
Processing delays: When coalescing messages, there are a few delays, such as a higher cost per byte and buffer-fill delays. Another delay is the overhead O that is added once per aggregated message sent or received. On the sending processor, the overhead comes from contention when we maintain common buffers per physical node. At the destination node or process, the receiving PE sorts or groups messages by PE and communicates the coalesced message locally to each PE. The sorting or grouping of items by worker, for t workers per process, results in an additional delay of O(g + t) to output a grouped coalesced message of the same size as the buffer g. Local sends are typically fast in shared memory systems like Charm++.
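A sketch of this receiver-side grouping step (the Item layout is a hypothetical stand-in, not TramLib's actual message format): one O(g + t) pass buckets a coalesced buffer of g items into t per-worker lists.

```cpp
#include <vector>

struct Item {
  int  dest_worker;  // worker (PE) index within the destination process, 0..t-1
  long payload;      // application data, e.g. a histogram index or a distance update
};

// Group a received coalesced buffer by destination worker: O(g + t).
std::vector<std::vector<Item>> groupByWorker(const std::vector<Item>& buffer, int t) {
  std::vector<std::vector<Item>> perWorker(t);  // O(t) setup
  for (const Item& it : buffer)                 // O(g) single pass over the buffer
    perWorker[it.dest_worker].push_back(it);
  return perWorker;                             // each list is then delivered locally
}

int main() {
  std::vector<Item> buf = {{0, 42}, {1, 7}, {0, 13}};
  auto grouped = groupByWorker(buf, /*t=*/2);
  return grouped[0].size() == 2 ? 0 : 1;
}
```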
D. Benchmarks and Applications

All applications are written in Charm++ and use the API from the TramLib aggregation library, implemented as a Charm++ library module.

Histogram: The histogram benchmark is based on the one from the Bale benchmark suite [7]. It uses a distributed histogram table on all PEs, and all PEs send a given number of updates to the distributed histogram. Each PE invokes the flush call at the end of all its updates, to flush any remaining updates in the TramLib buffers. Since no dependent communication items are created as a result of a histogram update, latency is not directly relevant for this benchmark. Instead, it can be used to measure overhead in isolation.
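A sketch of the per-PE update loop (the Aggregator type and its insert/flush calls are hypothetical stand-ins for the TramLib API, whose actual interface is not reproduced here; the structure mirrors the description above):

```cpp
#include <cstdint>
#include <cstdio>
#include <random>

// Hypothetical aggregation handle standing in for the TramLib API (names are assumptions,
// not the library's real interface). Here it only counts what a real library would buffer/send.
struct Aggregator {
  long buffered = 0, flushes = 0;
  void insert(int /*dest_pe*/, uint64_t /*item*/) { ++buffered; } // a real library would coalesce here
  void flush() { ++flushes; }                                     // and send partially filled buffers here
};

int main() {
  const int  num_pes = 64;
  const long updates_per_pe = 1'000'000, table_size_per_pe = 1 << 20;
  Aggregator tram;
  std::mt19937_64 rng(12345);
  for (long i = 0; i < updates_per_pe; ++i) {
    uint64_t slot = rng() % (uint64_t)(num_pes * table_size_per_pe); // random global histogram slot
    int dest_pe   = (int)(slot / table_size_per_pe);                 // PE that owns that slot
    tram.insert(dest_pe, slot % table_size_per_pe);                  // buffered, not sent immediately
  }
  tram.flush();  // single flush at the end, as in the benchmark description
  std::printf("buffered %ld updates, %ld flush call(s)\n", tram.buffered, tram.flushes);
  return 0;
}
```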
Index-gather (IG): This example is also based on the Bale benchmark suite. Each PE sends a given number of requests to other PEs, which send a response upon receiving each request. This benchmark helps us measure the latency of a message, since we can measure the time between a request being sent and the response being received on a given PE. While latency can be measured in other benchmarks as well, measuring the time between sends and receives on different nodes can include clock-skew error. Hence we use IG for analyzing latency under the various aggregation schemes.
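A sketch of why IG yields a clean latency measurement (illustrative only; the timing calls are ordinary C++, not benchmark code): both timestamps are taken on the requesting PE, so no cross-node clock synchronization is involved.

```cpp
#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

int main() {
  Clock::time_point sent = Clock::now();  // recorded just before handing the request to the aggregator
  // ... request travels to the remote PE, which sends the value back ...
  Clock::time_point reply_arrived = Clock::now();  // recorded in the reply handler, on the same PE

  double latency = std::chrono::duration<double>(reply_arrived - sent).count();
  std::printf("request-to-response latency: %.9f s\n", latency);  // no remote clock involved
  return 0;
}
```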
SSSP: The single-source shortest path graph algorithm used in this work has vertices distributed across chares, with one chare per PE. The algorithm performs speculative execution by computing distances based on the available distances of vertices' neighbors to the source vertex. If a PE receives an updated distance for a vertex that is smaller than the previously known distance, the value is updated. The updated value is sent to the neighbor vertices of this vertex, which may then update their distances similarly. The neighbor vertices may be on remote PEs or local PEs. A threshold value is used in this benchmark to help prioritize updates with smaller distances, in order to minimize wasted updates. This is because updates with larger distances are likely to become wasted updates, since they are likely to be overwritten by smaller values later. The program terminates when all updates have been consumed. This benchmark is latency sensitive, since higher latency of messages is likely to result in more wasted updates being created.
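A simplified sketch of the speculative relaxation step described above (the Graph and Aggregator types are hypothetical stand-ins, not the benchmark's actual code; the threshold test mirrors the prioritization in the text):

```cpp
#include <utility>
#include <vector>

struct Graph {
  // For each locally owned vertex: adjacency list of (neighbor global id, edge weight).
  std::vector<std::vector<std::pair<long, double>>> adj;
  std::vector<double> dist;  // best-known distance to the source for each local vertex
};

// Hypothetical aggregator: buffers a (vertex, distance) update toward the PE owning `vertex`.
struct Aggregator { void send(long vertex, double dist) { /* coalesced send, omitted */ } };

// Speculative relaxation of one incoming update; low-distance work is propagated first.
void onUpdate(Graph& g, Aggregator& tram, std::vector<std::pair<long, double>>& deferred,
              long local_v, double new_dist, double threshold) {
  if (new_dist >= g.dist[local_v]) return;      // no improvement: a stale (wasted) update
  g.dist[local_v] = new_dist;                   // accept the smaller distance
  if (new_dist > threshold) {                   // deprioritize larger distances;
    deferred.emplace_back(local_v, new_dist);   // they are likely to be superseded later
    return;
  }
  for (auto& [nbr, w] : g.adj[local_v])         // propagate to neighbors (remote or local PEs)
    tram.send(nbr, new_dist + w);
}

int main() {
  Graph g;
  g.adj  = {{{1, 1.0}}, {}};                    // vertex 0 -> vertex 1 (weight 1)
  g.dist = {1e30, 1e30};
  Aggregator tram;
  std::vector<std::pair<long, double>> deferred;
  onUpdate(g, tram, deferred, 0, 0.0, 10.0);    // source update: distance 0, below threshold
  return g.dist[0] == 0.0 ? 0 : 1;
}
```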
PDES - optimistic: This refers to the optimistic concurrency control engine in parallel discrete event simulation (PDES). Similar to SSSP, this benchmark will help us analyze rollbacks resulting from the higher latency of PP compared to the node-aware schemes. PHOLD is a well-known synthetic benchmark used in PDES. We implemented a synthetic PHOLD with a placeholder simulation engine to analyze how latency affects rollbacks. In this engine, we do not perform real rollbacks; instead, we only keep track of out-of-order messages received.
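A sketch of the placeholder engine's bookkeeping (hypothetical types; the engine simply counts events arriving with a virtual timestamp older than one already processed, as a proxy for rollbacks):

```cpp
#include <cstdio>

struct LP {                        // one logical process of the PHOLD-like model
  double last_processed = 0.0;     // virtual time of the most recent event handled
  long   out_of_order   = 0;       // events that would have forced a rollback
};

void receiveEvent(LP& lp, double timestamp) {
  if (timestamp < lp.last_processed)
    ++lp.out_of_order;             // arrived "late": an optimistic engine would roll back here
  else
    lp.last_processed = timestamp; // in-order: simply advance virtual time
}

int main() {
  LP lp;
  const double arrivals[] = {1.0, 2.0, 1.5, 3.0, 2.5}; // higher message latency makes late arrivals likelier
  for (double ts : arrivals)
    receiveEvent(lp, ts);
  std::printf("out-of-order events: %ld\n", lp.out_of_order); // prints 2
  return 0;
}
```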
IV. EVALUATION

A. System and benchmarks

Our experiments are performed on the Delta supercomputer at NCSA. We use the SMP build of Charm++ for our runs, with OFI as the network layer. Each Delta node has dual-socket AMD EPYC 7763 processors, with 128 cores per node, 1 hardware thread per core, a clock rate of 2.45 GHz, and 256 GB of memory.

[Figure: Histogram - 1M updates/PE; time vs. number of physical nodes (2-16); series: WW, WPs, WPs (ppn 4), PP, non-SMP.]
Fig. 8. Histogram 1M smp vs non-smp

To assess the effectiveness of our SMP-aware aggregation schemes, in Figure 9 we measure overall execution time for the different schemes with increasing node counts, for the histogram benchmark with 1 million updates per PE (or worker). This is an example of weak scaling, as the amount of work per PE (1 million updates) remains the same. We observe that the WPs scheme scales well for all 64 nodes. PP and WsP also scale up to 64 nodes, but PP has overheads from atomics, and in the case of WsP, which performs sorting or grouping before sends, we observe poorer scaling than WPs. WW, on the other hand, stops scaling after 16 nodes. This is likely because the buffer size of 1024 is not filled up per worker from 32 nodes onwards. For 32 nodes, with 8 processes each with 8 workers, the total number of destinations per source worker is 2048, which means that for an update count of 1 million, each destination worker would receive only about 500 values. Therefore, the sends are dominated by flush costs, resulting in 2048 messages per source worker, compared to a flush for PP, say, requiring only 32 messages.

[Figure: Histogram 1M; time vs. buffer size (512, 1024, 2048, 4096) for 8-node runs.]
Fig. 10. Histogram 1M example - varying buffer size for 8 node runs

We can think of applications as ranging from streaming applications at one end to latency-sensitive applications, which have fewer updates and require frequent flushes, at the other. To mimic applications with frequent flushes, we analyze the performance of the histogramming benchmark with only 128K updates generated per PE. This is shown in Figure 11.

[Figure: Histogram - 128k updates/PE; total time (s) vs. number of processes (i.e., logical nodes: 8, 16, 32); series: WW (512 buffer), WPs (1k buffer), PP (1k buffer).]

[Figure: Index-Gather - 8M request/PE; latency (s) vs. number of physical nodes (1-8); series: WW, WPs, PP.]
Fig. 14. SSSP - small problem size - Time
Fig. 15. SSSP - small problem size - Wasted updates (normalized)
Fig. 16. SSSP - large input size example, time
Fig. 17. SSSP - large input size example, wasted updates (normalized)