A Performance Study To Guide RDMA Programming Decisions
Abstract—This paper describes a performance study of Remote Direct Memory Access (RDMA) programming techniques. Its goal is to use these results as a guide for making "best practice" RDMA programming decisions.

Infiniband RDMA is widely used in scientific high performance computing (HPC) clusters as a low-latency, high-bandwidth, reliable interconnect accessed via MPI. Recently it is gaining adherents outside scientific HPC as high-speed clusters appear in other application areas for which MPI is not suitable. RDMA enables user applications to move data directly between virtual memory on different nodes without operating system intervention, so there is a need to know how to incorporate RDMA access into high-level programs. But RDMA offers more options to a programmer than traditional sockets programming, and it is not always obvious what the performance tradeoffs of these options might be. This study is intended to provide some answers.

Keywords-RDMA; Infiniband; OFA; OFED; HPC;

I. INTRODUCTION

As networks grow faster, Remote Direct Memory Access (RDMA) is rapidly gaining popularity outside its traditional application area of scientific HPC. RDMA allows application programmers to directly transfer data between user-space virtual memories on different machines without kernel intervention, thereby bypassing extra copying and processing that reduce performance in conventional networking technologies. RDMA is completely message-oriented, so all application messages are sent and received as units, unlike TCP/IP, which treats network communication as a stream of bytes.

User-level RDMA programming offers many more options and is more complex than traditional socket programming, as it requires the programmer to directly manipulate functions and data structures defined by the network interface in order to directly control all aspects of RDMA message transmission. Therefore, the programmer must make many decisions which may drastically affect performance. The goal of this paper is to evaluate the performance of numerous methods of directly using the application-level RDMA features in practice.

Similar performance evaluations have been done for specific applications and tools that use RDMA, such as MPI [1], [2], FTP [3], GridFTP [4], NFS [5], AMQP [6], PVFS [7], etc. The importance of our work is that we evaluate RDMA directly, without any particular application or environment in mind, and provide guidance on general design options faced by anyone directly using RDMA.

A. Background

Three RDMA technologies are in use today: Infiniband (IB), Internet Wide-Area RDMA Protocol (iWARP), and RDMA over Converged Ethernet (RoCE). Infiniband [8], [9] defines a completely self-contained protocol stack, utilizing its own interface adapters, switches, and cables. iWARP defines three thin protocol layers [10]–[12] on top of the existing TCP/IP protocol (i.e., standard Internet). RoCE [13] simply replaces the physical and data-link layers of the Infiniband protocol stack with Ethernet. All three technologies are packaged as self-contained interface adapters and drivers, and there are software-only versions for both iWARP [14] and RoCE [15].

The OpenFabrics Alliance (OFA) [16] publishes and maintains a common user-level Application Programming Interface (API) for all three RDMA technologies. It provides direct, efficient user-level access to all features supported by each RDMA technology. OFA also provides open access to a reference implementation of this API, along with useful utilities, called the OpenFabrics Enterprise Distribution (OFED) [17]. This API is used throughout this study.

In RDMA, actions are specified by verbs which convey requests to the network adapter. Each verb, such as post_send, is represented in the OFED API as a library function, ibv_post_send, with associated parameters and data structures. To initiate a transfer, ibv_post_send places a work request data structure describing the transfer onto a network adapter queue. Data transfers are all asynchronous: once a work request has been posted, control returns to the user-space application, which must later use the ibv_poll_cq function to remove a work completion data structure from a network adapter's completion queue. This completion contains the status for the finished transfer and tells the application it can again safely access the virtual memory used in the transfer.
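As an illustration of this post/poll cycle, the following minimal sketch (not taken from our test programs) posts one SEND work request and busy-polls for its completion. It assumes a queue pair qp, a completion queue cq, and a registered memory region mr have already been set up with the usual connection establishment calls; error handling is abbreviated.

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Minimal sketch: post one SEND describing "len" bytes of the registered
     * region "mr", then busy-poll the completion queue until it finishes. */
    static int send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                             struct ibv_mr *mr, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) mr->addr,   /* registered virtual address */
            .length = len,
            .lkey   = mr->lkey,               /* local protection key */
        };
        struct ibv_send_wr wr = {
            .wr_id      = 1,                  /* echoed back in the completion */
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,
            .send_flags = IBV_SEND_SIGNALED,  /* request a completion entry */
        };
        struct ibv_send_wr *bad_wr;
        struct ibv_wc wc;
        int n;

        if (ibv_post_send(qp, &wr, &bad_wr))  /* hand the work request to the adapter */
            return -1;

        do {                                  /* busy polling on the completion queue */
            n = ibv_poll_cq(cq, 1, &wc);
        } while (n == 0);

        return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
    }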
RDMA provides four sets of work request opcodes to describe a data transfer. The SEND/RECV set superficially resembles a normal socket transfer. A receiver first posts a RECV work request that describes a virtual memory
[Figure 1. Ping with each completion detection strategy for small messages. (a) CPU usage (event notification only); (b) average one-way time; (c) average throughput. Curves: SEND/RECV and RDMA_WRITE_WITH_IMM, with and without inline, with busy polling and event notification; x-axis: message size (bytes), 1 to 1024.]
[Figure 2. Blast with each opcode set and completion detection strategy for small messages using one buffer. (a) Average one-way time; (b) average throughput; (c) average one-way time with and without inline, busy polling only. X-axis: message size (bytes).]
has completed. In the blast tool, which can run with each of the 4 opcode sets, a client sends data to a server as fast as possible, but the server does not acknowledge it.

Tests are run between two identical nodes, each consisting of twin 6-core Intel Westmere-EP 2.93GHz processors with 6 GB of RAM and PCIe-2.0x8 running OFED 1.5.4 on Scientific Linux 6.1. Each node has a dual-port Mellanox MT26428 adapter with 256 byte cache line, one port configured for Infiniband 4xQDR ConnectX VPI with 4096-byte MTUs, the other for 10 Gbps Ethernet with 9000-byte jumbo frames. With these configurations, each Infiniband or RoCE frame can carry up to 4096 bytes of user data. Nodes are connected back-to-back on both ports, and all transfers use Reliable Connection (RC) transport mode, which fragments and reassembles large user messages.

We measure 3 performance metrics, all based on elapsed time, which is measured from just before the first message transfer is posted to just after the last transfer completes. Average throughput is the number of messages times the size of each message divided by elapsed time. Average one-way time per message for blast is the elapsed time divided by the number of messages; for ping it is half this value. Average CPU utilization is the sum of user and system CPU time reported by the POSIX getrusage function divided by elapsed time.
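As a rough sketch (with hypothetical names, not code from our test programs), the three metrics could be computed as follows, assuming the caller records elapsed wall-clock seconds around the transfer loop and takes a getrusage delta over just that interval.

    #include <sys/time.h>
    #include <sys/resource.h>

    struct metrics { double mbps, one_way_usec, cpu_fraction; };

    static double tv_seconds(struct timeval tv)
    {
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    /* elapsed_sec: wall-clock time around the transfer loop;
     * cpu: getrusage() delta (ru_utime/ru_stime) over that same interval. */
    static struct metrics compute_metrics(double elapsed_sec, long messages,
                                          long msg_bytes, int is_ping,
                                          const struct rusage *cpu)
    {
        struct metrics m;

        /* throughput = messages * message size / elapsed time (Megabits/s) */
        m.mbps = (double) messages * msg_bytes * 8.0 / elapsed_sec / 1e6;

        /* one-way time = elapsed / messages for blast; half of that for ping */
        m.one_way_usec = elapsed_sec * 1e6 / messages / (is_ping ? 2.0 : 1.0);

        /* CPU utilization = (user + system CPU time) / elapsed time */
        m.cpu_fraction = (tv_seconds(cpu->ru_utime) +
                          tv_seconds(cpu->ru_stime)) / elapsed_sec;
        return m;
    }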
III. PERFORMANCE RESULTS

A. Ping example, small messages

Ping is the application generally used to measure round trip time. It repeatedly sends a fixed-size message back and forth between client and server. In our tests, message size varies by powers of 2 from 1 to 1024 bytes. Figure 1 shows a total of 8 combinations of SEND/RECV and RDMA_WRITE_WITH_IMM/RECV, with and without inline, and with busy polling and event notification.

Figure 1a shows slight differences between the CPU usage by opcode sets, with inline requiring more cycles for messages less than 32 bytes due to the extra data copy involved. Only event notification cases are graphed, as busy polling always has 100% CPU usage. Figure 1b shows clearly that busy polling produces lower one-way times than event notification, and transfers with inline perform better than those without it (6.2 microseconds for messages smaller than the 256 byte cache line). Figure 1c shows that for each opcode set throughput increases proportionally with message size and is slightly better for busy polling at any given size. There is little difference in throughput between opcode sets, with low total throughput for them all, since small messages cannot maximize throughput. Using either SEND/RECV or RDMA_WRITE_WITH_IMM/RECV with inline and busy polling gives the best one-way time and marginally better throughput for small messages, but suffers from 100% CPU utilization.
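The event notification alternative to busy polling works through a completion channel rather than repeated polling. A minimal sketch follows, assuming cq was created on an ibv_comp_channel ch (via ibv_create_comp_channel and ibv_create_cq); real code also arms the queue before posting and drains it completely after each event.

    #include <infiniband/verbs.h>

    /* Minimal sketch: block until the adapter raises a completion event,
     * then reap one work completion from the queue. */
    static int wait_for_completion(struct ibv_comp_channel *ch,
                                   struct ibv_cq *cq, struct ibv_wc *wc)
    {
        struct ibv_cq *ev_cq;
        void *ev_ctx;

        if (ibv_req_notify_cq(cq, 0))              /* arm: ask for the next event */
            return -1;
        if (ibv_get_cq_event(ch, &ev_cq, &ev_ctx)) /* sleeps until an event arrives */
            return -1;
        ibv_ack_cq_events(ev_cq, 1);               /* acknowledge the event */
        if (ibv_req_notify_cq(ev_cq, 0))           /* re-arm before polling */
            return -1;
        return ibv_poll_cq(ev_cq, 1, wc);          /* returns number of completions */
    }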
[Figure 3. Blast with RDMA_WRITE with busy polling for small messages with inline using multiple buffers. (a) Average one-way time with single work request per posting; (b) average throughput with single work request per posting; (c) average one-way time with multiple work requests per posting; (d) average throughput with multiple work requests per posting. Curves: message sizes from 1 to 912 bytes; x-axis: buffer count.]
[Figure 4. Blast with RDMA_WRITE for each completion detection strategy for large messages using multiple buffers and multiple work requests per posting. (a) Active side CPU usage (event notification only); (b) average one-way time; (c) average throughput. Curves: 1KiB to 64KiB messages; x-axis: buffer count.]
B. Blast example, small messages, single buffer

Next is a blast study using small messages. It compares all transfer opcode sets for both busy polling and event notification. As with ping, busy polling cases have lower one-way time and higher throughput, as shown in Figure 2a and Figure 2b respectively. CPU usage for event notification cases is not significantly different between the opcode sets. As expected, RDMA_READ with event notification performs poorly. However, RDMA_READ with busy polling gives slightly better performance than other opcodes, which is odd, as RDMA_READ is expected to perform worse because data must flow from the responder back to the requester, which requires a full round-trip in order to deliver the first bit.

Figure 2c examines the use of inline in WRITE operations for small message blast. This is only done for busy polling as it performs much better than event notification for small messages. Figure 2c shows that one-way time is lowest for both opcodes when using inline and messages smaller than the 256 byte cache line, although our adapters accepted up to 912 bytes of inline data.

C. Blast example, small messages, multiple buffers

Next consider the use of multiple outstanding buffers. We initially post an RDMA_WRITE for every buffer, then repost each buffer as soon as we get the completion of its previous transfer. This way, the interface adapter processes posted work queue entries in parallel with the application code processing completions.

In Figure 3a, we vary the buffer count for several message sizes using RDMA_WRITE with inline and busy polling. One-way time for messages less than or equal to 64 bytes is only about 300 nanoseconds when using 8 or 16 buffers, and is less than 1 microsecond when using 2 or 4 buffers. For larger buffer counts, one-way time for messages smaller than 64 bytes increases, but remains around 300 nanoseconds for 64 byte messages. Throughput, shown in Figure 3b, increases proportionally with message size, except that throughput of 64 byte messages slightly exceeds that of 256 byte messages for 8 or more buffers, while throughput for smaller messages drops noticeably for 32 or 64 buffers.
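The buffer-reposting scheme described at the start of this subsection can be sketched roughly as follows (hypothetical names, not code from our test programs; it assumes one pre-built work request per buffer with wr[i].wr_id set to i, and uses busy polling; return values are ignored for brevity).

    /* Minimal sketch of the blast pipeline: keep "nbuffers" transfers
     * outstanding by reposting each buffer as soon as its completion is reaped. */
    static void blast_loop(struct ibv_qp *qp, struct ibv_cq *cq,
                           struct ibv_send_wr wr[], int nbuffers, long total)
    {
        struct ibv_send_wr *bad;
        struct ibv_wc wc;
        long posted = 0, completed = 0;

        for (int i = 0; i < nbuffers && posted < total; i++, posted++)
            ibv_post_send(qp, &wr[i], &bad);       /* initially post every buffer */

        while (completed < total) {
            if (ibv_poll_cq(cq, 1, &wc) > 0) {     /* busy polling */
                completed++;
                if (posted < total) {              /* repost the buffer just finished */
                    ibv_post_send(qp, &wr[wc.wr_id], &bad);
                    posted++;
                }
            }
        }
    }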
[Figure 5. Blast with each opcode set and each completion detection strategy for large messages using four buffers with multiple work requests per posting. (a) Active side CPU usage (event notification only); (b) passive side CPU usage (event notification only); (c) average one-way time; (d) average throughput. X-axis: message size, 1KiB to 256MiB.]
[Figure 6. (a) Average one-way time for blast with RDMA_WRITE and busy polling for 16 byte messages, with each work request submission strategy using multiple buffers with multiple work requests per posting; (b) average throughput for blast with RDMA_WRITE and busy polling for 16 byte messages, with each work request submission strategy using multiple buffers with multiple work requests per posting; (c) average one-way time for ping with client issuing RDMA_WRITE and RDMA_READ, with each completion detection strategy for small messages. Panels (a) and (b): x-axis buffer count; panel (c): x-axis message size (bytes).]
D. Blast example, small messages, multiple buffers, multiple work requests per posting

The next study is identical to the previous, except instead of posting each work request as we process its previous completion, we place it into a list and post that list after processing all available completions. Comparing one-way times in Figure 3c with those in Figure 3a shows that times for 256 and 912 byte messages are unchanged, but for 64 byte and smaller messages they increase when using more than 2 buffers, and for more than 16 buffers they increase to that of 256-byte messages. For messages of 64 bytes or less with 4, 8 or 16 buffers Figure 3d does not show the increase in throughput seen in Figure 3b. Perhaps the time needed to process a large number of completions before posting a single list of new work requests causes the adapter's work queue to empty. In all cases, posting multiple work requests produces less dependence on the number of buffers than does single posting of work requests.
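In terms of the verbs API, the difference from the previous study is only in how work requests are handed to the adapter: they are chained through their next fields and posted with a single call. A minimal sketch with hypothetical names:

    /* Minimal sketch: chain "count" pre-built work requests into one list
     * and hand the whole list to the adapter with a single ibv_post_send(). */
    static int post_wr_list(struct ibv_qp *qp, struct ibv_send_wr wr[], int count)
    {
        struct ibv_send_wr *bad;

        for (int i = 0; i < count; i++)
            wr[i].next = (i + 1 < count) ? &wr[i + 1] : NULL;

        return ibv_post_send(qp, &wr[0], &bad);
    }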
E. Blast example, large messages, multiple buffers

We next examine the effects of the buffer count and message size on large messages, using RDMA_WRITE without inline (since inline can be used only with small messages). We vary the buffer count from 1 to 128 for 1, 8, 32, and 64 kibibyte messages, and post multiple work requests per list. Figure 4a shows that 64 kibibyte messages have the lowest CPU utilization when using event notification, and, for all message sizes examined, using more than 4 buffers has little or no effect on CPU utilization. The one-way time, shown in Figure 4b, and throughput, shown in Figure 4c, both increase as message size increases. Also, for both one-way time and throughput, busy polling and event notification results converge given enough buffers (ranging from 2 or more buffers for 64KiB to 5 or more for 1KiB).

Next we study the effect of each opcode set for large message transfers. We vary the message size from 1 kibibyte to 256 mebibytes and use only 4 buffers, as it was just shown that using more buffers produces no performance gains. The CPU usage for the active and passive side of each transfer is shown in Figure 5a and Figure 5b, respectively. Active side CPU utilization generally decreases with message size, although there is a bump around 8KiB. Passive side CPU utilization is always 0 for RDMA_WRITE and RDMA_READ, but is similar to the active side for SEND/RECV and RDMA_WRITE_WITH_IMM/RECV. One-way time, shown in Figure 5c, and throughput, shown in Figure 5d, both perform best for busy polling up to about 16 kibibytes, at which point there is no difference between busy polling and event notification. At 32 kibibytes and above, the transfer operation also has no effect on performance.
[Figure 7. Blast with each opcode set and each Infiniband speed with busy polling for 64KiB messages using multiple buffers with multiple work requests per posting. (a) Average throughput; (b) average one-way time. Curves: SDR, DDR, and QDR for each opcode set; x-axis: buffer count, 1 to 16.]
[Figure 8. Comparison of QDR Infiniband and RoCE for blast with busy polling. (a) Average one-way time with each opcode set for small messages using one buffer; (b) average throughput with each opcode set for small messages using one buffer; (c) average one-way time with RDMA_WRITE for large messages using multiple buffers with multiple work requests per posting; (d) average throughput with RDMA_WRITE for large messages using multiple buffers with multiple work requests per posting.]
F. Completion signaling

All tests so far used full signaling. In this test we use blast with RDMA_WRITE and 16 byte messages to compare full signaling, where every request is signaled, against periodic signaling, where only one work request out of every nbuffers/2 is signaled. Effects are visible only when using more than 2 buffers, since otherwise we signal every buffer. Both one-way time in Figure 6a and throughput in Figure 6b show much better performance when using one work request per post than when using multiple requests per post. But there is no performance difference between full and periodic signaling with one work request per post, and with multiple work requests per post the only effect of periodic signaling is to decrease performance when there are 4, 8 or 16 buffers. We believe this is due to the fact that in the blast example all buffers need processing after they are transferred, so not signaling for a completion just delays that processing until a signaled completion occurs, at which point all buffers transferred up to that time are processed in a big batch, breaking the flow of new transfer postings.

An example without this batching effect is ping with an active client using RDMA_WRITE to push messages and RDMA_READ to pull them back from a passive server. Every RDMA_WRITE can be unsignaled because its buffer does not need any processing after the transfer. Figure 6c shows that one-way time is always lower when the RDMA_WRITE is unsignaled, more so with event notification, less so with busy polling. This figure also shows remarkably little variation with message size.
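In the verbs API this choice reduces to which work requests carry IBV_SEND_SIGNALED in their send_flags, assuming the queue pair was created with sq_sig_all set to 0. A minimal sketch of periodic signaling, with hypothetical names:

    /* Minimal sketch: signal only one work request out of every nbuffers/2,
     * leaving the rest unsignaled so they generate no completion entries. */
    static void post_periodically_signaled(struct ibv_qp *qp,
                                           struct ibv_send_wr wr[],
                                           int nbuffers, long index)
    {
        struct ibv_send_wr *bad;
        int interval = (nbuffers / 2 > 0) ? nbuffers / 2 : 1;

        wr[index % nbuffers].send_flags =
            (index % interval == 0) ? IBV_SEND_SIGNALED : 0;

        ibv_post_send(qp, &wr[index % nbuffers], &bad);
    }

Unsignaled requests still occupy send queue slots until some later signaled request completes, which is why completions must still be requested periodically.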
G. Infiniband speed comparison

All previous tests were done at QDR speed, the maximum supported by our adapter. However, Infiniband adapters can be configured to run at several speeds, as shown in Gigabits per second (Gbps) in Table I. Usable Gbps is 20% lower than raw Gbps due to 8b/10b encoding on serial lines. Figure 7a shows that throughput doubles from near the SDR maximum of 8 usable Gbps to near the DDR maximum of 16, but it does not double again from DDR to QDR. The observed 25.6 Gbps for QDR is about 20% lower than the expected 32 Gbps due to the overhead of the PCIe-2 bus, an impact also noted in [2] and [18]. PCIe-2 overhead also shows up in Figure 7b, where the observed DDR one-way time of 33 microseconds is half the 66 observed for SDR, but the observed QDR time of 21 microseconds is 20% greater than the expected 16.5.
Table I. Infiniband and RoCE speeds.

    Designation   raw Gbps   usable Gbps   Gbps over PCIe-2
    IB 4X SDR       10           8              8
    RoCE            12.5        10             10
    IB 4X DDR       20          16             16
    IB 4X QDR       40          32             25.6
H. Infiniband and RoCE comparison

We next compare 10 Gbps RoCE with 25.6 Gbps QDR Infiniband. Looking first at small messages for each opcode set, Figure 8a shows one-way times for Infiniband are less than a microsecond lower than those for RoCE, and Figure 8b shows that although QDR Infiniband has much greater maximum throughput than RoCE, there is little observable difference for small messages.

Examining larger messages transferred with RDMA_WRITE, the differences between Infiniband and RoCE are greater. One-way time and throughput for 1, 8, and 64 kibibyte messages are shown in Figure 8c and Figure 8d. For all large message sizes, one-way time is roughly proportional to message size and is essentially independent of the buffer count for both technologies, but as message size increases, one-way time for RoCE increases faster than for Infiniband. When using one buffer, the throughput increases with message size for both technologies, but as the buffer count increases, all of the RoCE curves converge near its maximum 10 Gbps, whereas for Infiniband, the 8 and 64 kibibyte curves converge near the higher QDR maximum of 25.6 Gbps and only the 1 kibibyte curve converges at 10 Gbps.

IV. CONCLUSIONS

In all situations performance is much more sensitive to the choice of RDMA options when using small messages than when using large messages.

For all 4 opcode sets with small messages up to 4 kibibytes, much lower one-way time and much higher throughput are achieved by using busy polling rather than event notification to wait for completions. For messages of 16 kibibytes and larger both busy polling and event notification produce the same one-way time and throughput. But busy polling also causes 100% CPU utilization for all message sizes, compared to about 20% with event notification for small messages up to 512 bytes and 0% for messages of 4 mebibytes or more.

Regardless of whether busy polling or event notification is employed, messages smaller than the cache line size give noticeably better one-way times for opcode sets that use inline, with a slight improvement in throughput but slightly higher CPU utilization with event notification. Although the amount of inline data allowed depends on the implementation, it always makes sense to use inline whenever adapters support it.

An application must use the opcode best for its needs. In situations such as ping, where both sides need to know when data arrives, the choice is limited to SEND/RECV and RDMA_WRITE_WITH_IMM/RECV, both of which perform essentially equally. For each of these, busy polling and inline always give better one-way times and throughput, but higher CPU utilization.

For all message sizes the choices for completion detection and inline are more significant factors in determining performance than the opcode, which may be limited by application demands. For example, RDMA_READ and RDMA_WRITE result in no CPU utilization on the passive side, which may be important for passive side scalability. More often, both sides of a transfer need to know when it completes, in which case SEND/RECV or RDMA_WRITE_WITH_IMM/RECV is best, and if messages are small enough to allow inline then the SEND or RDMA_WRITE_WITH_IMM could also be unsignaled.

It is best to have between 3 and 8 transfers simultaneously queued on the adapter. With small messages this number should be closer to 8; for large messages closer to 3. Using more buffers gives no performance increase, so staying within these limits avoids consuming extra adapter resources. Our studies on buffer numbers do not consider the additional delay introduced when communicating nodes are separated by any significant distance. Clearly a longer communications channel can store more buffers in transit, so that more simultaneously queued buffers would be necessary to keep the channel full, especially when small messages are being transmitted.

In general, rather than collecting work requests into lists it is better to post them individually as soon as possible and let the adapter queue them. This ensures that the connection is kept busy. A list might be used if several work requests must be created and sent together.

Completion signaling has a small performance impact in specialized circumstances. Full signaling should be used if there is any need to process a transfer's completion, but in a situation such as ping with RDMA_WRITE followed by RDMA_READ, performance improves if the RDMA_WRITE is not signaled.

For small messages there is little performance difference between RoCE and Infiniband. For larger messages,
QDR Infiniband's 25.6 Gbps outperforms RoCE's 10 Gbps. Therefore, if an application only transfers small messages or network equipment cost is a significant factor, then RoCE may be appropriate, as it runs over Ethernet wires and switches that may already be installed. Since the API is identical across technologies, an application could be written and tested on RoCE, then migrated to Infiniband. This means that RoCE may be good for initial RDMA programming or development, or in applications where high throughput is not necessary but the other benefits of RDMA are still desired.

Non-RDMA factors are also important. Platforms with PCIe-2 cannot fully utilize an Infiniband 4X QDR link, although fabric switches should be able to handle full traffic volume, even if each endpoint has limited throughput.

ACKNOWLEDGMENT

This research is supported in part by National Science Foundation grant OCI-1127228.

REFERENCES

[1] M. Koop, T. Jones, and D. Panda, "Reducing Connection Memory Requirements of MPI for InfiniBand Clusters: A Message Coalescing Approach," in Seventh IEEE International Symposium on Cluster Computing and the Grid, 2007.

[2] M. Koop, W. Huang, K. Gopalakrishan, and D. Panda, "Performance Analysis and Evaluation of PCI 2.0 and Quad-Data Rate InfiniBand," in Sixteenth IEEE Symposium on High Performance Interconnects, 2008.

[3] P. Lai, H. Subramoni, S. Narravula, A. Mamidala, and D. Panda, "Designing Efficient FTP Mechanisms for High Performance Data-Transfer over InfiniBand," in International Conference on Parallel Processing, 2009.

[10] P. Culley, U. Elzur, R. Recio, S. Bailey, and J. Carrier, "Marker PDU Aligned Framing for TCP Specification," RFC 5044, Oct. 2007. [Online]. Available: http://www.ietf.org/rfc/rfc5044.txt

[11] H. Shah, J. Pinkerton, R. Recio, and P. Culley, "Direct Data Placement over Reliable Transports," RFC 5041, Oct. 2007. [Online]. Available: http://www.ietf.org/rfc/rfc5041.txt

[12] R. Recio, B. Metzler, P. Culley, J. Hilland, and D. Garcia, "A Remote Direct Memory Access Protocol Specification," RFC 5040, Oct. 2007. [Online]. Available: http://www.ietf.org/rfc/rfc5040.txt

[13] Infiniband Trade Association, "Supplement to Infiniband Architecture Specification Volume 1, Release 1.2.1: Annex A16: RDMA over Converged Ethernet (RoCE)," Apr. 2010.

[14] B. Metzler, F. Neeser, and P. Frey, "Softiwarp: A Software iWARP Driver for OpenFabrics," www.openfabrics.org/archives/spring2009sonoma/monday/softiwrp.pdf, 2009.

[15] System Fabric Works, "Soft RoCE," www.systemfabricworks.com/downloads/roce, 2011.

[16] OpenFabrics Alliance, http://www.openfabrics.org.

[17] OpenFabrics Enterprise Distribution, www.mellanox.com/pdf/products/software/OFED_PB_1.pdf, 2008.

[18] National Instruments, "PCI Express – An Overview of the PCI Express Standard," http://zone.ni.com/devzone/cda/tut/p/id/3767, 2009.