
2012 IEEE 14th International Conference on High Performance Computing and Communications

A Performance Study to Guide RDMA Programming Decisions

Patrick MacArthur, Robert D. Russell


Computer Science Department
University of New Hampshire
Durham, New Hampshire 03824-3591, USA
[email protected], [email protected]

Abstract—This paper describes a performance study of Remote Direct Memory Access (RDMA) programming techniques. Its goal is to use these results as a guide for making "best practice" RDMA programming decisions.

Infiniband RDMA is widely used in scientific high performance computing (HPC) clusters as a low-latency, high-bandwidth, reliable interconnect accessed via MPI. Recently it is gaining adherents outside scientific HPC as high-speed clusters appear in other application areas for which MPI is not suitable. RDMA enables user applications to move data directly between virtual memory on different nodes without operating system intervention, so there is a need to know how to incorporate RDMA access into high-level programs. But RDMA offers more options to a programmer than traditional sockets programming, and it is not always obvious what the performance tradeoffs of these options might be. This study is intended to provide some answers.

Keywords-RDMA; Infiniband; OFA; OFED; HPC;

I. INTRODUCTION

As networks grow faster, Remote Direct Memory Access (RDMA) is rapidly gaining popularity outside its traditional application area of scientific HPC. RDMA allows application programmers to directly transfer data between user-space virtual memories on different machines without kernel intervention, thereby bypassing extra copying and processing that reduce performance in conventional networking technologies. RDMA is completely message-oriented, so all application messages are sent and received as units, unlike TCP/IP, which treats network communication as a stream of bytes.

User-level RDMA programming offers many more options and is more complex than traditional socket programming, as it requires the programmer to directly manipulate functions and data structures defined by the network interface in order to directly control all aspects of RDMA message transmission. Therefore, the programmer must make many decisions which may drastically affect performance. The goal of this paper is to evaluate the performance of numerous methods of directly using the application-level RDMA features in practice.

Similar performance evaluations have been done for specific applications and tools that use RDMA, such as MPI [1], [2], FTP [3], GridFTP [4], NFS [5], AMQP [6], PVFS [7], etc. The importance of our work is that we evaluate RDMA directly, without any particular application or environment in mind, and provide guidance on general design options faced by anyone directly using RDMA.

A. Background

Three RDMA technologies are in use today: Infiniband (IB), Internet Wide-Area RDMA Protocol (iWARP), and RDMA over Converged Ethernet (RoCE). Infiniband [8], [9] defines a completely self-contained protocol stack, utilizing its own interface adapters, switches, and cables. iWARP defines three thin protocol layers [10]-[12] on top of the existing TCP/IP protocol (i.e., standard Internet). RoCE [13] simply replaces the physical and data-link layers of the Infiniband protocol stack with Ethernet. All three technologies are packaged as self-contained interface adapters and drivers, and there are software-only versions for both iWARP [14] and RoCE [15].

The OpenFabrics Alliance (OFA) [16] publishes and maintains a common user-level Application Programming Interface (API) for all three RDMA technologies. It provides direct, efficient user-level access to all features supported by each RDMA technology. OFA also provides open access to a reference implementation of this API, along with useful utilities, called the OpenFabrics Enterprise Distribution (OFED) [17]. This API is used throughout this study.

In RDMA, actions are specified by verbs which convey requests to the network adapter. Each verb, such as post_send, is represented in the OFED API as a library function, ibv_post_send, with associated parameters and data structures. To initiate a transfer, ibv_post_send places a work request data structure describing the transfer onto a network adapter queue. Data transfers are all asynchronous: once a work request has been posted, control returns to the user-space application, which must later use the ibv_poll_cq function to remove a work completion data structure from a network adapter's completion queue. This completion contains the status for the finished transfer and tells the application it can again safely access the virtual memory used in the transfer.
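To make this flow concrete, the following sketch (in C, using the OFED verbs API; it is not taken from the paper's test programs and assumes an already-connected reliable-connection queue pair qp, a completion queue cq, and a memory region mr registered over buf) posts a single SEND work request and then busy-polls the completion queue until its work completion arrives:

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Sketch: post one SEND work request, then busy-poll for its completion.
     * qp, cq, mr, buf, and len are assumed to have been set up elsewhere. */
    static int send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                             struct ibv_mr *mr, void *buf, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) buf,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = 1,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,
            .send_flags = IBV_SEND_SIGNALED,   /* ask for a work completion */
        };
        struct ibv_send_wr *bad_wr;
        struct ibv_wc wc;
        int n;

        if (ibv_post_send(qp, &wr, &bad_wr))   /* hand the work request to the adapter */
            return -1;

        do {                                   /* busy polling on the completion queue */
            n = ibv_poll_cq(cq, 1, &wc);
        } while (n == 0);

        if (n < 0 || wc.status != IBV_WC_SUCCESS)
            return -1;
        return 0;                              /* buf may now be reused safely */
    }

The receiver would have posted a matching RECV work request with ibv_post_recv before this SEND was issued.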
RDMA provides four sets of work request opcodes to describe a data transfer. The SEND/RECV set superficially resembles a normal socket transfer. A receiver first posts a RECV work request that describes a virtual memory area into which the adapter should place a single message. The sender then posts a SEND work request describing a virtual memory area containing the message to be sent. The network adapters transfer data directly from the sender's virtual memory area to the receiver's virtual memory area without any intermediate copies. Since both sides of the transfer are required to post work requests, this is called a "two-sided" transfer.

The second set is a "one-sided" transfer in which a sender posts an RDMA WRITE request that "pushes" a message directly into a virtual memory area that the receiving side previously described to the sender. The receiving side's CPU is completely "passive" during the transfer, which is why this is called "one-sided."

The third set is also a "one-sided" transfer in which the receiver posts an RDMA READ request that "pulls" a message directly from the sending side's virtual memory, and the sending side's CPU is completely passive.

Because the passive side in a "one-sided" transfer does not know when that transfer completes, there is another "two-sided" opcode set in which the sender posts an RDMA WRITE WITH IMM request to "push" a message directly into the receiving side's virtual memory, as for RDMA WRITE, but the send work request also includes 4 bytes of immediate (out-of-band) data that is delivered to the receiver on completion of the transfer. The receiving side posts a RECV work request to catch these 4 bytes, and the work completion for the RECV indicates the status and amount of data transferred in the message.
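The differences between these opcode sets show up mainly in how the send work request is filled in. The sketch below (an illustration, not code from the paper; remote_addr and rkey are assumed to have been exchanged with the peer during connection setup, for example over an initial SEND/RECV) builds either an RDMA WRITE or an RDMA WRITE WITH IMM work request:

    #include <string.h>
    #include <arpa/inet.h>
    #include <infiniband/verbs.h>

    /* Sketch: fill in a "one-sided" work request.  The peer's remote_addr and
     * rkey describe its registered virtual memory area; sge describes the
     * local data to push.  With with_imm set, the peer must also post a RECV
     * to catch the 4 bytes of immediate data and learn that the message arrived. */
    static void make_write_wr(struct ibv_send_wr *wr, struct ibv_sge *sge,
                              uint64_t remote_addr, uint32_t rkey, int with_imm)
    {
        memset(wr, 0, sizeof(*wr));
        wr->sg_list             = sge;
        wr->num_sge             = 1;
        wr->wr.rdma.remote_addr = remote_addr;  /* where to place the data on the peer */
        wr->wr.rdma.rkey        = rkey;         /* peer's registration key */
        wr->send_flags          = IBV_SEND_SIGNALED;
        if (with_imm) {
            wr->opcode   = IBV_WR_RDMA_WRITE_WITH_IMM;
            wr->imm_data = htonl(42);           /* 4 bytes of out-of-band data (value is arbitrary) */
        } else {
            wr->opcode   = IBV_WR_RDMA_WRITE;   /* purely one-sided; peer CPU stays passive */
        }
    }

An RDMA READ work request is built the same way but with opcode IBV_WR_RDMA_READ, in which case sge names the local area into which the peer's memory is pulled.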

B. Features Evaluated

1) Work Request Opcode Set: Several RDMA features were evaluated for this study. The most obvious feature evaluated was the work request opcode set being used for the transfer, although in practice this choice is often limited by the requirements of the application regardless of performance.

2) Message Size: The second item considered is the message size, which was arbitrarily categorized into small messages containing 512 bytes or less and large messages containing more. This size was chosen since 512 bytes is a standard disk sector; it is not part of any RDMA standard.

3) Inline Data: The API provides an optional "inline" feature that allows an interface adapter to copy the data from small messages into its own memory as part of a posted work request. This immediately frees the buffer for application reuse, and makes the transfer more efficient since the adapter has the data ready to send and does not need to retrieve it over the memory bus during the transfer.
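In the verbs API the inline feature appears in two places: the amount of inline space is negotiated when the queue pair is created, and each small work request then opts in with a flag. A sketch follows (the 512-byte request and the other capacity values are assumptions; the provider adjusts them and reports what it actually granted):

    #include <infiniband/verbs.h>

    /* Sketch: create an RC queue pair with inline support.  On return the
     * provider rewrites init_attr.cap.max_inline_data with the granted size. */
    static struct ibv_qp *create_qp_with_inline(struct ibv_pd *pd, struct ibv_cq *cq)
    {
        struct ibv_qp_init_attr init_attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .qp_type = IBV_QPT_RC,
            .cap     = {
                .max_send_wr     = 64,
                .max_recv_wr     = 64,
                .max_send_sge    = 1,
                .max_recv_sge    = 1,
                .max_inline_data = 512,   /* requested inline space (assumption) */
            },
        };
        return ibv_create_qp(pd, &init_attr);
    }

    /* For a message no larger than the granted inline size, the work request
     * is marked:
     *     wr.send_flags = IBV_SEND_SIGNALED | IBV_SEND_INLINE;
     * The adapter copies the data while ibv_post_send() executes, so the
     * buffer may be reused as soon as the call returns. */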
4) Completion Detection: An asynchronous RDMA transfer starts when an application posts a work request to the interface adapter, and completes when the interface adapter enqueues a work completion in its completion queue. There are two strategies which an application can employ to determine when to pick up a work completion.

The first completion detection strategy, called "busy polling", is to repeatedly poll the completion queue until a completion becomes available. It allows immediate reaction to completions at the cost of very high CPU utilization, but requires no operating system intervention.

The second strategy, called "event notification", is to set up a completion channel that allows an application to wait until the interface adapter signals a notification on this channel, at which time the application obtains the work completion by polling. It requires the application to wait for the notification by transferring to the operating system, but reduces CPU utilization significantly.
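Busy polling is simply a loop around ibv_poll_cq, as in the earlier SEND sketch. The event notification strategy can be sketched as follows (assuming the completion queue was created over a completion channel, e.g. with ibv_create_comp_channel followed by ibv_create_cq):

    #include <infiniband/verbs.h>

    /* Sketch: "event notification" completion detection.  The CQ must have
     * been created with a completion channel:
     *     struct ibv_comp_channel *ch = ibv_create_comp_channel(ctx);
     *     struct ibv_cq *cq = ibv_create_cq(ctx, depth, NULL, ch, 0);        */
    static int wait_one_completion(struct ibv_comp_channel *ch,
                                   struct ibv_cq *cq, struct ibv_wc *wc)
    {
        struct ibv_cq *ev_cq;
        void *ev_ctx;
        int n;

        if (ibv_req_notify_cq(cq, 0))            /* arm the CQ for the next completion */
            return -1;
        n = ibv_poll_cq(cq, 1, wc);              /* re-check: a completion may have
                                                    arrived just before the CQ was armed */
        if (n != 0)
            return n == 1 ? 0 : -1;

        if (ibv_get_cq_event(ch, &ev_cq, &ev_ctx))   /* block in the OS until notified */
            return -1;
        ibv_ack_cq_events(ev_cq, 1);             /* events must be acknowledged */

        return ibv_poll_cq(cq, 1, wc) == 1 ? 0 : -1;   /* the notification does not carry
                                                          the completion; poll to fetch it */
    }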

[Figure 1 (plots not reproduced). Ping with each completion detection strategy for small messages: (a) CPU usage (event notification only); (b) average one-way time; (c) average throughput. X-axis: message size (bytes).]

[Figure 2 (plots not reproduced). Blast with each opcode set and completion detection strategy for small messages using one buffer: (a) average one-way time; (b) average throughput; (c) average one-way time with and without inline, busy polling only. X-axis: message size (bytes).]

5) Simultaneous Operations (Multiple Buffers): We set up the use of simultaneous operations per connection by posting multiple buffers in a round-robin fashion so that the interface adapter queues them.

6) Work Request Submission Lists: The functions that post work requests take a linked list of work requests as an argument. We compare the performance of creating a list of work requests and submitting them in a single posting ("multiple work requests per post") with that of posting individual work requests as single-element lists ("single work request per post").
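In the API the list is expressed through the next field of each work request. A sketch of the two submission strategies for an array wr[0..n-1] of already-initialized send work requests (the array and its initialization are assumptions):

    #include <stddef.h>
    #include <infiniband/verbs.h>

    /* Sketch: "single work request per post" -- each request is submitted as a
     * one-element list. */
    static int post_one_at_a_time(struct ibv_qp *qp, struct ibv_send_wr *wr, int n)
    {
        struct ibv_send_wr *bad;
        for (int i = 0; i < n; i++) {
            wr[i].next = NULL;
            if (ibv_post_send(qp, &wr[i], &bad))
                return -1;
        }
        return 0;
    }

    /* Sketch: "multiple work requests per post" -- chain the requests through
     * their next pointers and submit the whole list with one call. */
    static int post_as_one_list(struct ibv_qp *qp, struct ibv_send_wr *wr, int n)
    {
        struct ibv_send_wr *bad;
        for (int i = 0; i < n - 1; i++)
            wr[i].next = &wr[i + 1];
        wr[n - 1].next = NULL;
        return ibv_post_send(qp, wr, &bad);
    }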
7) Completion Signaling: For all transfer opcodes except RECV, a work completion is generated only if a "signaled" flag is set in the work request. If this flag is not set, the "unsignaled" work request still consumes completion queue resources but does not generate a work completion data structure or notification event. To avoid depleting completion queue resources, applications must periodically post a signaled work request and process the generated completion.

We compare sequences containing only signaled work requests ("full signaling") against sequences containing both signaled and unsignaled work requests ("periodic signaling"). SEND or RDMA WRITE WITH IMM with inline are good examples of where unsignaled work requests could be used, because the data area is no longer needed by the adapter once the request is posted, allowing the application to reuse it without first receiving a work completion.
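A sketch of periodic signaling follows (the signaling period n is a parameter; the completion signaling tests later in the paper signal one work request out of every nbuffers/2):

    #include <infiniband/verbs.h>

    /* Sketch: "periodic signaling".  Only every nth work request sets
     * IBV_SEND_SIGNALED; on a reliable connection its completion also implies
     * that the preceding unsignaled requests have finished, which keeps the
     * completion queue from filling up. */
    static int post_periodically_signaled(struct ibv_qp *qp, struct ibv_cq *cq,
                                          struct ibv_send_wr *wr,
                                          unsigned long seq, unsigned n)
    {
        struct ibv_send_wr *bad;
        struct ibv_wc wc;
        int signaled = (seq % n == 0);

        wr->send_flags &= ~IBV_SEND_SIGNALED;
        if (signaled)
            wr->send_flags |= IBV_SEND_SIGNALED;

        if (ibv_post_send(qp, wr, &bad))
            return -1;

        if (signaled) {                       /* reap the one completion for this batch */
            while (ibv_poll_cq(cq, 1, &wc) == 0)
                ;
            if (wc.status != IBV_WC_SUCCESS)
                return -1;
        }
        return 0;
    }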
8) Infiniband Wire Speed: Infiniband hardware supports several different wire transmission speeds, and we compare the effect of these speeds on various performance measures.

9) RoCE: We compare the performance of RoCE to Infiniband.

II. TEST PROCEDURES

Our tests are variations of two simple applications, ping and blast. In the ping tool, a client sends data to a server and the server sends it back. Ping has variations for SEND/RECV and RDMA WRITE WITH IMM/RECV, but not for RDMA READ or RDMA WRITE because with these opcodes a server cannot determine when a transfer has completed. In the blast tool, which can run with each of the 4 opcode sets, a client sends data to a server as fast as possible, but the server does not acknowledge it.

Tests are run between two identical nodes, each consisting of twin 6-core Intel Westmere-EP 2.93GHz processors with 6 GB of RAM and PCIe-2.0x8 running OFED 1.5.4 on Scientific Linux 6.1. Each node has a dual-port Mellanox MT26428 adapter with 256 byte cache line, one port configured for Infiniband 4xQDR ConnectX VPI with 4096-byte MTUs, the other for 10 Gbps Ethernet with 9000-byte jumbo frames. With these configurations, each Infiniband or RoCE frame can carry up to 4096 bytes of user data. Nodes are connected back-to-back on both ports, and all transfers use Reliable Connection (RC) transport mode, which fragments and reassembles large user messages.

We measure 3 performance metrics, all based on elapsed time, which is measured from just before the first message transfer is posted to just after the last transfer completes. Average throughput is the number of messages times the size of each message divided by elapsed time. Average one-way time per message for blast is the elapsed time divided by the number of messages; for ping it is half this value. Average CPU utilization is the sum of user and system CPU time reported by the POSIX getrusage function divided by elapsed time.
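The CPU-utilization metric can be computed as in the following sketch (an illustration, not the paper's code; elapsed_sec is assumed to come from wall-clock timestamps taken around the transfer loop, and in practice two getrusage snapshots would be differenced so that setup costs are excluded):

    #include <sys/resource.h>

    /* Sketch: average CPU utilization, as a percent of one CPU, over a timed
     * region of elapsed_sec seconds of wall-clock time. */
    static double cpu_utilization_percent(double elapsed_sec)
    {
        struct rusage ru;
        double cpu_sec;

        if (getrusage(RUSAGE_SELF, &ru) != 0)
            return -1.0;
        cpu_sec = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6   /* user time   */
                + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;  /* system time */
        return 100.0 * cpu_sec / elapsed_sec;
    }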

[Figure 3 (plots not reproduced). Blast with RDMA WRITE with busy polling for small messages with inline using multiple buffers: (a) average one-way time with single work request per posting; (b) average throughput with single work request per posting; (c) average one-way time with multiple work requests per posting; (d) average throughput with multiple work requests per posting. X-axis: buffer count, for message sizes from 1 to 912 bytes.]

[Figure 4 (plots not reproduced). Blast with RDMA WRITE for each completion detection strategy for large messages using multiple buffers and multiple work requests per posting: (a) active side CPU usage (event notification only); (b) average one-way time; (c) average throughput. X-axis: buffer count, for message sizes from 1KiB to 64KiB.]

III. PERFORMANCE RESULTS

A. Ping example, small messages

Ping is the application generally used to measure round trip time. It repeatedly sends a fixed-size message back and forth between client and server. In our tests, message size varies by powers of 2 from 1 to 1024 bytes. Figure 1 shows a total of 8 combinations of SEND/RECV and RDMA WRITE WITH IMM/RECV, with and without inline, and with busy polling and event notification.

Figure 1a shows slight differences between the CPU usage by opcode sets, with inline requiring more cycles for messages less than 32 bytes due to the extra data copy involved. Only event notification cases are graphed, as busy polling always has 100% CPU usage. Figure 1b shows clearly that busy polling produces lower one-way times than event notification, and transfers with inline perform better than those without it (6.2 microseconds for messages smaller than the 256 byte cache line). Figure 1c shows that for each opcode set throughput increases proportionally with message size and is slightly better for busy polling at any given size. There is little difference in throughput between opcode sets, with low total throughput for them all, since small messages cannot maximize throughput. Using either SEND/RECV or RDMA WRITE WITH IMM/RECV with inline and busy polling gives the best one-way time and marginally better throughput for small messages, but suffers from 100% CPU utilization.

B. Blast example, small messages, single buffer

Next is a blast study using small messages. It compares all transfer opcode sets for both busy polling and event notification. As with ping, busy polling cases have lower one-way time and higher throughput, as shown in Figure 2a and Figure 2b respectively. CPU usage for event notification cases is not significantly different between the opcode sets. As expected, RDMA READ with event notification performs poorly. However, RDMA READ with busy polling gives slightly better performance than other opcodes, which is odd, as RDMA READ is expected to perform worse because data must flow from the responder back to the requester, which requires a full round-trip in order to deliver the first bit.

Figure 2c examines the use of inline in WRITE operations for small message blast. This is only done for busy polling as it performs much better than event notification for small messages. Figure 2c shows that one-way time is lowest for both opcodes when using inline and messages smaller than the 256 byte cache line, although our adapters accepted up to 912 bytes of inline data.

C. Blast example, small messages, multiple buffers

Next consider the use of multiple outstanding buffers. We initially post an RDMA WRITE for every buffer, then repost each buffer as soon as we get the completion of its previous transfer. This way, the interface adapter processes posted work queue entries in parallel with the application code processing completions.

In Figure 3a, we vary the buffer count for several message sizes using RDMA WRITE with inline and busy polling. One-way time for messages less than or equal to 64 bytes is only about 300 nanoseconds when using 8 or 16 buffers, and is less than 1 microsecond when using 2 or 4 buffers. For larger buffer counts, one-way time for messages smaller than 64 bytes increases, but remains around 300 nanoseconds for 64 byte messages. Throughput, shown in Figure 3b, increases proportionally with message size, except that throughput of 64 byte messages slightly exceeds that of 256 byte messages for 8 or more buffers, while throughput for smaller messages drops noticeably for 32 or 64 buffers.
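The multiple-buffer blast loop just described can be sketched as follows (an illustration under the same assumptions as the earlier sketches; wr[] is an array of initialized RDMA WRITE work requests, one per buffer, and wr_id identifies which buffer a completion belongs to):

    #include <infiniband/verbs.h>

    /* Sketch: post one RDMA WRITE per buffer up front, then repost each buffer
     * as soon as its completion is reaped, so the adapter keeps working while
     * the application processes completions. */
    static int blast_multibuffer(struct ibv_qp *qp, struct ibv_cq *cq,
                                 struct ibv_send_wr *wr, int nbuffers, long count)
    {
        struct ibv_send_wr *bad;
        struct ibv_wc wc;
        long posted = 0, done = 0;

        for (int i = 0; i < nbuffers && posted < count; i++, posted++) {
            wr[i].wr_id = i;                       /* remember which buffer this is */
            if (ibv_post_send(qp, &wr[i], &bad))
                return -1;
        }
        while (done < count) {
            while (ibv_poll_cq(cq, 1, &wc) == 0)   /* busy polling */
                ;
            if (wc.status != IBV_WC_SUCCESS)
                return -1;
            done++;
            if (posted < count) {                  /* immediately reuse the finished buffer */
                if (ibv_post_send(qp, &wr[wc.wr_id], &bad))
                    return -1;
                posted++;
            }
        }
        return 0;
    }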

[Figure 5 (plots not reproduced). Blast with each opcode set and each completion detection strategy for large messages using four buffers with multiple work requests per posting: (a) active side CPU usage (event notification only); (b) passive side CPU usage (event notification only); (c) average one-way time; (d) average throughput. X-axis: message size, 1KiB to 256MiB.]

[Figure 6 (plots not reproduced). Comparing completion signaling strategies: (a) average one-way time for blast with RDMA WRITE and busy polling for 16 byte messages, with each work request submission strategy, using multiple buffers; (b) average throughput for the same blast configuration; (c) average one-way time for ping with client issuing RDMA WRITE and RDMA READ, with each completion detection strategy, for small messages.]

D. Blast example, small messages, multiple buffers, multiple work requests per posting

The next study is identical to the previous, except instead of posting each work request as we process its previous completion, we place it into a list and post that list after processing all available completions. Comparing one-way times in Figure 3c with those in Figure 3a shows that times for 256 and 912 byte messages are unchanged, but for 64 byte and smaller messages they increase when using more than 2 buffers, and for more than 16 buffers they increase to that of 256-byte messages. For messages of 64 bytes or less with 4, 8 or 16 buffers, Figure 3d does not show the increase in throughput seen in Figure 3b. Perhaps the time needed to process a large number of completions before posting a single list of new work requests causes the adapter's work queue to empty. In all cases, posting multiple work requests produces less dependence on the number of buffers than does single posting of work requests.

E. Blast example, large messages, multiple buffers

We next examine the effects of the buffer count and message size on large messages, using RDMA WRITE without inline (since inline can be used only with small messages). We vary the buffer count from 1 to 128 for 1, 8, 32, and 64 kibibyte messages, and post multiple work requests per list. Figure 4a shows that 64 kibibyte messages have the lowest CPU utilization when using event notification, and, for all message sizes examined, using more than 4 buffers has little or no effect on CPU utilization. The one-way time, shown in Figure 4b, and throughput, shown in Figure 4c, both increase as message size increases. Also, for both one-way time and throughput, busy polling and event notification results converge given enough buffers (ranging from 2 or more buffers for 64KiB to 5 or more for 1KiB).

Next we study the effect of each opcode set for large message transfers. We vary the message size from 1 kibibyte to 256 mebibytes and use only 4 buffers, as it was just shown that using more buffers produces no performance gains. The CPU usage for the active and passive side of each transfer is shown in Figure 5a and Figure 5b, respectively. Active side CPU utilization generally decreases with message size, although there is a bump around 8KiB. Passive side CPU utilization is always 0 for RDMA WRITE and RDMA READ, but is similar to the active side for SEND/RECV and RDMA WRITE WITH IMM/RECV. One-way time, shown in Figure 5c, and throughput, shown in Figure 5d, are both best with busy polling up to about 16 kibibytes, at which point there is no difference between busy polling and event notification. At 32 kibibytes and above, the transfer operation also has no effect on performance.

[Figure 7 (plots not reproduced). Blast with each opcode set and each Infiniband speed (SDR, DDR, QDR) with busy polling for 64KiB messages using multiple buffers with multiple work requests per posting: (a) average throughput; (b) average one-way time. X-axis: buffer count, 1 to 16.]

[Figure 8 (plots not reproduced). Comparison of QDR Infiniband and RoCE for blast with busy polling: (a) average one-way time with each opcode set for small messages using one buffer; (b) average throughput with each opcode set for small messages using one buffer; (c) average one-way time with RDMA WRITE for large messages using multiple buffers with multiple work requests per posting; (d) average throughput with RDMA WRITE for large messages using multiple buffers with multiple work requests per posting.]

F. Completion signaling

All tests so far used full signaling. In this test we use blast with RDMA WRITE and 16 byte messages to compare full signaling, where every request is signaled, against periodic signaling, where only one work request out of every nbuffers/2 is signaled. Effects are visible only when using more than 2 buffers, since otherwise we signal every buffer. Both one-way time in Figure 6a and throughput in Figure 6b show much better performance when using one work request per post than when using multiple requests per post. But there is no performance difference between full and periodic signaling with one work request per post, and with multiple work requests per post the only effect of periodic signaling is to decrease performance when there are 4, 8 or 16 buffers. We believe this is due to the fact that in the blast example all buffers need processing after they are transferred, so not signaling for a completion just delays that processing until a signaled completion occurs, at which point all buffers transferred up to that time are processed in a big batch, breaking the flow of new transfer postings.

An example without this batching effect is ping with an active client using RDMA WRITE to push messages and RDMA READ to pull them back from a passive server. Every RDMA WRITE can be unsignaled because its buffer does not need any processing after the transfer. Figure 6c shows that one-way time is always lower when the RDMA WRITE is unsignaled, more so with event notification, less so with busy polling. This figure also shows remarkably little variation with message size.
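One way to express that ping client is to chain the unsignaled RDMA WRITE and the signaled RDMA READ into a single posting, as in the following sketch (write_wr and read_wr are assumed to already carry the proper local buffers, remote address, and rkey):

    #include <stddef.h>
    #include <infiniband/verbs.h>

    /* Sketch: one ping round trip.  The RDMA WRITE that pushes the message is
     * unsignaled (its buffer needs no post-processing); only the RDMA READ
     * that pulls the reply back generates a work completion. */
    static int ping_once(struct ibv_qp *qp, struct ibv_cq *cq,
                         struct ibv_send_wr *write_wr, struct ibv_send_wr *read_wr)
    {
        struct ibv_send_wr *bad;
        struct ibv_wc wc;

        write_wr->opcode     = IBV_WR_RDMA_WRITE;
        write_wr->send_flags = 0;                    /* unsignaled */
        write_wr->next       = read_wr;

        read_wr->opcode      = IBV_WR_RDMA_READ;
        read_wr->send_flags  = IBV_SEND_SIGNALED;    /* the only completion reported */
        read_wr->next        = NULL;

        if (ibv_post_send(qp, write_wr, &bad))
            return -1;

        while (ibv_poll_cq(cq, 1, &wc) == 0)         /* busy polling variant */
            ;
        return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
    }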
G. Infiniband speed comparison

All previous tests were done at QDR speed, the maximum supported by our adapter. However, Infiniband adapters can be configured to run at several speeds, as shown in Gigabits per second (Gbps) in Table I. Usable Gbps is 20% lower than raw Gbps due to 8b/10b encoding on serial lines. Figure 7a shows that throughput doubles from near the SDR maximum of 8 usable Gbps to near the DDR maximum of 16, but it does not double again from DDR to QDR. The observed 25.6 Gbps for QDR is about 20% lower than the expected 32 Gbps due to the overhead of the PCIe-2 bus, an impact also noted in [2] and [18]. PCIe-2 overhead also shows up in Figure 7b, where the observed DDR one-way time of 33 microseconds is half the 66 observed for SDR, but the observed QDR time of 21 microseconds is 20% greater than the expected 16.5.

    Designation    raw Gbps    usable Gbps    Gbps over PCIe-2
    IB 4X SDR        10            8                 8
    RoCE             12.5         10                10
    IB 4X DDR        20           16                16
    IB 4X QDR        40           32                25.6

    Table I. INFINIBAND AND ROCE SPEEDS.
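An application can check which width and speed a port is actually running at through ibv_query_port, as in this sketch (the decoding in the comments reflects the usual verbs encodings and is given here only as an aid, not as material from the paper):

    #include <stdio.h>
    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Sketch: report a port's active link width and speed.
     * Common encodings: active_width 1 = 1x, 2 = 4x, 4 = 8x, 8 = 12x;
     * active_speed 1 = 2.5 Gbps/lane (SDR), 2 = 5 Gbps/lane (DDR),
     * 4 = 10 Gbps/lane (QDR). */
    static int print_port_speed(struct ibv_context *ctx, uint8_t port_num)
    {
        struct ibv_port_attr attr;

        if (ibv_query_port(ctx, port_num, &attr))
            return -1;
        printf("port %u: active_width code %u, active_speed code %u\n",
               (unsigned) port_num, (unsigned) attr.active_width,
               (unsigned) attr.active_speed);
        return 0;
    }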

H. Infiniband and RoCE comparison

We next compare 10 Gbps RoCE with 25.6 Gbps QDR Infiniband. Looking first at small messages for each opcode set, Figure 8a shows one-way times for Infiniband are less than a microsecond lower than those for RoCE, and Figure 8b shows that although QDR Infiniband has much greater maximum throughput than RoCE, there is little observable difference for small messages.

For larger messages transferred with RDMA WRITE, the differences between Infiniband and RoCE are greater. One-way time and throughput for 1, 8, and 64 kibibyte messages are shown in Figure 8c and Figure 8d. For all large message sizes, one-way time is roughly proportional to message size and is essentially independent of the buffer count for both technologies, but as message size increases, one-way time for RoCE increases faster than for Infiniband. When using one buffer, the throughput increases with message size for both technologies, but as the buffer count increases, all of the RoCE curves converge near its maximum 10 Gbps, whereas for Infiniband, the 8 and 64 kibibyte curves converge near the higher QDR maximum of 25.6 Gbps and only the 1 kibibyte curve converges at 10 Gbps.

IV. CONCLUSIONS

In all situations performance is much more sensitive to the choice of RDMA options when using small messages than when using large messages.

For all 4 opcode sets with small messages up to 4 kibibytes, much lower one-way time and much higher throughput are achieved by using busy polling rather than event notification to wait for completions. For messages of 16 kibibytes and larger both busy polling and event notification produce the same one-way time and throughput. But busy polling also causes 100% CPU utilization for all message sizes, compared to about 20% with event notification for small messages up to 512 bytes and 0% for messages of 4 mebibytes or more.

Regardless of whether busy polling or event notification is employed, messages smaller than the cache line size give noticeably better one-way times for opcode sets that use inline, with a slight improvement in throughput but slightly higher CPU utilization with event notification. Although the amount of inline data allowed depends on the implementation, it always makes sense to use inline whenever adapters support it.

An application must use the opcode best for its needs. In situations such as ping, where both sides need to know when data arrives, the choice is limited to SEND/RECV and RDMA WRITE WITH IMM/RECV, both of which perform essentially equally. For each of these, busy polling and inline always give better one-way times and throughput, but higher CPU utilization.

For all message sizes the choices for completion detection and inline are more significant factors in determining performance than the opcode, which may be limited by application demands. For example, RDMA READ and RDMA WRITE result in no CPU utilization on the passive side, which may be important for passive side scalability. More often, both sides of a transfer need to know when it completes, in which case SEND/RECV or RDMA WRITE WITH IMM/RECV is best, and if messages are small enough to allow inline then the SEND or RDMA WRITE WITH IMM could also be unsignaled.

It is best to have between 3 and 8 transfers simultaneously queued on the adapter. With small messages this number should be closer to 8; for large messages closer to 3. Using more buffers gives no performance increase, so staying within these limits avoids consuming extra adapter resources. Our studies on buffer numbers do not consider the additional delay introduced when communicating nodes are separated by any significant distance. Clearly a longer communications channel can store more buffers in transit, so more simultaneously queued buffers would be necessary to keep the channel full, especially when small messages are being transmitted.

In general, rather than collecting work requests into lists it is better to post them individually as soon as possible and let the adapter queue them. This ensures that the connection is kept busy. A list might be used if several work requests must be created and sent together.

Completion signaling has a small performance impact in specialized circumstances. Full signaling should be used if there is any need to process a transfer's completion, but in a situation such as ping with RDMA WRITE followed by RDMA READ, performance improves if the RDMA WRITE is not signaled.

For small messages there is little performance difference between RoCE and Infiniband. For larger messages, QDR Infiniband's 25.6 Gbps outperforms RoCE's 10 Gbps. Therefore, if an application only transfers small messages or network equipment cost is a significant factor, then RoCE may be appropriate, as it runs over Ethernet wires and switches that may already be installed. Since the API is identical across technologies, an application could be written and tested on RoCE, then migrated to Infiniband. This means that RoCE may be good for initial RDMA programming or development, or in applications where high throughput is not necessary but the other benefits of RDMA are still desired.

Non-RDMA factors are also important. Platforms with PCIe-2 cannot fully utilize an Infiniband 4X QDR link, although fabric switches should be able to handle full traffic volume, even if each endpoint has limited throughput.

ACKNOWLEDGMENT

This research is supported in part by National Science Foundation grant OCI-1127228.

REFERENCES

[1] M. Koop, T. Jones, and D. Panda, "Reducing Connection Memory Requirements of MPI for InfiniBand Clusters: A Message Coalescing Approach," in Seventh IEEE International Symposium on Cluster Computing and the Grid, 2007.

[2] M. Koop, W. Huang, K. Gopalakrishan, and D. Panda, "Performance Analysis and Evaluation of PCI 2.0 and Quad-Data Rate InfiniBand," in Sixteenth IEEE Symposium on High Performance Interconnects, 2008.

[3] P. Lai, H. Subramoni, S. Narravula, A. Mamidala, and D. Panda, "Designing Efficient FTP Mechanisms for High Performance Data-Transfer over InfiniBand," in International Conference on Parallel Processing, 2009.

[4] H. Subramoni, P. Lai, R. Kettimuthu, and D. Panda, "High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand," in International Symposium on Cluster Computing and the Grid, 2010.

[5] B. Li, P. Zhang, Z. Huo, and D. Meng, "Early Experiences with Write-Write Design of NFS over RDMA," in IEEE International Conference on Networking, Architecture, and Storage, 2009.

[6] H. Subramoni, G. Marsh, S. Narravula, P. Lai, and D. Panda, "Design and Evaluation of Benchmarks for Financial Applications using Advanced Message Queueing Protocol (AMQP) over InfiniBand," in Workshop on High Performance Computational Finance, 2008.

[7] J. Wu, P. Wyckoff, and D. Panda, "PVFS over InfiniBand: Design and Performance Evaluation," Ohio Supercomputer Center, Tech. Rep. OSU-CISRC-5/03-TR25, 2003.

[8] Infiniband Trade Association, "Infiniband Architecture Specification Volume 1, Release 1.2.1," Nov. 2007.

[9] P. Grun, Introduction to InfiniBand for End Users. InfiniBand Trade Association, 2010.

[10] P. Culley, U. Elzur, R. Recio, S. Bailey, and J. Carrier, "Marker PDU Aligned Framing for TCP Specification," RFC 5044, Oct. 2007. [Online]. Available: http://www.ietf.org/rfc/rfc5044.txt

[11] H. Shah, J. Pinkerton, R. Recio, and P. Culley, "Direct Data Placement over Reliable Transports," RFC 5041, Oct. 2007. [Online]. Available: http://www.ietf.org/rfc/rfc5041.txt

[12] R. Recio, B. Metzler, P. Culley, J. Hilland, and D. Garcia, "A Remote Direct Memory Access Protocol Specification," RFC 5040, Oct. 2007. [Online]. Available: http://www.ietf.org/rfc/rfc5040.txt

[13] Infiniband Trade Association, "Supplement to Infiniband Architecture Specification Volume 1, Release 1.2.1: Annex A16: RDMA over Converged Ethernet (RoCE)," Apr. 2010.

[14] B. Metzler, F. Neeser, and P. Frey, "Softiwarp: A Software iWARP Driver for OpenFabrics," www.openfabrics.org/archives/spring2009sonoma/monday/softiwrp.pdf, 2009.

[15] System Fabric Works, "Soft RoCE," www.systemfabricworks.com/downloads/roce, 2011.

[16] OpenFabrics Alliance, http://www.openfabrics.org.

[17] OpenFabrics Enterprise Distribution, www.mellanox.com/pdf/products/software/OFED PB 1.pdf, 2008.

[18] National Instruments, "PCI Express – An Overview of the PCI Express Standard," http://zone.ni.com/devzone/cda/tut/p/id/3767, 2009.

