IP Fabric for Storage Networks

Questions, Misconceptions, and Recommendations

Paresh Gupta
Technical Leader,
Technical Marketing Engineering, Cisco
BRKDCN-2945

Cisco Webex App
https://ciscolive.ciscoevents.com/ciscolivebot/#BRKDCN-2945

Questions?
Use the Cisco Webex App to chat with the speaker after the session

How
1. Find this session in the Cisco Live Mobile App
2. Click “Join the Discussion”
3. Install the Webex App or go directly to the Webex space
4. Enter messages/questions in the Webex space

Webex spaces will be moderated by the speaker until June 7, 2024.
Agenda

• TCP Storage Networks (iSCSI and NVMe/TCP)


• Lossless Ethernet Networks (RoCEv2)
• Congestion Detection and Prevention
• Nexus 9000 and Nexus Dashboard Insights for storage traffic
• Questions, Misconceptions, and Recommendations

Type of Networks in a Data Center
By Framing and Encoding

OSI Model (Ethernet)                  Fibre Channel Levels (SCSI and NVMe)   InfiniBand Layers
Layers 5-7 (Session, Presentation,    FC-4 Upper Layer Mapping               Upper Layers (RDMA Verbs)
  Application)
Layer 4 Transport                     FC-3 Services                          Transport
Layer 3 Network                       FC-2 Framing and Signaling             Network
Layer 2 Data Link                     FC-1 Encode/Decode                     Link
Layer 1 Physical                      FC-0 Physical                          Physical

Flow control: Ethernet uses optional Priority-based Flow Control (PFC), Pause frames, etc.; Fibre Channel uses B2B flow control (R_RDY, credits, etc.); InfiniBand uses credit-based flow control.
Crossing the Boundaries of Network Types
Via TCP/IP

OSI Model (Ethernet)                  Fibre Channel Levels (SCSI and NVMe)
Layers 5-7: iSCSI and NVMe/TCP        FC-4 Upper Layer Mapping
Layer 4 Transport (TCP)               FC-3 Services
Layer 3 Network (IP)                  FC-2 Framing and Signaling
Layer 2 Data Link                     FC-1 Encode/Decode
Layer 1 Physical                      FC-0 Physical

Flow control: Ethernet uses optional Priority-based Flow Control (PFC), Pause frames, etc.
iSCSI and NVMe/TCP
• Internet SCSI (iSCSI) carries SCSI commands over TCP to remote block storage
• NVMe/TCP carries NVMe commands over TCP to remote block storage
• Both are OSI layer 5 (session layer) protocols

iSCSI packet:
Ethernet Header | IP Header | TCP Header | iSCSI Header | SCSI Header | Payload | CRC/FCS

NVMe/TCP packet:
Ethernet Header | IP Header | TCP Header | NVMe/TCP Header | NVMe Header | Payload | CRC/FCS
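To see this encapsulation on the wire, a capture filtered on the well-known TCP ports is usually enough. A minimal sketch, assuming a Linux end device and an example interface name eth0 (3260 and 4420 are the iSCSI and NVMe/TCP ports used later in this session):

[root@localhost ~]# tcpdump -i eth0 -nn 'tcp port 3260 or tcp port 4420'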
RDMA-Capable Protocols
RoCE and RoCEv2 are names, not headers; iWARP has a dedicated header

• iWARP: RDMA Verbs over TCP on Ethernet, inheriting TCP’s reliable and in-order delivery, flow control, and congestion control
• RoCEv2: RDMA Verbs over UDP/IP; IP-routing capable, with ECN for congestion prevention and Ethernet QoS using IP DSCP
• RoCE: runs directly over Ethernet at layer 2; “RoCE” really means RoCEv2 these days
• InfiniBand layers, for comparison: Upper Layers (RDMA Verbs), Transport, Network, Link, Physical
RoCE and RoCEv2 Frame Formats

RoCEv2 frame:
Ethernet Header | 802.1Q VLAN Header | IP Header | UDP Header | IB BTH+ (L4 Hdr) | IB Payload | ICRC | CRC/FCS
• PCP/CoS field (3 bits) in the 802.1Q VLAN header; DSCP (6 bits) and ECN (2 bits) in the IPv4 and IPv6 header
• EtherType 0x0800 for IPv4, 0x86DD for IPv6
• IP protocol 17 indicates the next header is UDP
• UDP destination port 4791 indicates the next header is IB, i.e., RoCEv2
• ICRC (IP to IB payload) is checked between end devices; CRC/FCS (Ethernet to ICRC) is checked hop by hop

RoCE frame (EtherType 0x8915):
Ethernet Header | IB Global Routing Header (GRH) | IB Transport Header (BTH+, L4 Hdr) | IB Payload | ICRC | CRC/FCS
Types of Storage, Protocols, Transports, & Networks

Question

Dedicated or Shared Network for Storage Traffic?

Recommendation

Dedicated network for storage traffic


Dedicated or Shared Network for Storage Traffic?

Dedicated storage network (storage and non-storage traffic on separate networks)
Prefer dedicated networks for storage traffic because of the following benefits:
• No traffic contention between storage and non-storage traffic
• Easier detection, troubleshooting, and prevention of congestion and other issues
• Easier change management

Shared storage network (storage + non-storage traffic on one network)
Benefits of a shared storage network:
• Affordable
Block Storage Protocols – SCSI and NVMe

SCSI (Small Computer System Interface)
• A way of accessing computer peripherals
• Besides HW specifications, SCSI also defines a software layer for block-level access to storage devices
• Hundreds of commands; the Read and Write commands are the important ones
• Can be transported via network (e.g., Fibre Channel Protocol, iSCSI)

NVMe (Non-Volatile Memory Express)
• A relatively newer way of block-level access to storage devices
• Designed from the ground up to unleash the full potential of non-volatile memory with high parallelism
• Lean; only a handful of commands. The Read/Write commands are the important ones
• Can be transported via network using NVMe over Fabrics (NVMe-oF) (e.g., Fibre Channel Protocol, NVMe/TCP)
Misconception
NVMe-oF is super fast because it has 64K queues, each with 64K commands

Reality
Be aware of a technology’s specification versus its implementation on commercially available devices
Question
I upgraded to NVMe-oF, but the network throughput did not increase. Why?

Answer
That may be expected, or perhaps you are not measuring at a fine enough granularity
Does NVMe-oF Increase Network Throughput?
• Neither NVMe, NVMe-oF, nor the network itself generates traffic
• If the applications remain the same, their throughput also remains the same, regardless of how they access their storage
• NVMe and NVMe over Fabrics may result in increased network throughput only if the earlier stack (probably SCSI based) was a bottleneck
• Also, consider how you measure the network throughput
  • 5-minute granularity misses traffic bursts at a microsecond scale
  • NVMe-oF can transfer the same data in a much shorter duration (e.g., 100 ms instead of 1 sec) and may also lead to overutilization of the links
  • But the “reported throughput” at 5-minute granularity doesn’t catch this
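To make the granularity point concrete: a link that bursts at its full 100 Gbps rate for 100 ms and stays idle for the rest of a 5-minute sample carries 10 Gb (1.25 GB) in that burst. Averaged over 300 seconds, that reports as 10 Gb / 300 s ≈ 33 Mbps, roughly 0.03% “reported utilization” on a link that was momentarily 100% utilized.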
Question
Congestion in the network vanished after upgrading to NVMe-oF. Is that possible?

Answer
Possible, but not because of NVMe-oF

NVMe-oF and Network Congestion
• NVMe-oF is typically supported on newer devices
• Using high-capacity hosts, faster storage arrays, and, most importantly, higher network speeds may eliminate many reasons for congestion, although this success can’t be attributed to NVMe-oF itself
Question
What effects does NVMe-oF have on network congestion?

Answer
Network congestion may remain the same or increase with NVMe-oF

NVMe-oF and Network Congestion
• Network congestion may remain the same in environments where applications were already getting the maximum storage performance and were not pushing the limits of the storage networks and the various layers within the end devices.
• In contrast, network congestion may increase in environments where the storage devices can read data quickly and send the response at the maximum rate of their links. This leads to increased network utilization, resulting in congestion due to overutilization. Such conditions have been commonly reported over the years with all-flash arrays that did not use NVMe. As the usage of NVMe storage arrays increases, the instances and severity of congestion issues may increase further.
Question
Do NVMe and NVMe-oF have mechanisms to control network congestion?

Answer
NVMe and NVMe-oF rely on the network to control network congestion

NVMe and NVMe-oF Flow Control is End-to-End
• NVMe defines a flow-control mechanism, but its purpose is to prevent the submission and completion queues from becoming full on the host and the controller. This is different from network flow control and congestion control.
• Moreover, NVMe flow control is optional with NVMe over Fabrics, which means products in your environment may not even support or enable these mechanisms.
• NVMe relies on the network to handle network congestion
IP Storage Network Design
Spine-Leaf Design

(Figure: spine switches form a 400 GbE core; leaf switches provide a 100 GbE edge; hosts (initiators) and storage arrays (targets) attach to the leafs. The traffic path shown between hosts and storage arrays is a conceptual and over-simplified representation; it doesn’t show ISL load balancing.)

Recommendation
Design the storage network with minimal over-subscription and an ECMP core (100 GbE edge, 400 GbE core)

Recommendation
Operate host and storage ports at the same speed (avoid speed mismatch)

Recommendation
Start small; multiple smaller fabrics are a valid alternative

Quality of Service (QoS)

• Classification: identify and mark traffic using Ethernet CoS and IP DSCP
• Queuing: assign traffic to queues (no-drop or normal queue)
• Queue management: drop or ECN-mark packets within a queue (WRED, AFD, etc.)
• Scheduling: schedule packets from multiple queues to the egress port (DWRR, etc.)

Example classes:
Class 1: Non-storage : Lossy    : Priority
Class 2: Non-storage : Lossy    : Business critical
Class 3: Storage     : Lossless : FCoE
Class 4: Storage     : Lossy    : iSCSI & NVMe/TCP
Class 5: Storage     : Lossless : RoCE
Class 6: Non-storage : Lossy    : Management
Class 7: Non-storage : Lossy    : Bulk
Class 8: Non-storage : Lossy    : Best effort
Misconception
QoS configuration is needed only in shared storage networks

Reality
QoS configuration is mandatory in shared as well as dedicated storage networks
Question

Which is better for storage traffic: policing or shaping?

Recommendation

Neither
Policing or Shaping Storage Traffic
• Both policing and shaping limit traffic rates, although they do so slightly differently
• Policing drops excess traffic, which results in retransmission of the lost packets or re-initiation of the entire operation, depending on the upper layers
• Shaping buffers a limited amount of excess traffic, which increases delay and delay variance, and still drops traffic beyond that amount
• Both policing and shaping lead to poor performance for storage traffic
• The aim should be to have enough network capacity that neither policing nor shaping is needed
Misconception
Storage traffic should be prioritized

Reality
Only in generic terms; avoid the QoS priority command

QoS Priority
Packets from a priority queue are transmitted before packets from other queues. However, the priority queue is also limited in bandwidth. This bandwidth limit may or may not be apt for storage traffic, based on the application requirements.
Misconception
Assigning bandwidth to storage traffic limits its throughput to the configured value

Reality
QoS bandwidth assignment is a lower limit, not an upper limit
Misconception
Congestion in a traffic class does not affect traffic in other classes

Reality
It depends on how a problem is approached

(Figure: with no other traffic, the storage traffic class can use 100% of the link; when other classes carry traffic, the storage class is held to its guaranteed share.)

• The QoS “bandwidth” command provides a minimum guarantee
• Storage traffic can consume 100% capacity when other classes don’t have traffic
• Storage traffic is limited to its guaranteed BW (e.g., 50%) when other classes have traffic
• Therefore, traffic in other classes may affect the storage/no-drop traffic class
• Application view: earlier, the app received 800 MBps throughput. Now, the app receives only 500 MBps and response time is poor. Why? Because other traffic classes have more traffic now, so the no-drop/storage traffic class is limited to its guaranteed BW
I/O Operations,
Traffic Patterns, and
Network Congestion
MTU and Maximum Segment Size (MSS)

Ethernet Header | IP Header | TCP Header | iSCSI or NVMe/TCP Header | SCSI or NVMe Header | Payload | CRC/FCS
(Ethernet MTU covers the IP packet; IP MTU covers the TCP segment; TCP MSS covers everything above the TCP header)

Recommendation
Enable jumbo frames on end devices and switches. Verify TCP MSS.
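A minimal sketch of this recommendation in practice, assuming a Nexus 9000 switchport, a Linux end device, and example interface names and target IP; 9216 bytes on the switch and 9000 bytes on the host are common jumbo-frame values, not mandates:

switch(config)# interface ethernet 1/5
switch(config-if)# mtu 9216

[root@localhost ~]# ip link set dev eth0 mtu 9000
[root@localhost ~]# ping -M do -s 8972 192.0.2.10      (verify the 9000-byte path: 8972 + 28 header bytes, fragmentation disallowed)
[root@localhost ~]# ss -ti '( dport = :3260 )'          (check the negotiated mss: value on the iSCSI connections)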
iSCSI Read I/O Operation (TCP port 3260)
Path: Host/Initiator – Leaf-1 – Spine-1 – Leaf-2 – Target

1. Host → Target: initiate read I/O
2. Target → Host: transfer data
3. Target → Host: transfer data
4. Target → Host: response
Storage Performance vs Network Performance
Path: Host/Initiator – Leaf-1 – Spine-1 – Leaf-2 – Target (iSCSI read over TCP port 3260)

• The network measures throughput (Mbps)
• Storage measures each I/O operation: I/O completion time (from initiating the read I/O to receiving the response), I/O size, throughput (MBps), and IOPS
iSCSI Write I/O Operation (TCP port 3260)
Path: Host/Initiator – Leaf-1 – Spine-1 – Leaf-2 – Target

1. Host → Target: initiate write I/O
2. Target → Host: Ready To Transfer
3. Host → Target: transfer data
4. Host → Target: transfer data
5. Target → Host: response
NVMe/TCP Read I/O Operation (TCP port 4420)
Path: Host – Leaf-1 – Spine-1 – Leaf-2 – Target/Controller

1. Host → Target: initiate read I/O
2. Target → Host: transfer data
3. Target → Host: transfer data
4. Target → Host: response

Network handling is the same for iSCSI and NVMe/TCP: they co-exist, with similar packet sizes and directions
NVMe/TCP Write I/O Operation (TCP port 4420)
Path: Host – Leaf-1 – Spine-1 – Leaf-2 – Target/Controller

1. Host → Target: initiate write I/O
2. Target → Host: Ready To Transfer
3. Host → Target: transfer data
4. Host → Target: transfer data
5. Target → Host: response
NVMe/RDMA Read I/O Operation
Path: Host – Leaf-1 – Spine-1 – Leaf-2 – Target/Controller

1. Host → Target: initiate read I/O
2. Target → Host: transfer data
3. Target → Host: transfer data
4. Target → Host: response
NVMe/RDMA Write I/O Operation
Path: Host – Leaf-1 – Spine-1 – Leaf-2 – Target/Controller

1. Host → Target: initiate write I/O
2. Host → Target: transfer data
3. Host → Target: transfer data
4. Target → Host: response
SCSI, NVMe, and RDMA
are different ways
of transferring the
same data.

Misconception
Flow-based load balancing leads to uniform network utilization for storage traffic

Reality
Not always. Network load balancing considers up to the TCP/UDP layers, not I/O flows and I/O operations
Data Plane of an IP Storage Network

I/O operations (short-lived) make up an I/O flow: all I/O operations to a volume (Logical Unit (LU) or Namespace; long-lived). Multiple I/O flows map onto a TCP or UDP flow (source IP, destination IP, layer-4 protocol, source port, destination port; long-lived), and multiple TCP/UDP flows map onto a link.
Load Balancing and Multipathing
• Load balancing (ECMP) is based on OSI layer-4 flows (source IP, destination IP, layer-4 protocol (UDP or TCP), layer-4 source port, and layer-4 destination port)
• As a result, all iSCSI or NVMe/TCP traffic in a TCP flow, and all RoCEv2 traffic in a UDP flow, takes the same network link
• This scheme is more likely to cause non-uniform link utilization as compared to the I/O operation (exchange-based) load-balancing scheme of Fibre Channel
• Multipathing I/O software on end devices spreads I/O operations across paths

Recommendation
Use multipathing I/O (MPIO) on end devices (ECMP in the network, MPIO on the end devices)
Load Balancing (LB) with iSCSI and NVMe/TCP

iSCSI
• The number of TCP connections for iSCSI depends on the MaxConnections parameter (default = 1)
• Should be increased based on the capability of the iSCSI initiator and target

NVMe/TCP
• NVMe/TCP, by default, creates one TCP connection per I/O Submission and Completion Queue pair
• Leads to better LB on network links
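As an illustration on a Linux initiator: Open-iSCSI spreads I/O over multiple TCP connections by creating multiple sessions, controlled by its node.session.nr_sessions parameter (the IQN and the value 4 below are examples; verify what your initiator and target support):

[root@localhost ~]# iscsiadm -m node -T iqn.2024-01.com.example:target1 -o update -n node.session.nr_sessions -v 4
[root@localhost ~]# iscsiadm -m node -T iqn.2024-01.com.example:target1 --login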
Misconception
Packet drops are acceptable in IP/Ethernet storage networks because other layers retransmit the lost packet

Reality
Packet drops in storage networks lead to significant performance degradation
Result of Packet Drops in a Storage Network
• An I/O operation needs to transfer many packets
• An I/O operation is not complete until all packets are received
• Loss of even one packet results in:
  • Re-initiating the entire I/O operation (typical with SCSI and NVMe)
  • Re-initiating the entire sequence (typical with RDMA). Two schemes:
    • Go-to-0
    • Go-to-N (the Nth packet was lost)
Latency and Tail Latency
• TCP (used by iSCSI, NVMe/TCP) retransmits just the lost packet, but this retransmission increases (degrades) I/O completion time
• Applications degrade due to tail latency: the latency (I/O completion time) of the slowest 1% of I/O operations

(Figure: histogram of number of I/O operations vs I/O completion time — most I/O operations complete quickly; 1% of I/O operations are delayed.)
Network Throughput That Lacks Application Goodput
• e.g., a transactional app must complete 1000 I/O operations to move to the next step. Even if one I/O is delayed due to one packet loss, the app is entirely stuck (zero goodput)
• 1000 I/O operations per second (IOPS) with an I/O size of 1 MB, where each I/O operation results in 1000 packets carrying 1 KB, is enough to fill up a 10 Gbps link (100% network throughput)
• Meanwhile, the app hits an app-level timeout while waiting for the 1000th I/O to complete

“I tried so hard and got so far / But in the end, it doesn't even matter / I had to fall to lose it all…”
Misconception
There are no CRC/FCS errors, so there are no bit errors

Reality
Not all bit errors are reported by CRC/FCS counters

Misconception
Not treating bit errors with enough severity

Reality
Even a few bit errors in storage networks can significantly degrade performance
Bit Errors are More than CRC
• CRC/FCS counters increment only when a bit error lands within a frame
• Conditions when bit errors exist but CRC/FCS counters don’t increment:
  • Not much traffic on a link
  • Bit errors fall outside a frame boundary (like the inter-frame gap)
• Monitor Forward Error Correction (FEC) counters to detect all bit errors
• CRC errors can be predicted using invalid words and FEC

Recommendation
Monitor CRC/FCS as well as FEC counters to detect bit errors
Bit Errors have a Significant Effect in Storage Networks
• A CRC-corrupted frame is dropped, leading to the same effects explained earlier
• Bit errors may also interfere with flow control, causing congestion or worsening existing congestion (Pause frame corruption)

Recommendation
Resolve bit errors at top priority in storage networks
TCP Storage
Networks
for iSCSI and
NVMe/TCP
Misconception
TCP works well for storage traffic because it has built-in reliable delivery, flow control, congestion control, etc.

Reality
The entire reason that TCP invokes these mechanisms is that something unexpected happened
iSCSI & NVMe/TCP Rely on Lower Layers for Handling

• Bit errors: corrupted packets are dropped, then retransmitted by TCP after a timeout or 3 duplicate ACKs (fast retransmission)
• In-order delivery: the TCP receiver rearranges packets before sending the byte stream to upper layers, using the 32-bit sequence number in the TCP header
• Network performance: ECMP load balancing based on TCP flows for uniform link utilization
• Congestion in an end device: the receiver reduces the window in the TCP header (flow control)
• Congestion in the network: TCP congestion control (different from flow control)
Window in TCP Header – Used for Flow Control

TCP header (20 bytes): Source Port | Destination Port | Sequence Number | Acknowledgement Number (if ACK set) | Data Offset, Reserved, flags (CWR, ECE, URG, ACK, PSH, RST, SYN, FIN) | Window Size | Checksum | Urgent Pointer (if URG set)

The 16-bit Window Size is advertised by the receiver to the sender to pace the sender’s rate (aka receiver-window)
TCP Congestion Control

• Exponential slow start, then linear congestion-window increase (+1 per ACK) during congestion avoidance (additive increase)
• The sender increases its transmission rate until network congestion results in packet loss
• The congestion window is halved on detecting packet loss (multiplicative decrease)
• Maximum throughput is achieved as per the receiver-window

(Figure: packets sent per round trip vs # of round trip times (RTT), showing exponential slow start, loss events, and the resulting sawtooth.)
Congestion in TCP Storage Networks (TCP Incast)
Host-1 initiates read I/O (small-size packets) to many targets; data from the targets (large-size packets) arrives at the same time.

Cause 1 – Link over-utilization: the Host-1-connected switchport on Leaf-6 drops or marks the packets, which results in TCP congestion control

Cause 2 – Slow host: Host-1 is unable to process traffic at the ingress rate, which results in TCP flow control
Congestion in TCP Storage Networks
Cause 1 – Link over-utilization: the Host-1-connected switchport on Leaf-6 drops or marks the packets, which results in TCP congestion control

Recommendation
Limit the number of storage ports talking to a host port, and vice-versa

Recommendation
Operate host ports and storage ports at the same speed
Cause 2 – Slow Host (TCP Flow Control)

Recommendation
Proactively monitor TCP performance on end devices (retransmissions, changes in window, unexpected MSS, unexpected RTO, etc.)

[root@localhost ~]# ss -e
ESTAB 0 0 172.22.163.112:60185 172.22.163.208:4420
timer:(keepalive,060ms,0) uid:983 ino:215051683 sk:ffff881fc1718780 <->
ts sack cubic wscale:7,7 rto:202 rtt:1.527/2.26 ato:40 mss:1448 cwnd:6
ssthresh:5 send 45.5Mbps lastsnd:4981 lastrcv:4962 lastack:4981
pacing_rate 91.0Mbps retrans:0/8724 rcv_rtt:116314 rcv_space:29760
TCP Congestion Control – Sawtooth Pattern
(Figure: packets sent per round trip vs time — a steady line is what an application using iSCSI or NVMe/TCP expects; a sawtooth is what the application gets from the network under congestion. Make changes to achieve the steady line.)

Recommendation
Proactively detect network congestion and solve its root cause, e.g., by increasing link speed, adding more links, etc.
TCP Congestion Control

Approach 1: drop packets and retransmit the lost packets
Approach 2: inform end devices before dropping the packets

Both approaches result in a reduced transmission rate on the sender, leading to:
• Network congestion control
• Performance degradation on the end devices
Congestion Prevention By Notifying the End Devices
1. Congestion detection: the switch detects congestion.
2. Sending a notification: the switch informs the end devices.
3. Receiving a notification: the end devices understand the notification.
4. Congestion prevention action: the end devices take preventive actions, such as reducing their rate.

A similar mechanism exists for RoCEv2 Congestion Management (RCM) and Fibre Channel, with different degrees of adoption and readiness.
Explicit Congestion Notification in TCP/IP Networks

Ethernet frame format (64-1522 bytes, or up to 9216 bytes when jumbo frames are enabled):
Destination MAC (6) | Source MAC (6) | 802.1Q VLAN Header (4) | EtherType (2) | IP Header | TCP Header | Payload (42-1500 or 9194) | CRC/FCS (4)

IPv4 or IPv6 header: DSCP (6 bits) and ECN (2 bits)
• ECN-capable Transport (ECT): 01 or 10
• Congestion Experienced (CE): 11

TCP header: CWR (Congestion Window Reduced) and ECE (ECN-Echo) flags, alongside Source Port, Destination Port, Sequence Number, Acknowledgement Number, Window Size, Checksum, and Urgent Pointer
ECN in TCP/IP Networks
Topology: Target-1 (source) – Leaf-1 – Spine-1 – Leaf-6 – Host-1 (destination)

When Target-1 and Host-1 establish a TCP connection (layer 4), they also exchange their willingness to use ECN. ECN is used ONLY if they agree on using it.
ECN in TCP/IP Networks – Step by Step
Traffic flows from Target-1 (source) through Leaf-1, Spine-1, and Leaf-6 to Host-1 (destination).

1. The source marks ECT in the egress IP header.
2. A switch detects egress congestion: congestion results in excessive utilization of the egress queue.
3. The switch marks CE in the IP header. Marking is enabled by AQM (WRED, AFD); only ECT-enabled packets are marked with CE.
4. The destination detects CE in the ingress IP header, and then marks ECE in the TCP header of the next ACK to the source.
5. The source detects ECE in the ingress TCP header and reduces its transmission rate as if the packet had been dropped.
6. The source marks the CWR flag in the next packet.
7. The destination detects CWR in the ingress TCP header and stops marking ECE in the egress TCP header.
Congestion Prevention By Notifying the End Devices
ECN in TCP/IP networks, in plain terms:
1. Network: “Hey Destination, congestion is detected in the network.”
2. Destination: “Okay, I will inform the Source.”
3. Source: “Okay, I will adjust my transmission rate.”
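One way to watch this exchange from a Linux end device: the ECN field is the low two bits of the IP TOS byte and ECE is bit 0x40 of the TCP flags byte, so plain tcpdump filters can isolate the marked packets (the interface name is an example):

[root@localhost ~]# tcpdump -i eth0 -nn 'ip[1] & 0x3 == 0x3'      (packets arriving with CE = 11)
[root@localhost ~]# tcpdump -i eth0 -nn 'tcp[13] & 0x40 != 0'     (ACKs carrying the ECE flag)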
Misconception
TCP ECN prevents packet drops

Reality
TCP ECN doesn’t guarantee zero packet drops

TCP ECN – No Guarantee of No Packet Drops
1. The network detects congestion and informs the destination
2. The destination informs the source(s)
3. The source(s) reduce their transmission rates

Delayed action: there is a delay between ECN marking and observing the reduced rate on the congested port. Packets may be dropped during this time.
“CE packets indicate persistent rather than transient congestion” – RFC 3168
ECN Considerations for Storage Traffic
1. Synchronization: potential overreaction or under-reaction by multiple senders. A random AQM for marking packets should avoid this situation.
2. Delayed action: delay between ECN marking and observing the reduced rate on the congested port. Packets may be dropped during this time. “CE packets indicate persistent rather than transient congestion” – RFC 3168.

Recommendation
Mark ECN early so that the rate reduction is observed before queues become full

3. Packet drops: ECN does not guarantee that packets are not dropped.
4. Network boundary: for an end device, anything below the TCP layer is “the network”; for the network team, the network boundary starts at the switchport.
5. Configuration: not all end devices may support ECN or enable it by default.

Recommendation
Verify and enable ECN on end devices
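On Linux end devices, for example, TCP ECN negotiation is controlled by a single sysctl; many distributions default to 2 (accept ECN when the peer requests it, but don’t request it), while 1 both requests and accepts it:

[root@localhost ~]# sysctl net.ipv4.tcp_ecn
net.ipv4.tcp_ecn = 2
[root@localhost ~]# sysctl -w net.ipv4.tcp_ecn=1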
Why Switch Buffer Management?
A TCP sender reduces its transmission rate:
1. When the sender detects packet loss, because of:
   a. Retransmission timeout (RTO), or
   b. Receiving a third duplicate ACK (fast retransmission)
2. When the sender receives an ECE-marked TCP packet.
Both events originate at a network port when its queues fill up. Therefore, dropping or marking schemes for packets that are waiting in a queue can significantly influence TCP’s behavior on the end devices. These schemes are called Active Queue Management (AQM):
• Weighted Random Early Detection (WRED)
• Approximate Fair Dropping (AFD)
• Dynamic Packet Prioritization (DPP)
Network Congestion Detection – Queue Utilization
(Figure: on Leaf-6, ingress traffic arrives from Spine-1 through Spine-4 on Eth1/1-Eth1/4; egress traffic leaves toward Host-1 on Eth1/5. Congestion shows up in the queue assigned to storage traffic on the egress port.)
Queue Utilization States

Queue utilization            Empty     Low     High    Full
Transmission                 No        Yes     Yes     Yes
Queueing delay               None      Low     High    Highest
Packet drops                 None      None    None    Yes
Headroom for traffic burst   Highest   High    Low     None

Empty is the most desirable state, Low is the next best option, High is not desirable, and Full is a problem.

Recommendation
Under-utilized (60%) links are better than highly utilized (90%) links in storage networks

Recommendation
Over-provision capacity in storage networks
ECN Recommendations
Against the same queue utilization states:

Recommendation
Mark ECN early so that the rate reduction is observed before queues become full

Recommendation
Verify and enable ECN on end devices
WRED Configuration on Nexus 9000

policy-map type queuing OUTPUT_Q
  class type queuing c-out-8q-q3
    random-detect minimum-threshold 100 kbytes maximum-threshold 400 kbytes drop-probability 50 weight 0 ecn

The trailing ecn keyword makes WRED mark packets instead of dropping them.
How WRED Works
• Below the minimum threshold, no packets are detected
• Between the minimum and maximum thresholds, packets are detected randomly: as average queue utilization grows, more packets are detected, up to the configured probability (the probability denominator)
• Above the maximum threshold, all packets are detected
• When the queue is full, new incoming packets are tail dropped
WRED is class-aware.
All storage traffic in a class is treated equally; there is no per-flow detection.

(Figure: Tier-1 RoCEv2, Tier-2 NVMe/TCP, and Tier-3 iSCSI flows all share one storage traffic class on a link, alongside another class for other traffic.)

Solution
Use AFD and DPP on Nexus 9000 switches
AFD
AFD keeps headroom for mice flows while maintaining the desired queue length: the non-empty queue achieves full utilization of the link capacity, and elephant flows are dropped or ECN-marked as per configuration. The headroom preserves buffers for small-size I/O operations, typically used by transactional workloads, and for non-I/O SCSI/NVMe commands.

Elephant flows are identified using the etrap parameters:
switch(config)# hardware qos etrap byte-count 1048555
switch(config)# hardware qos etrap bandwidth-threshold 500 bytes
switch(config)# hardware qos etrap age-period 50 usec
DPP (max-num-pkts = N)
• The first N packets of a flow go through a prioritized mice-flow queue, meant for small-size I/O operations, typically used by transactional workloads, and for non-I/O SCSI/NVMe commands
• Packets beyond the Nth go through the regular queue for the class
• Long-lived large flows, typically used by backup applications, are therefore automatically moved to the regular queue

switch(config)# hardware qos dynamic-packet-prioritization max-num-pkts 120
switch(config)# hardware qos dynamic-packet-prioritization age-period 5000 usec
Detecting Congestion in TCP Storage Networks
• Packet Drops
• ECN Counters
• Link Utilization
• Queue-depth monitoring and Microburst Detection

Recommendation

Monitor per-interface and per-class counters

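On Nexus 9000, these counters come from the commands used throughout this session (the interface number is an example):

switch# show interface ethernet 1/15                           (packet drops, CRC/FCS, pause counters)
switch# show queuing interface ethernet 1/15                   (per-QoS-group Tx, ECN, and queue-depth counters)
switch# show queuing burst-detect detail                       (microburst events)
switch# show policy-map interface ethernet 1/15 output detail  (per-class configuration and counters)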
Queue-depth Monitoring and Microburst Detection
• Queue full: detected via packet drops
• Queue utilization above the ECN threshold: detected via ECN counters
• Queue depth: current queue usage
• Queue utilization above the microburst threshold: detected via microburst events
• Non-empty queue: detected via Tx packet counters
Microburst Detection on Nexus 9000 Switches
Detects when queue utilization exceeds a threshold for a few microseconds

switch# show queuing burst-detect detail
slot 1
=======
Microburst Statistics
Flags: E - Early start record, U - Unicast, M - Multicast
Ethernet  | Queue | Start Depth | Start Time                 | Peak Depth | Peak Time                  | End Depth | End Time                   | Duration |
Interface |       | (bytes)     |                            | (bytes)    |                            | (bytes)   |                            |          |
Eth1/15   | U5    | 29840       | 2023/03/22 11:55:34:993672 | 5390944    | 2023/03/22 11:55:35:013740 | 29952     | 2023/03/22 11:59:16:153676 | 221.16 s |
Eth1/15   | U5    | 29840       | 2023/03/22 12:17:17:653207 | 7497568    | 2023/03/22 12:17:17:682402 | 29952     | 2023/03/22 12:26:13:771249 | 536.12 s |

Occasional microburst events should be fine. However, investigate ports that:
• Continuously report microburst events,
• Show a higher number of microbursts than other ports, or
• Show an increasing trend of microburst events
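The rise/fall thresholds are configured per queuing class; a minimal sketch matching the 30000-byte values in the show policy-map output later in this session (the policy and class names follow the earlier WRED example):

switch(config)# policy-map type queuing OUTPUT_Q
switch(config-pmap-que)# class type queuing c-out-8q-q5
switch(config-pmap-c-que)# burst-detect rise-threshold 30000 bytes fall-threshold 30000 bytes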
Packet Drops, ECN Counters, and Queue-Depth

switch# show interface ethernet 1/8
Ethernet1/8 is up
<snip>
RX
<snip>
895511132995 input packets 1572386629130153 bytes
732342060958 jumbo packets 0 storm suppression bytes
0 runts 0 giants 20 CRC 0 no buffer
0 input error 0 short frame 0 overrun 0 underrun 0 ignored
0 watchdog 0 bad etype drop 0 bad proto drop 0 if down drop
0 input with dribble 0 input discard
0 Rx pause
0 Stomped CRC
TX
<snip>
874664925716 output packets 1418828496876292 bytes
655056929813 jumbo packets
0 output errors 0 collision 0 deferred 0 late collision
0 lost carrier 0 no carrier 0 babble 0 output discard
0 Tx pause
<snip>

switch# show queuing interface ethernet 1/15
<snip>
+-------------------------------------------------------------+
|                         QOS GROUP 5                         |
+-------------------------------------------------------------+
|                            |     Unicast     |  Multicast   |
+-------------------------------------------------------------+
| Tx Pkts                    |        16368604 |            0 |
| Tx Byts                    |     46733248300 |            0 |
| ECN Pkts                   |         9609408 |            0 |
| ECN Byts                   |     27614822366 |            0 |
| Q Depth Byts               |         7819136 |            0 |
| WD Flush Pkts              |        14107532 |            0 |
+-------------------------------------------------------------+
|                         QOS GROUP 6                         |
+-------------------------------------------------------------+
|                            |     Unicast     |  Multicast   |
+-------------------------------------------------------------+
| Tx Pkts                    |               0 |            0 |
| Tx Byts                    |               0 |            0 |
| WRED/AFD & Tail Drop Pkts  |               0 |            0 |
| WRED/AFD & Tail Drop Byts  |               0 |            0 |
| Q Depth Byts               |               0 |            0 |
| WD & Tail Drop Pkts        |               0 |            0 |
+-------------------------------------------------------------+
Verifying Configuration

switch# show policy-map interface ethernet 1/15 output detail
Class-map (queuing): c-out-8q-q5 (match-any)
bandwidth remaining percent 15
shape min 1 gbps max 1 gbps
random-detect minimum-threshold 100 kbytes maximum-threshold 200 kbytes drop-probability 100 weight 0 ecn    <- WRED with ECN: min 100 KB, max 200 KB
burst-detect rise-threshold 30000 bytes fall-threshold 30000 bytes                                           <- microburst threshold: 30 KB
PFC Statistics
TXPPP Packets: 0
RXPPP Packets: 0
PFC WD Statistics Shutdown Count: 0
Restore Count: 0
Flushed Packets: 0
queue dropped pkts : 0                      <- queue full: packet drops
queue depth in bytes : 51424                <- current queue depth: 51,424 bytes
Ingress queue discard packets : 0
Ingress queue depth in bytes : 0
ECN marked packets: 4459882                 <- ECN counters
Unicast Transmit Packets : 21651482445
Multicast Transmit Packets : 0
Unicast Transmit Bytes : 15068405583053
Multicast Transmit Bytes : 0
Unicast Dropped Packets : 47536             <- packet drops despite enabling ECN!
Multicast Dropped Packets : 0
Unicast Dropped Bytes : 135706844
Multicast Dropped Bytes : 0
Unicast Depth Bytes : 51424
Multicast Depth Bytes : 0
Flow Monitoring in Nexus Dashboard Insights

Revisit

Misconception
TCP works well for storage traffic because it has built-in reliable delivery, flow control, congestion control, etc.

Recommendation
TCP built-in mechanisms should be your insurance policy. Have them, but don’t rely on them. Solve the root cause.
Lossless Ethernet
for RoCEv2
Misconception
Unlike FC, RoCE does not suffer from slow-drain or congestion spreading

Reality
All lossless networks are prone to congestion spreading
Lossy Networks
• No flow control between directly connected devices: when the receiver’s buffers are full, new incoming packets are dropped
• Another layer (e.g., TCP) retransmits the lost packets

Speed mismatch: Host-1 (1 GbE) — Switch-1 — Target-1 (10 GbE). Traffic from Target-1 exceeds what Host-1’s link can drain, so frames are dropped at Switch-1.

BW mismatch: Targets 1-3 (10 GbE each) — Switch-1 — Host-1 (10 GbE). Combined traffic from the targets exceeds Host-1’s 10 GbE link, so frames are dropped at Switch-1.
Lossless Networks
• Flow control between directly connected devices paces the traffic rate to avoid buffer overrun on the receiver
• No absolute guarantee of lossless frame delivery, although it works quite well
• Frames are still dropped due to bit errors (CRC)

Speed mismatch: Host-1 (1 GbE) — Switch-1 — Target-1 (10 GbE). Switch-1 slows the ingress traffic down to 1 Gbps instead of dropping it.

BW mismatch: Targets 1-3 (10 GbE each) — Switch-1 — Host-1 (10 GbE). Switch-1 applies backpressure to slow the combined ingress traffic down to 10 Gbps.
Ethernet Flow Control – Two Flavors

Link-Level Flow Control (LLFC)
Paces all traffic from the directly connected device (IEEE 802.3x). Rarely used.

Priority-based Flow Control (PFC)
Paces traffic in specific classes from the directly connected device while other classes are not flow controlled (IEEE 802.1Qbb). For example, classes 4 and 5 are flow controlled while classes 1-3 and 6-8 are not.
Configuring Lossless/Converged Ethernet

1. Classifying and marking the traffic: Layer 2 (CoS) or Layer 3 (DSCP).
2. Flow control: PFC.
3. Bandwidth guarantee: Enhanced Transmission Selection (ETS).
4. Consistent implementation: Data Center Bridging Exchange (DCBX).
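A hedged sketch of what steps 1-3 can look like on a Nexus 9000; the class/policy names are examples, and DSCP 24 with qos-group 3 is a common RoCEv2 convention, not a mandate (ETS, step 3, is delivered by the bandwidth commands in a queuing policy, as in the WRED example earlier in this session):

switch(config)# class-map type qos match-all ROCEv2
switch(config-cmap-qos)# match dscp 24
switch(config)# policy-map type qos QOS_CLASSIFY
switch(config-pmap-qos)# class ROCEv2
switch(config-pmap-c-qos)# set qos-group 3
switch(config)# policy-map type network-qos QOS_NETWORK
switch(config-pmap-nqos)# class type network-qos c-8q-nq3
switch(config-pmap-nqos-c)# pause pfc-cos 3
switch(config-pmap-nqos-c)# mtu 9216
switch(config)# system qos
switch(config-sys-qos)# service-policy type network-qos QOS_NETWORK
switch(config)# interface ethernet 1/5
switch(config-if)# priority-flow-control mode on
switch(config-if)# service-policy type qos input QOS_CLASSIFY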
Traffic Classification for PFC

Layer 2 PFC
• Classify traffic using PCP/CoS in the 802.1Q VLAN header

Layer 3 PFC
• Classify traffic using DSCP in the IP header. Use cases:
  • VLAN header is not present
  • VLAN header is not carried (e.g., VXLAN)

DSCP-to-CoS mapping maintains lossless behavior across L2-L3 boundaries (e.g., VXLAN)

• The PFC Pause frame itself allows flow control of up to 8 traffic classes, regardless of how traffic is classified
• Devices in your environment may support fewer no-drop classes (Nexus 9000: 3)
Ethernet VLAN CoS and IP DSCP Mapping

DSCP Name                  DSCP Decimal     DSCP Binary                    CoS Binary   CoS Decimal
CS0                        0                000000                         000          0
CS1 / AF11 / AF12 / AF13   8 / 10/ 12 / 14  001000/001010/001100/001110    001          1
CS2 / AF21 / AF22 / AF23   16 / 18/ 20 / 22 010000/010010/010100/010110    010          2
CS3 / AF31 / AF32 / AF33   24 / 26/ 28 / 30 011000/011010/011100/011110    011          3
CS4 / AF41 / AF42 / AF43   32 / 34/ 36 / 38 100000/100010/100100/100110    100          4
CS5 / - / - / EF           40 / 42/ 44 / 46 101000/101010/101100/101110    101          5
CS6                        48 / 50/ 52 / 54 110000/110010/110100/110110    110          6
CS7                        56 / 58/ 60 / 62 111000/111010/111100/111110    111          7

The NX-OS command show system internal ipqos global-defaults shows this mapping.
Ethernet Flow Control – Pause Time
• Ethernet uses a special frame (the Pause frame) for flow control
• Pause frames carry a quanta value between 0 and 65535 for each class. Nexus and UCS always set the quanta to the maximum (65535)
• Pause time = quanta × time taken to transmit 512 bits
• Max Pause time on 10 GbE = 3.355 ms
• Max Pause time on 100 GbE = 0.336 ms

PFC Pause frame (64 bytes):
Destination MAC (6) | Source MAC (6) | EtherType (2) | Opcode (2) | Class Enable Vector (2) | Per-class quanta C0-C7 (2 each) | Padding (26) | CRC/FCS (4)
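To make the arithmetic concrete:

Pause time = quanta × time to transmit 512 bits
10 GbE:  512 bits / 10 Gbps  = 51.2 ns per quantum → 65535 × 51.2 ns ≈ 3.355 ms
100 GbE: 512 bits / 100 Gbps = 5.12 ns per quantum → 65535 × 5.12 ns ≈ 0.336 ms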
Ethernet Flow Control – Receiver and Sender
• Traffic receiver:
  • Sends a Pause frame with non-zero quanta to stop the sender from transmitting
  • Sends a Pause frame with zero quanta to request that the sender resume transmission
• Traffic sender:
  • Stops transmission for the Pause time after receiving a Pause frame with non-zero quanta
  • Starts transmission when the Pause time expires, or a Pause frame with zero quanta is received
• Pause frames have a dual purpose:
  • A Pause frame with non-zero quanta stops or pauses the traffic
  • A Pause frame with zero quanta starts or resumes the traffic (aka an Un-Pause frame)
When are Pause Frames Sent?
Scenario: Target-1 (10 GbE) sends traffic through Switch-1 to Host-1 (1 GbE). The Pause (XOFF) and Resume (XON) thresholds apply to Switch-1’s ingress no-drop queue.

• The ingress traffic rate is faster than the egress traffic rate, so the egress no-drop queue starts filling up
• Before the egress queue is completely full, backpressure is applied to the ingress no-drop queue; an egress queue-full condition is avoided because that would result in dropping of any new incoming packets
• The ingress queue is maintained only for the lossless traffic class; new incoming packets are buffered there
• As ingress queue utilization increases past the Pause (XOFF) threshold, Switch-1 sends a Pause frame with non-zero quanta
• Target-1 receives the Pause frame and stops transmission
• Switch-1 sends the next Pause frame with non-zero quanta if ingress queue utilization stays above the Resume (XON) threshold
• As ingress queue utilization decreases below the Resume threshold (after having exceeded the Pause threshold), Switch-1 sends a Pause frame with zero quanta and Target-1 resumes transmission
Ethernet Flow Control – Summary
Tight coordination between the Pause and Resume thresholds of Switch-1, and the sending of Pause frames (non-zero quanta) and Un-Pause frames (zero quanta), achieves rate equalization between the ingress port (10 GbE) and the egress port (1 GbE).
Congestion Spreading in Lossless Ethernet Networks
Host-1 (1 GbE, the culprit) sends Pause frames; Switch-1 applies backpressure toward Target-1 (10 GbE).

• Flow control from Switch-1 reduces the rate of all traffic from Target-1, regardless of destination
• Target-1 is victimized
• Other hosts (Hosts 2-40) communicating with Target-1 are indirectly victimized
PFC Storm
• A term used to indicate congestion in lossless Ethernet networks
• Refers to an excessive number of Pause frames
• May be sent by a slow end device
• May be sent by a switchport as a result of congestion spreading caused by a slow device or link overutilization
Congestion Spreading in Lossless Ethernet Networks – Slow Drain
• Host-1 (the culprit) has a slower processing rate than the rate at which traffic is being delivered to it. This is known as “slow drain” in the Fibre Channel community
• Host-1 uses PFC to slow down the traffic rate from Switch-1: a storm of Rx PFC Pause frames on the switchport
• Switch-1 exceeds its Pause threshold and, in turn, reduces the traffic rate from Targets 1-8
• Hosts 2-40 are victimized because their targets are victimized by Host-1
Congestion in Lossless Ethernet Networks (RoCEv2)
Host-1 initiates read I/O to many targets; data from all the targets arrives at the same time.
[Diagram: spine-leaf fabric (Spine-1 through Spine-4); Targets 1-5 on Leaf-1 through Leaf-5, Host-1 on Leaf-6. Read I/O requests are small packets; the data transfers are large packets.]
Cause 1 – Link Over-Utilization: the Host-1-connected switchport on Leaf-6 is already transmitting at full capacity, yet the spines keep sending more traffic for it.
Cause 2 – Slow Host: Host-1 has a slower processing rate than the rate at which traffic is being delivered to it.
These are the same causes as in lossy (TCP) networks, but the effect is different: congestion spreading.
#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 121
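To put illustrative numbers on this incast (an example calculation, not figures from the deck): if five targets each burst read data at 100 Gb/s toward a single 100 GbE host port, the fabric is offered 500 Gb/s against 100 Gb/s of egress capacity. The excess 400 Gb/s must sit in switch buffers or, in a lossless class, be held back with Pause frames; in a lossy TCP network the same mismatch would instead surface as drops and retransmissions.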
Congestion Spreading in Lossless Ethernet Networks
Cause 2 – Slow Host
[Diagram sequence, slides 122-129: a spine-leaf fabric (Spine-1 through Spine-4, Leaf-1 through Leaf-6) with Targets 1-5 on Leaf-1 through Leaf-5 and Host-1 on Leaf-6.]
1. Host-1, the culprit, slows down and sends a storm of PFC Pause frames to its switchport on Leaf-6.
2. Leaf-6 can no longer drain the no-drop queue toward Host-1; its ingress buffers exceed the Pause threshold, so Leaf-6 sends Pause frames upstream to Spine-1 through Spine-4.
3. The spines' buffers fill in turn, and they send Pause frames to Leaf-1 through Leaf-6, pausing the no-drop class across the entire fabric.
4. Targets 1-5 are paused by their leaves and become victims; every other device attached to the fabric, and every host whose I/O depends on those targets (Hosts 2-40), is victimized as well. One culprit, many victims.
#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 129
Congestion Spreading in Lossless Ethernet Networks
Cause 1 – Link Over-Utilization
[Diagram: the same fabric-wide spread of Pause frames as on the previous slides, with Host-1 as the culprit and all other attached devices as victims.]
The effect on the fabric is the same as with a slow host. The only difference is on the host-connected port: no Rx Pause frames, just high Tx utilization.
#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 130
Detecting Congestion in Lossless Ethernet Networks
Pause frame count per interface

switch # show interface priority-flow-control
============================================================
Port            Mode  Oper(VL bmap)   RxPPP        TxPPP
============================================================
Ethernet1/1     Auto  Off             4500190      63359675
Ethernet1/8     Auto  Off             846830317    530658504
Ethernet1/1/7   Auto  On  (8)         17009450     1406361381
switch #

Ports with Oper Off (Ethernet1/1 and Ethernet1/8) are using link-level flow control (LLFC); Ethernet1/1/7, with Oper On (8), is using PFC on CoS 3 (VL bitmap 8 = binary 1000).
#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 131
Detecting Congestion in Lossless Ethernet Networks
Pause frame count per queue (traffic class)

switch# show queuing interface eth1/1
<snip> PFC Statistics
---------------------------------------------------------------
TxPPP: 48668926, RxPPP: 150680840
---------------------------------------------------------------
COS  QOS Group  PG  TxPause   TxCount    RxPause   RxCount
0    -          -   Inactive  0          Inactive  0
1    -          -   Inactive  48668926   Active    150680840
2    -          -   Inactive  0          Inactive  0
3    -          -   Inactive  0          Inactive  0
4    -          -   Inactive  0          Inactive  0
5    -          -   Inactive  0          Inactive  0
6    -          -   Inactive  0          Inactive  0
7    -          -   Inactive  0          Inactive  0
---------------------------------------------------------------

"Active" indicates a currently active ingress Pause state: Eth1/1 has stopped transmission in CoS 1.
#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 132
Detecting Congestion in Lossless Ethernet Networks
Packet Drops

Remember: lossless Ethernet does not guarantee that packets are never lost.

switch # show queuing interface ethernet 1/32
<snip>
Ingress Overflow Drop Statistics
--------------------------------------------------------
All Pause Drop Pkts     0
High Pause Drop Pkts    0
Low Pause Drop Pkts     233

These packets were dropped in a no-drop queue despite PFC being enabled, because the Headroom did not have available buffers to accommodate them.
#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 133
Nexus Dashboard Insights for Monitoring PFC & ECN

#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 134
Microburst Detection – Early Indication of Congestion Spreading
[Diagram: a switch shown without and with congestion. Without congestion, microburst detection catches the egress no-drop queue briefly filling and draining while the ingress queue stays below the Pause and Resume thresholds. With congestion, the egress queue stays full, the ingress no-drop queue crosses the Pause threshold, and backpressure (Pause frames) begins.]
#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 135
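An illustrative calculation of why fine-grained detection matters (assumed numbers, not from this deck): a 1 MB queue buildup on a 100 GbE egress port drains in 1 MB x 8 / 10^11 s, about 80 microseconds. Utilization counters polled every few seconds average such a burst away entirely, while microburst detection samples queue occupancy at a granularity fine enough to reveal the burst before the ingress queue reaches the Pause threshold.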
Preventing Congestion in Lossless Ethernet

• PFC Watchdog
  • Drop frames destined for the congested end device.
• RoCEv2 Congestion Management (RCM)
  • Notify the end devices.
  • Same concept as RFC 3168 (ECN for TCP/IP networks).
  • But RoCEv2 does not use TCP, so the destination uses a special packet, the Congestion Notification Packet (CNP), to notify the sender.

#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 136
PFC Watchdog
Event: a no-drop queue is unable to transmit continuously for a timeout duration (e.g., 100 ms) because of Rx Pause frames.
Automatic action:
1. Drop egress frames in that queue.
2. Drop frames arriving on other ports of the switch that are destined to go out of this queue.
3. Drop ingress traffic (including Pause frames) on this port that belongs to the same no-drop class.
Result: victims recover from the effect of congestion spreading. (A configuration sketch follows this slide.)
[Diagram: Leaf-6 with eight traffic classes; classes 3 and 4 are no-drop; the watchdog acts on the no-drop queue toward Host-1.]
priority-flow-control watch-dog-interval <on | off>
#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 137
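A minimal NX-OS configuration sketch for enabling the watchdog (the interface is illustrative, and defaults for the poll interval and multipliers vary by release, so verify against your platform's QoS configuration guide):

switch# configure terminal
switch(config)# priority-flow-control watch-dog-interval on
switch(config)# interface ethernet 1/5
switch(config-if)# priority-flow-control mode on

The global watch-dog-interval command is the one shown on the slide above; priority-flow-control mode on simply ensures PFC is active on the interface carrying the no-drop class so that the watchdog has a queue to monitor.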
Before PFC Watchdog – One Culprit, Many Victims
[Diagram: Host-1 (culprit) storms Leaf-6 with Pause frames; congestion spreads through the spines to every leaf; Targets 1-5 and all other attached devices are victims.]
#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 138
After PFC Watchdog – One Culprit, No Victims
[Diagram: the same fabric; the watchdog on Leaf-6 drops frames in the no-drop queue toward Host-1. This frees up buffers and therefore does not invoke PFC: no congestion spreading.]
Recommendation: enable PFC Watchdog in lossless Ethernet networks.
#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 139
Verifying PFC Watchdog
switch# show queuing pfc-queue
+----------------------------------------------------+
Global watch-dog interval [Enabled]
+----------------------------------------------------+
Global PFC watchdog configuration details
PFC watch-dog poll interval : 100 ms
PFC watch-dog shutdown multiplier : 1
PFC watch-dog auto-restore multiplier : 10
PFC watch-dog fixed-restore multiplier : 0
PFC watchdog internal-interface multiplier : 2
+-------------------------------------------------------------+
| Port PFC Watchdog (VL bmap) State (Shutdown) |
+-------------------------------------------------------------+
Ethernet1/1 Disabled ( 0x0 ) - - - - - - - -
Ethernet1/2 Disabled ( 0x0 ) - - - - - - - -
Ethernet1/3 Enabled ( 0x2 ) - - - - - - N -
Ethernet1/4 Disabled ( 0x0 ) - - - - - - - -
Ethernet1/5 Enabled ( 0x2 ) - - - - - - Y -
Ethernet1/6 Disabled ( 0x0 ) - - - - - - - -
#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 140
Verifying PFC Watchdog
switch# show queuing pfc-queue interface e1/5 detail
+----------------------------------------------------+
Ethernet1/5 Interface PFC watchdog: [Enabled]
+----------------------------------------------------+
| QOS GROUP 1 [Active] PFC [YES] PFC-COS [1]
+----------------------------------------------------+
| | Stats |
+----------------------------------------------------+
| Shutdown| 1|
| Restored| 1|
| Total pkts drained| 752|
| Total pkts dropped| 2197357321|
| Total pkts drained + dropped| 2197358073|
| Aggregate pkts dropped| 2197358073|
| Total Ingress pkts dropped| 66649|
| Aggregate Ingress pkts dropped| 66649|
+----------------------------------------------------+

#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 141
PFC Watchdog Messages

2021 Aug 18 10:18:51 N9K %$ VDC-1 %$ %TAHUSD-SLOT1-2-TAHUSD_SYSLOG_PFCWD_QUEUE_SHUTDOWN: Queue 1 of Ethernet1/5 is shutdown due to PFC watchdog timer expiring

2021 Aug 18 10:19:58 N9K %$ VDC-1 %$ %TAHUSD-SLOT1-2-TAHUSD_SYSLOG_PFCWD_QUEUE_RESTORED: Queue 1 of Ethernet1/5 is restored due to PFC watchdog timer expiring; 2197358073 egress packets/66649 ingress packets dropped during the event

#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 142
Question
I configured ECN, but RoCE Congestion Management (RCM) doesn't work like TCP ECN. Why?

Answer
Most probably the end devices don't reduce their transmission rate, or don't do it correctly.
RoCEv2 Congestion Management
Similar to ECN for TCP/IP networks (RFC 3168).
[Diagram: Target-1 (source) on Leaf-1, Host-1 (destination) on Leaf-6, connected via Spine-1.]
1. The source marks ECT in the egress IP header.
2. A switch along the path detects egress congestion.
3. The congested switch marks CE in the IP header.
4. The destination detects CE in the ingress IP header and sends a Congestion Notification Packet (CNP) to the source.
5. When the CNP is received, the source initially reduces its rate and adjusts it later.
(A switch-side ECN marking sketch follows this slide.)
#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 144
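A minimal Nexus 9000 queuing-policy sketch for switch-side CE marking (a sketch only: the class name assumes the default 8q template, and the thresholds are illustrative and must be tuned to the platform's buffers):

switch(config)# policy-map type queuing custom-8q-out-policy
switch(config-pmap-que)# class type queuing c-out-8q-q3
switch(config-pmap-c-que)# random-detect minimum-threshold 150 kbytes maximum-threshold 3000 kbytes drop-probability 7 weight 0 ecn

The ecn keyword tells WRED to mark the CE bits on ECN-capable packets instead of dropping them once queue depth crosses the minimum threshold, which is step 3 in the sequence above.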
RoCEv2 Congestion Management – PFC, ECN, CNP
[Diagram, slides 145-158: Target-1 (source) on Leaf-1, Host-1 (destination) on Leaf-6 via Spine-1. CS3 RoCEv2 traffic rides a no-drop queue. Eth 1/1 is Leaf-6's ingress from Spine-1; Eth 1/2 is its egress toward Host-1. The egress queue carries WRED min/max thresholds; the ingress queue carries PFC Pause/Resume thresholds.]
Step by step:
1. No congestion: RDMA traffic flows normally in the no-drop queue.
2. Host-1 slows its processing of RDMA traffic and starts sending PFC Pause frames for CoS 3.
3. The egress CS3 queue on Eth 1/2 fills and exceeds the WRED minimum threshold.
4. Leaf-6 randomly marks the CE flag in ECN-capable packets destined for Host-1.
5. Host-1 sends CNPs to the source(s) of the CE-marked packets (Target-1).
6. By then the egress CS3 queue on Eth 1/2 is full. Because it is a no-drop queue, internal backpressure fills the ingress CS3 queue on Eth 1/1 instead of dropping packets.
7. Target-1 receives the CNPs and reduces its traffic rate toward Host-1.
8. Meanwhile the ingress queue on Eth 1/1 exceeds the Pause threshold, so Leaf-6 sends PFC Pause frames to its upstream neighbor (Spine-1), and Spine-1 slows all CS3 traffic toward Leaf-6. RCM gives no guarantee of preventing congestion spreading.
9. When Host-1 is able to process the reduced traffic rate, it stops sending Pause frames.
10. The egress CS3 queue on Eth 1/2 drains, which reduces ingress buffer utilization on Eth 1/1; Pause frames keep going to Spine-1 until that utilization falls below the Resume threshold.
11. Finally, severe congestion is prevented: the ingress buffers on Eth 1/1 are empty, and the egress CS3 queue on Eth 1/2 is empty or minimally used.
#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 158
RoCEv2 Congestion Management – PFC, ECN, CNP
[Diagram: the same Target-1 to Spine-1 to Leaf-6 to Host-1 topology, annotated with the trade-offs of each mechanism.]
• ECN/CNP: flow-level scope, but its action is delayed.
• PFC: quick action that avoids packet loss, but it has class-level scope and leads to congestion spreading.
Real Benefit
Using ECN/CNP together with PFC brings the best of both mechanisms: ECN/CNP limits the spread and duration of the congestion spreading caused by PFC. (A network-qos sketch for the no-drop class follows this slide.)
Important: you must still find out why Host-1 caused congestion.
#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 159
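For completeness, a network-qos sketch that creates the CS3 no-drop class assumed throughout this sequence (class names follow the default 8q template; verify the exact names and MTU against your NX-OS release):

switch(config)# policy-map type network-qos custom-8q-nq-policy
switch(config-pmap-nqos)# class type network-qos c-8q-nq3
switch(config-pmap-nqos-c)# pause pfc-cos 3
switch(config-pmap-nqos-c)# mtu 9216
switch(config)# system qos
switch(config-sys-qos)# service-policy type network-qos custom-8q-nq-policy

pause pfc-cos 3 is what makes the queue lossless (PFC on CoS 3), and it must be paired with the WRED/ECN queuing policy shown earlier for RCM to work end to end.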
Recommendations Summary – IP Storage Networks
1. Prefer dedicated storage networks
2. Design storage networks with minimal or no over-subscription (e.g., 100 GbE edge, 400 GbE core)
3. Operate host and storage ports at the same speed (avoid speed mismatch)
4. If considering a move from FC to IP/Ethernet, start small (one rack at a time)
5. Do not police or shape storage traffic
6. Must guarantee BW for lossless (RoCE) traffic (ETS)
7. Should guarantee BW for lossy (TCP – iSCSI & NVMe/TCP) storage traffic
8. Enable jumbo frames on end devices and switches.
9. Ensure that end devices are benefitting from the larger MTU by verifying TCP MSS (see the sketch after this list)
10. Increase the number of iSCSI or NVMe/TCP sessions on a host

#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 160
Recommendations Summary – IP Storage Networks
11. Use host-based multipathing for uniform spread of I/O

12. Spread heavy-read-I/O apps on different hosts

13. Spread volumes for heavy-write-I/O apps on different storage arrays

14. Resolve bit errors as a top priority in storage networks

15. Detect high link utilization and upgrade link speed

16. Do not rely on built-in TCP congestion control for long

17. Verify that the end devices support RCM and that their rate-reduction algorithm is effective

18. Like TCP, do not rely on RCM for long

19. Limit the number of storage ports talking to a host port and vice-versa

20. Proactively monitor TCP performance on end devices (retransmissions, changes in window size, unexpected MSS, unexpected RTO, etc.); see the sketch after this list

#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 161
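An illustrative way to track recommendation 20 on a Linux host (the counter names are standard kernel nstat names; the address is a placeholder):

host# nstat -az TcpRetransSegs TcpExtTCPTimeouts   # cumulative retransmissions and RTO events
host# ss -ti dst 192.0.2.10                        # per-session rto, cwnd, and retrans values

Retransmissions that climb during I/O bursts are the lossy-network symptom of the congestion scenarios discussed earlier in this session.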
Recommendations Summary – IP Storage Networks
21. Mark ECN early so that rate-reduction is observed before queues become full

22. Verify and enable ECN on end devices (see the sketch after this list)

23. Under-utilized (60%) links are better than highly utilized (90%) links in storage networks

24. Over-provision capacity in storage networks

25. Monitor per-interface and per-class counters

26. Enable PFC Watchdog

27. Use Nexus Dashboard Insights for congestion detection and troubleshooting

28. Monitor FEC counters to get early indications into the health of the link

29. Monitor switch queue utilization using microburst detection, q-depth monitoring, etc.

30. Read Cisco Press title: Detecting, Troubleshooting, and Preventing Congestion in Storage Networks

#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 162
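For recommendations 21 and 22, a Linux example of checking TCP ECN (per the standard kernel semantics: 0 = disabled, 1 = request and accept ECN, 2 = accept only when requested, the common default):

host# sysctl net.ipv4.tcp_ecn
host# sysctl -w net.ipv4.tcp_ecn=1

Note that this governs TCP (iSCSI and NVMe/TCP); for RoCEv2 hosts, ECN/CNP handling is implemented in the NIC, so verify it with the NIC vendor's tooling rather than the kernel TCP stack.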
Further Reading
~700 pages covering Fibre Channel, iSCSI, NVMe/TCP, RoCE
Chapters:
1. Introduction to Congestion in Storage Networks
2. Understanding Congestion in Fibre Channel Fabrics
3. Detecting Congestion in Fibre Channel Fabrics
4. Troubleshooting Congestion in Fibre Channel Fabrics
5. Solving Congestion with Storage I/O Performance Monitoring
6. Preventing Congestion in Fibre Channel Fabrics
7. Congestion Management in Ethernet Storage Networks
8. Congestion Management in TCP Storage Networks
9. Congestion Management in Cisco UCS Servers

#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 163
Complete Your Session Evaluations

Complete a minimum of 4 session surveys and the Overall Event Survey to be


entered in a drawing to win 1 of 5 full conference passes to Cisco Live 2025.

Earn 100 points per survey completed and compete on the Cisco Live
Challenge leaderboard.

Level up and earn exclusive prizes!

Complete your surveys in the Cisco Live mobile app.

#CiscoLive BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 164
Continue your education
• Visit the Cisco Showcase for related demos.
• Book your one-on-one Meet the Engineer meeting.
• Attend the interactive education with DevNet, Capture the Flag, and Walk-in Labs.
• Visit the On-Demand Library for more sessions at www.CiscoLive.com/on-demand.

Contact me at: [email protected]

BRKDCN-2945 © 2024 Cisco and/or its affiliates. All rights reserved. Cisco Public 165
Thank you

#CiscoLive
