SONiC and Linux

Guohan Lu (Microsoft)
Kalimuthu Velappan (Broadcom)
Kiran Kella (Broadcom)
Marian Pritsak (Nvidia)

9/11/2020
Agenda
• Motivation
• SONiC Architecture
• Beyond Single ASIC
• SONiC features and Linux
• System Scaling
• Full cone NAT

What is SONiC
• Software for Open Networking in the Cloud
• A collection of software components/tools
• Builds on the foundations of SAI
• Provides L2/L3 functionalities targeted for the cloud
• Linux-based switch operating system, looks and feels like Linux
• Community driven, open source effort
• Shared on GitHub, Apache License
• Believe in working code + quick iteration
SONiC Container Architecture – Single ASIC

[Diagram: systemd services manage per-function containers in user space – PMON, DHCP, SNMP, LLDP, BGP, TeamD, the Switch State Service (SWSS) with its database, and SyncD with the SAI database. The kernel below provides the PAL (sysfs), netdevs, the ASIC PCI driver, HW peripheral drivers, and network device drivers.]
Switch State Service (SSS)

[Diagram: network applications sit on top of an object library with a Redis backend (APP DB); the Orchestration Agent, SAI DB, SyncD, SAI and the ASIC sit below.]

• APP DB: persists App objects
• APP DB backend: Redis with an object DB library
• SAI DB: persists SAI objects
• Orchestration Agent: translation between App and SAI objects, resolution of dependencies and conflicts
• SyncD: syncs SAI objects between software and hardware
• Key goal: evolve components independently
SONiC Software Module

[Diagram: the SONiC base image on top of ONIE, hosting the LLDP, SNMP, DHCP relay, BGP, TEAMD, Platform, DB, SWSS and SYNCD containers.]

• TEAMD: [Link]
• LLDP: [Link]
• BGP: Quagga
• SNMP: Net-SNMP + SNMP subagent
• DHCP Relay: ISC DHCP
• Platform: sensors
• DB: Redis
• SWSS: Switch State Service
• Syncd: sairedis + syncd agent
How Routing Works in SONiC

[Diagram: routes learned from a BGP neighbor flow through BGPd and Zebra to fpmsyncd, which writes them into the APP DB; the Orchestration Agent translates them into SAI objects in the SAI DB via SAI Redis; SyncD programs the route into the ASIC through SAI. Host interface netdevs carry the BGP control traffic between the ASIC and the Linux stack.]
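To make the APP DB hop concrete, here is a hedged sketch of what an fpmsyncd-produced route entry can look like when inspected with redis-cli. The ROUTE_TABLE key/field layout follows the usual SONiC APPL_DB convention, but the database index, prefix and next hop shown are assumptions for illustration:

    # Hypothetical example: inspect a route pushed into APP DB (APPL_DB, Redis DB 0 by default)
    redis-cli -n 0 KEYS 'ROUTE_TABLE:*'
    1) "ROUTE_TABLE:192.0.2.0/24"
    redis-cli -n 0 HGETALL 'ROUTE_TABLE:192.0.2.0/24'
    1) "nexthop"
    2) "198.51.100.1"
    3) "ifname"
    4) "Ethernet0"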


How LAG Works in SONiC

[Diagram: LACP frames from the neighbor are handled by teamD over the team netdev; teamsyncd writes the LAG state into the APP DB; the Orchestration Agent translates it into SAI objects in the SAI DB via SAI Redis; SyncD programs the LAG into the ASIC through SAI. Host interface netdevs back the team device for control traffic.]
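teamd itself is driven by a small JSON configuration per LAG. The sketch below uses standard teamd.conf options; the PortChannel and member port names are hypothetical, and in SONiC this file is generated from the switch configuration rather than written by hand:

    {
        "device": "PortChannel0001",
        "runner": {
            "name": "lacp",
            "active": true,
            "fast_rate": true
        },
        "link_watch": { "name": "ethtool" },
        "ports": { "Ethernet0": {}, "Ethernet4": {} }
    }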


Beyond Single ASIC

Linux Network Namespaces For Multiple ASICs

• SONiC dockers with Linux network namespaces (see the sketch below)
• Replicate the bgp, syncd, swss, teamd, lldp and database dockers per ASIC
• Different network namespaces for the docker instances

[Diagram: per-ASIC docker sets (BGP, LLDP, Teamd, SWSS, Database, Syncd) placed in Namespace 0 through Namespace 3, with host-level SONiC services (SNMP, PMON, Telemetry) outside the per-ASIC namespaces.]
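To illustrate the kernel primitive involved (this is not SONiC's actual provisioning logic, and the namespace names are hypothetical), a per-ASIC namespace can be created and inspected like this; SONiC's scripts attach each replicated docker instance to the namespace of its ASIC:

    # Hypothetical sketch: one network namespace per ASIC
    ip netns add asic0
    ip netns add asic1
    # Processes (or containers joined to the namespace) only see that ASIC's netdevs
    ip netns exec asic0 ip link show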


How does routing work?

SONIC FEATURES AND LINUX
Marian Pritsak, August 2020

Features
• VNET
• Sflow
• ACL
• NAT
• Switch Memory Management
• SONiC Virtual Switch
VNET
• VxLAN routing in SONiC
• Connects bare metal machines to cloud VMs
• Provisioning done by a controller
  • Routes
  • Neighbors
• No VxLAN device is created in Linux
  • Underlay routing is programmed by BGP
  • As opposed to the EVPN design (WIP), which is fully reflected in Linux
SFLOW
• Psample driver ported to Linux 4.9
• NPU drivers are required to support psample
• Netlink as the host interface channel for packets
• tc_sample is used for the virtual SONiC switch testing environment (see the tc sketch below)
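For the virtual-switch case, the same sampling path can be exercised with the standard tc matchall/sample action; the interface name, rate and psample group below are placeholders:

    # Hypothetical sketch: sample 1-in-1000 ingress packets on eth0 into psample group 1
    tc qdisc add dev eth0 handle ffff: ingress
    tc filter add dev eth0 parent ffff: matchall action sample rate 1000 group 1
    # The samples are then delivered over the psample generic netlink channel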
ACL
ACL in SONiC is used for:
• Firewalling
• Data plane telemetry
• Packet mirroring
• Enable/disable NAT
• PBR

• ACL tables are not programmed in Linux
  • Except NAT
• SONiC keeps a separate table for control plane ACL
  • Implemented with iptables (a rough example follows below)
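As a rough illustration of what a control plane ACL rendered into iptables can look like (the prefix, port and rule layout are hypothetical, not the exact rules SONiC generates):

    # Hypothetical sketch: allow SSH to the switch CPU only from a management prefix
    iptables -A INPUT -p tcp --dport 22 -s 192.0.2.0/24 -j ACCEPT
    iptables -A INPUT -p tcp --dport 22 -j DROP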
NAT
• Use cases
  • DNAT
  • SNAT
  • Hairpinning
  • NAPT
  • Full cone
• All reflected in Linux with iptables
SWITCH MEMORY MANAGEMENT
Comprehensive MMU object model:
• QoS maps
• Shared buffers
• Policers
• Schedulers
• Queueing algorithms
  • WRED
  • ECN
• Fine tuning for specific use cases
  • TCP networks
  • RDMA networks
• Programmed directly through SAI (a rough sketch follows below)
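For concreteness, below is a hedged sketch (not SONiC's actual buffer orchestration code) of creating a shared ingress buffer pool directly through the SAI buffer API; the pool size is an arbitrary example value and error handling is omitted:

    /* Hedged sketch: MMU objects are programmed straight through SAI, bypassing the kernel. */
    #include <sai.h>

    static sai_object_id_t create_ingress_pool(sai_object_id_t switch_id)
    {
        sai_buffer_api_t *buffer_api = NULL;
        sai_object_id_t pool_id = SAI_NULL_OBJECT_ID;
        sai_attribute_t attrs[3];

        sai_api_query(SAI_API_BUFFER, (void **)&buffer_api);

        attrs[0].id = SAI_BUFFER_POOL_ATTR_TYPE;
        attrs[0].value.s32 = SAI_BUFFER_POOL_TYPE_INGRESS;
        attrs[1].id = SAI_BUFFER_POOL_ATTR_SIZE;
        attrs[1].value.u64 = 12766208;          /* shared pool size in bytes (example value) */
        attrs[2].id = SAI_BUFFER_POOL_ATTR_THRESHOLD_MODE;
        attrs[2].value.s32 = SAI_BUFFER_POOL_THRESHOLD_MODE_DYNAMIC;

        buffer_api->create_buffer_pool(&pool_id, switch_id, 3, attrs);
        return pool_id;
    }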


SONIC VIRTUAL SWITCH
• Virtual SONiC platform for generic feature validation
• Two flavors
  • Docker image
  • VM
• Core forwarding behavior modeled by Linux
• Virtual ports for the data plane
• L2 forwarding
• L3 routing
• Sflow
• Control plane ACL
Netlink Message Filter
- Kalimuthu Velappan
Introduction
• Netlink messaging Architecture
• System Scaling Issue
• Proposed solution
• BPF – Berkeley Packet Filter
• Q&A
Netlink Messaging Framework
• Network applications mainly use the NETLINK_ROUTE family to receive netdevice
  notifications (port up/down, MTU, speed, etc.); a minimal example follows this list
• Each netdevice notifies the netlink subsystem about changes in its port properties
• It is a broadcast domain
• The netlink subsystem posts a netlink message to the socket recv-Q of every registered application
• The application then reads the message from the recv-Q
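A minimal sketch of the subscription pattern described above (not SONiC code): open a NETLINK_ROUTE socket, join the link-notification group, and read every broadcast message, interesting or not:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>
    #include <linux/rtnetlink.h>

    int main(void)
    {
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
        struct sockaddr_nl addr;
        char buf[8192];
        ssize_t len;

        memset(&addr, 0, sizeof(addr));
        addr.nl_family = AF_NETLINK;
        addr.nl_groups = RTMGRP_LINK;            /* netdevice (RTM_NEWLINK/RTM_DELLINK) events */
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        while ((len = recv(fd, buf, sizeof(buf), 0)) > 0) {
            struct nlmsghdr *nlh;
            /* Every registered application is handed every event here, relevant or not. */
            for (nlh = (struct nlmsghdr *)buf; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len))
                printf("netlink msg type=%u len=%u\n", nlh->nlmsg_type, nlh->nlmsg_len);
        }
        close(fd);
        return 0;
    }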
System Scaling Issue

[Diagram: four applications (1: VLAN mgr, 2: teamd, 3: stpd, 4: udld) in user space, each with ~8190 messages queued in its socket recv-Q; the kernel NETLINK subsystem broadcasts a netlink message to every registered socket whenever a netdevice (e.g. Ethernet0 in 4K VLANs) changes.]

• Every net device has multiple attributes
• Any attribute change generates a netlink message notification
• Each application registers for kernel netlink notifications
• An application has to receive and process all the messages, whether it is interested in them or not
• When 4K VLANs are configured per port, ~8K netlink messages are generated
• On a scaled system:
  • More than 1M unnecessary messages can be broadcast across the system
  • Applications cannot process all the messages during config reload and system reboot
  • Due to this burstiness, important netlink messages might get delayed or dropped in the kernel (ENOBUFS)
  • Dropped netlink messages can't be retrieved!

Example: Ethernet0 is added to 4K VLANs
  <<config vlan member range add 2 4094 Ethernet0>>
Proposed Solution – Berkeley Packet Filter

• Filter to drop all unwanted netlink messages in the kernel using the Berkeley Packet Filter (BPF)
• The filter is applied per application socket in the kernel
• The filter is based on one or more message attributes
• An application gets a notification only when a requested attribute changes

[Diagram: the same four applications; with per-socket filters, only the interested applications (VLAN mgr and udld) still receive the 8190 messages, while the 8190 messages destined for the others (teamd, stpd) are dropped inside the kernel.]

Example: Ethernet0 is added to 4K VLANs
  <<config vlan member range add 2 4094 Ethernet0>>
Netlink Message Filtering Mechanism

• Berkeley Packet Filter (BPF): an interface to execute micro-ASM in the kernel as a minimal VM
• The ASM filter code is executed for every packet reception
• The return value decides whether to accept or drop the packet
• It is executed in the context of the netlink message sender

Required changes:
• Kernel patch for nlattr and nested-nlattr helper functions
• Customized eBPF filter logic to drop unnecessary messages for the application (a simplified attach sketch follows below)
• Filter logic:
  • Entry { KEY: Interface, VALUE: { Attr : Value, … } }
  • A hash map stores only the required attribute information
  • All netlink messages are filtered out except the interested attribute changes
  • The application gets a notification only when an interested attribute changes

[Diagram: the application opens fd = socket(NETLINK_ROUTE), attaches the filter with setsockopt(fd, SO_ATTACH_BPF, ...) and calls recvmsg(fd, ...); in the kernel the BPF verifier and JIT compiler install the filter on the socket, backed by a hash-map DB keyed by IFINDEX (e.g. 1 → [s:1, f:2, v:3], 64 → [s:1, f:3, v:7], 23 → [s:1, f:5, v:6]) and fed by the netlink subsystem.]
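The design above attaches a custom eBPF program (SO_ATTACH_BPF) backed by a hash map; as a simpler stand-in that illustrates the same attach point, the sketch below installs a classic BPF filter with SO_ATTACH_FILTER that keeps only RTM_NEWLINK messages and drops everything else in the kernel. It is not the authors' filter, just the mechanism:

    #include <stddef.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>
    #include <linux/netlink.h>
    #include <linux/rtnetlink.h>
    #include <linux/filter.h>

    int open_filtered_rtnl_socket(void)
    {
        /* Classic BPF loads packet data big-endian, so compare against htons(RTM_NEWLINK). */
        struct sock_filter code[] = {
            BPF_STMT(BPF_LD | BPF_H | BPF_ABS, offsetof(struct nlmsghdr, nlmsg_type)),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, htons(RTM_NEWLINK), 0, 1),
            BPF_STMT(BPF_RET | BPF_K, 0xFFFFFFFF),   /* accept: deliver the whole message */
            BPF_STMT(BPF_RET | BPF_K, 0),            /* drop inside the kernel */
        };
        struct sock_fprog prog = { .len = sizeof(code) / sizeof(code[0]), .filter = code };
        struct sockaddr_nl addr = { .nl_family = AF_NETLINK, .nl_groups = RTMGRP_LINK };

        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
        setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog));
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        return fd;    /* recvmsg() on fd now only sees RTM_NEWLINK notifications */
    }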


Support for full cone NAT
- Kiran Kumar Kella
NAT in Linux
• Linux today does NAT based on 5-tuple uniqueness of the translated conntrack entries
• For example, with an iptables rule, the following 2 traffic flows are subjected to SNAT as below:

#iptables -t nat -nvL
Chain POSTROUTING (policy ACCEPT 33097 packets, 2755K bytes)
 pkts bytes target prot opt in  out  source      destination
41987 2519K SNAT   udp  --  *   *   [Link]/24   [Link]/0     to:[Link]:1001-2000

SNAT
  SIP/SPORT [Link]:100  →  SIP/SPORT [Link]:1001
  DIP/DPORT [Link]:200  →  DIP/DPORT [Link]:200

SNAT
  SIP/SPORT [Link]:100  →  SIP/SPORT [Link]:1001
  DIP/DPORT [Link]:200  →  DIP/DPORT [Link]:200

• Both flows SNAT to the same external IP + port ([Link]:1001) as they remain 5-tuple unique
  [Protocol + SIP + SPORT + DIP + DPORT]
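The rule above can be created with a command along these lines; since the real addresses are redacted, RFC 5737 documentation addresses are used as stand-ins:

    # Hypothetical equivalent of the SNAT rule shown above
    iptables -t nat -A POSTROUTING -p udp -s 192.0.2.0/24 -j SNAT --to-source 203.0.113.1:1001-2000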
Support for full cone NAT in Linux
• RFC 3489 says:
  "Full Cone: A full cone NAT is one where all requests from the same internal
  IP address and port are mapped to the same external IP address and port.
  Furthermore, any external host can send a packet to the internal host, by
  sending a packet to the mapped external address."
• Some switching ASICs that can leverage the Linux NAT feature need full cone NAT support in Linux
• In other words, supporting full cone NAT requires 3-tuple uniqueness of the conntrack entries

SNAT
  SIP/SPORT [Link]:100  →  SIP/SPORT [Link]:1001
  DIP/DPORT [Link]:200  →  DIP/DPORT [Link]:200

SNAT
  SIP/SPORT [Link]:100  →  SIP/SPORT [Link]:1001
  DIP/DPORT [Link]:200  →  DIP/DPORT [Link]:200

SNAT
  SIP/SPORT [Link]:100  →  SIP/SPORT [Link]:1002
  DIP/DPORT [Link]:200  →  DIP/DPORT [Link]:200

DNAT
  SIP/SPORT [Link]:300  →  SIP/SPORT [Link]:300
  DIP/DPORT [Link]:100  →  DIP/DPORT [Link]:1001
Changes done in NAT/conntrack modules
• A new hash table (nf_nat_by_manip_src) is added as infrastructure to support the 3-tuple
  uniqueness. This table hashes on the translated source 3-tuple (Protocol + SIP + SPORT).
• Core changes are needed in nf_nat_core.c, in the routines get_unique_tuple() and
  nf_nat_l4proto_unique_tuple().

[Diagram: the existing nf_nat_by_source hash table and the new nf_nat_by_manip_src hash table,
both indexing the same conntrack entries (Conntrack1 … Conntrackx).]
Changes done in NAT/conntrack modules
• The new hash table is updated during SNAT to ensure a 3-tuple-unique translation (full cone)
  for a given internal IP + port.
• The same table is looked up, hashing on the destination IP + port, in the reverse direction
  during DNAT to achieve the full cone behavior.
• An enhancement is needed in the iptables tool to pass the fullcone option to the kernel:

#define NF_NAT_RANGE_FULLCONE (1 << 6)

#iptables -t nat -nvL
Chain POSTROUTING (policy ACCEPT 33097 packets, 2755K bytes)
 pkts bytes target prot opt in  out  source      destination
41987 2519K SNAT   udp  --  *   *   [Link]/24   [Link]/0     to:[Link]:1001-2000 fullcone
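With the proposed extension, the rule would be installed with an extra option that sets NF_NAT_RANGE_FULLCONE; the exact option name depends on the patched iptables, so the command below is an assumption for illustration:

    # Hypothetical: SNAT rule requesting full cone behavior via the patched iptables
    iptables -t nat -A POSTROUTING -p udp -s 192.0.2.0/24 -j SNAT --to-source 203.0.113.1:1001-2000 --fullcone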
Questions? Thank You

Broadcom Proprietary and Confidential. Copyright © 2018 Broadcom. All Rights Reserved. The term “Broadcom” refers to Broadcom Inc. and/or its subsidiaries.
