Distributed Switch Architecture
A.K.A. DSA

Abstract

The Distributed Switch Architecture was first introduced to Linux nearly 10 years ago. After being mostly quiet for 6 years, it recently became actively worked on again by a group of tenacious contributors. In this paper, we will cover its design goals and paradigms, and why they make it a good fit for supporting small home/office routers and switches. We will also cover the work that was done over the past 4 years, the relationship with switchdev and the networking stack, and finally give a heads-up on the upcoming developments to be expected.

[Figure 1: The basic DSA setup. A CPU with DRAM and Ethernet, MDIO, I2C and SPI controllers is connected to an Ethernet switch via port 8; the Ethernet controller carries the data path, the other controllers the control path.]

Introduction

Distributed Switch Architecture is a Marvell SOHO switch term. However, as is often the case with the Linux kernel, the code to support it has been generalised, and now supports a number of different vendors' Ethernet switches.
The basic hardware configuration for DSA is shown in Figure 1. The Ethernet switch has one port dedicated to passing Ethernet frames to/from the CPU, port 8 in the figure. This port is connected to an Ethernet controller of the CPU acting as the management interface. The CPU's Ethernet controller is referred to as the 'master' interface, while the switch port is referred to as the 'cpu' port. The remaining switch ports are user ports. DSA provides a Linux network interface for these user ports, known as 'slave' interfaces. The slave interfaces are standard Linux network interfaces, as shown in Figure 2, from the ZII devel B board. eth1 is the 'master' interface, and the 'slave' interfaces are lan* and optical*. Overall, this forms the data plane.

The Ethernet switch is also connected to the CPU via a management interface. Often this is MDIO, but it can also be I2C, SPI, or memory mapped. The management interface is used to configure the switch, retrieve status, and access statistics counters. Overall, this forms the control plane.

Ports 0 to 2 of the switch connect directly to RJ45 connectors. In this case, the Ethernet PHY is embedded within the switch, and managed via the switch management interface. Typically this is achieved via the switch having an internal MDIO bus, and exporting registers to control this MDIO bus. The DSA software framework exports this MDIO bus to Linux as a normal MDIO bus. Thus the PHYs on the bus can be probed, the existing Linux PHY drivers used, and the PHYs associated to the Linux slave interfaces representing the switch ports.

Port 3 shows a Fiber interface. Typically this is controlled and monitored via I2C, and would be connected to the host's I2C controller. Again, this Fiber module is associated to the slave interface and can be managed using standard Linux tools.

Lastly, ports 4 and 5 use external PHYs, connected via RGMII to the switch. Either the PHYs are managed via the switch's own MDIO bus, as used by the internal PHYs, or they can be connected to the CPU's MDIO bus. As with the internal PHYs, Linux can manage the external PHYs and associate them to the Linux slave interfaces representing the switch ports.

DSA is however not limited to a single switch. Figure 3 shows an architecture of multiple switches connected together. This is the D in DSA, a distributed switch fabric. Currently, Linux only supports Marvell switches in this configuration, however the concept is generic, so other switch vendors could be supported.
# ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether ec:fa:aa:01:12:fe brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether 06:34:73:83:15:6b brd ff:ff:ff:ff:ff:ff
4: lan0@eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
link/ether 06:34:73:83:15:6b brd ff:ff:ff:ff:ff:ff
5: lan1@eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
link/ether 06:34:73:83:15:6b brd ff:ff:ff:ff:ff:ff
6: lan2@eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
link/ether ce:00:11:22:33:44 brd ff:ff:ff:ff:ff:ff
7: lan3@eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
link/ether 06:34:73:83:15:6b brd ff:ff:ff:ff:ff:ff
8: lan4@eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
link/ether 06:34:73:83:15:6b brd ff:ff:ff:ff:ff:ff
9: lan5@eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 06:34:73:83:15:6b brd ff:ff:ff:ff:ff:ff
10: lan6@eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
link/ether 06:34:73:83:15:6b brd ff:ff:ff:ff:ff:ff
11: lan7@eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
link/ether 06:34:73:83:15:6b brd ff:ff:ff:ff:ff:ff
12: lan8@eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
link/ether 06:34:73:83:15:6b brd ff:ff:ff:ff:ff:ff
13: optical3@eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 06:34:73:83:15:6b brd ff:ff:ff:ff:ff:ff
14: optical4@eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
link/ether 06:34:73:83:15:6b brd ff:ff:ff:ff:ff:ff
Again, one switch is connected to the CPU via an Ethernet controller to form the data plane between the CPU and the switches. This port is referred to as the 'cpu' port. And there is a management plane via MDIO, SPI, I2C, or MMIO. However, the data plane is extended to the cascaded switches via the 'dsa' ports. These ports are used to connect switches together, so that frames can be passed between switches, or forwarded to the CPU via its Ethernet controller. The management plane is extended, in that each switch is connected to the management plane. Note that 'dsa' ports are not visible to the user as normal network devices.

The distributed nature of the switch is hidden from the user. Only a collection of Linux network interfaces are seen. Figure 2 illustrates this, in that the board actually has three switches.

Industrial Switches/Routers

There have been a number of contributions to DSA drivers from industrial switch/router vendors in the transport industry. DSA has been flying in aircraft inflight entertainment (IFE) systems for a number of years. Busses and trains are becoming more networked, in order to provide passenger information systems, with DSA being used in the network equipment. Figures 6 and 7 show a couple of example devices.

History

DSA is not a new subsystem in the Linux kernel. It was added in 2008, with support for a limited number of Marvell SOHO switches (Linkstreet product line). However, after the initial contribution, development was dormant, as can be seen
Figure 5: Broadcom BCM97445VMS Board with a BCM53125 Switch at the top-left
[Figure: plot spanning 2008-11 to 2016-12, y-axis 0 to 4000]

The Switch as a Hardware Accelerator

When swconfig was rejected, there were a number of different ideas how Ethernet switches, and other network accelerators
[Figure 9: Ingress tagged (CPU towards switch) frame: MAC DA | MAC SA | Switch tag | Ether type | Payload | FCS]

Figure 10: Marvell EDSA tag shown in Wireshark

[Figure 11: Processing the Switch Tag. Flowchart: eth_type_trans(); if netdev_uses_dsa(), call XX_tag_rcv(); if the switch port is valid, set skb dev = sw0p0 and continue to ip_rcv(), otherwise Discard.]

Frames sent from the CPU to the switch are tagged with an additional header, as shown in Figure 9. The top frame in the figure is that passed to a slave interface by the Linux network stack. The bottom frame is that which egresses the master interface, the CPU network controller, and ingresses to the switch. The switch tag, which is generally added after the source MAC address, is used to direct the frame out a specific vector of ports of the switch. Additionally, when there are multiple switches, it indicates which switch the egress port belongs to. The tag indicates if this is an ingress or egress frame, relative to the switch. The metadata varies between tagging protocols, but can for example indicate the presence of a VLAN tag within the switch tag, the CFI, or the frame priority.
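The tag insertion just described can be sketched in plain C. This is an illustrative model only: the 4-byte tag layout below is a hypothetical simplification, not the actual Marvell DSA/EDSA format, which defines its own size and fields.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: a hypothetical 4-byte switch tag carrying the
 * destination switch and port. Real tagging formats (Marvell DSA/EDSA,
 * Broadcom, ...) differ in size and layout. */
struct sketch_tag {
    uint8_t direction; /* 0 = ingress to the switch (from the CPU) */
    uint8_t sw_index;  /* which switch in the fabric */
    uint8_t port;      /* egress port on that switch */
    uint8_t priority;  /* frame priority metadata */
};

/* Insert the tag after the source MAC address (offset 12), shifting the
 * EtherType and payload up by sizeof(struct sketch_tag). 'frame' must
 * have room for the extra bytes. Returns the new frame length. */
static size_t tag_insert(uint8_t *frame, size_t len,
                         const struct sketch_tag *tag)
{
    const size_t off = 12; /* MAC DA (6 bytes) + MAC SA (6 bytes) */

    memmove(frame + off + sizeof(*tag), frame + off, len - off);
    memcpy(frame + off, tag, sizeof(*tag));
    return len + sizeof(*tag);
}
```

The EtherType the network stack wrote is simply pushed four bytes further into the frame, which is why the switch, and not the stack, must understand the tagged layout.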
Frames which egress the switch to the CPU Ethernet controller have a similar switch tag. The metadata may indicate why the switch egressed the frame to the CPU. The source port indicates the ingress port of the switch, and when there are multiple switches, which switch the ingress port belongs to.

Figure 10 shows a Wireshark dissection of an Ethernet frame with a Marvell EDSA tag. The NTP frame is being sent by the CPU to egress port 3 of switch 0.

DSA has a number of protocol taggers to insert/remove the switch tags. Currently there are taggers for Marvell DSA and EDSA, Broadcom, and Qualcomm, and the Mediatek tagger is under review.

Figure 11 shows how these tagging protocols are used. The frame from the switch is received by the CPU's Ethernet controller, and the driver calls netif_receive_skb() to pass the frame to the network stack in the normal way. eth_type_trans() is called to determine the Ether Type of the frame. As part of eth_type_trans(), a check is made to see if the ingress interface is a DSA master interface, i.e. netdev_uses_dsa(). If so, tagged frames are expected. The tag protocol receiver function is then invoked on the frame. This extracts the information from the tag, and then removes the tag from the frame. If the switch ingress port is valid, the DSA slave interface is determined, and the ingress interface is updated in the skb to point to the slave device. The frame is then again passed to the network stack using netif_receive_skb(). This time the true Ether Type can be extracted from the frame, and the frame is passed on for IP processing, etc.

The transmit path is similar. The slave's transmit function invokes the tagger transmit function. It inserts the switch tag, and then calls the master interface's transmit function via dev_queue_xmit().

This way of popping or pushing the switch tag is completely standard and uses Linux's way of dealing with a stack of devices on top of each other.

Control Plane

The control plane for switches in the DSA framework makes use of switchdev to interface with the Linux network stack's control plane.
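Before moving on, the tagger receive flow of Figure 11 can be sketched as self-contained C. This is a simplified model, assuming a hypothetical 4-byte tag and a fixed port count; the kernel's real taggers live under net/dsa and operate on skbs rather than raw buffers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define N_PORTS 9 /* hypothetical: ports 0..8, with port 8 as 'cpu' port */

/* Illustrative only: a hypothetical 4-byte tag carrying the frame's
 * source switch and port; real tagging formats differ. */
struct sketch_tag {
    uint8_t direction; /* 1 = egress from the switch (towards the CPU) */
    uint8_t sw_index;
    uint8_t port;
    uint8_t priority;
};

/* Model of the tagger receive hook: validate the source port, pick the
 * slave interface for it, and strip the tag so the true EtherType is
 * visible again. Returns the slave index, or -1 to discard the frame. */
static int tag_rcv(uint8_t *frame, size_t *len)
{
    const size_t off = 12; /* tag sits after MAC DA + MAC SA */
    struct sketch_tag tag;

    memcpy(&tag, frame + off, sizeof(tag));
    if (tag.port >= N_PORTS)
        return -1; /* the "Discard" branch of Figure 11 */

    /* Remove the tag: shift the EtherType and payload back down. */
    memmove(frame + off, frame + off + sizeof(tag),
            *len - off - sizeof(tag));
    *len -= sizeof(tag);
    return tag.port; /* index of the slave interface */
}
```

After this, re-running the EtherType lookup on the shortened frame sees the original protocol, mirroring the second pass through netif_receive_skb() described above.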
switchdev

switchdev is a stateless framework within the kernel stack which lives under net/switchdev. It provides the needed control knobs within the network stack's control plane to push tasks which can be offloaded down to the hardware. It does this by offering a number of switchdev_ops, which switch-like devices can implement. Examples of this are adding/removing a VLAN to a port, adding/removing a forwarding database entry to a port, changing the spanning tree protocol state of a port, etc. In order to support the diverse ways VLANs, forwarding database entries, etc. can be represented in hardware, switchdev provides an abstract model of these objects. It is the responsibility of the ops implementer to translate the abstract representation into a concrete representation needed by the switch.

switchdev is not a driver model. It does not define what a switch is. It just defines operations that switch-like devices may implement. This makes the API flexible to a wide range of hardware. The main user of this API is switches, but it can also be used with Ethernet controllers with SRIOV VF functionality, etc.

Additionally, switchdev is not involved in the data plane, only at the control plane level.

In summary, switchdev is an abstraction the network stack uses to offload tasks down to the underlying hardware.

DSA vs. switchdev in the Control Plane

The DSA core framework lives under net/dsa, with the device drivers in drivers/net/dsa. Unlike switchdev, DSA maintains a little state. However, it aims to keep as much state as possible within the switch, not the driver. DSA provides an abstract model of a switch. Each switch has a dsa_switch structure to represent it. The dsa_switch structure contains a list of operations, dsa_switch_ops, which can be performed on the switch. In order to support the D in DSA, a collection of switches in a tree is represented by a dsa_switch_tree. And going the other way in the hierarchy, each dsa_switch has a number of dsa_port structures to represent each port of the switch.

Given the abstract model of a switch, DSA binds the switch to the Linux network stack, by implementing the netdev_ops and ethtool_ops, using the dsa_switch_ops to call into the switch driver. Additionally, DSA implements the switchdev_ops by again calling into the switch driver via dsa_switch_ops.

DSA also provides a well defined device tree binding to describe the switch ports, their names, their connection to an internal/external PHY, and how they are interconnected in a D in DSA system.

In summary, DSA provides the glue between the network stack and the switch device drivers.

Future Development Work

DSA is not complete. In fact, there is a lot left to do when comparing the features supported by DSA with the ones supported by switchdev devices like the Mellanox mlxsw [2]. The bottleneck is the availability of developers to implement these features, not the framework itself.

It is hoped the following features will appear during 2017:

• Merge the Mediatek driver. This driver is currently under review and might be merged before this paper is even presented!

• Add support for Microchip devices. Microchip is working on a driver and hopes to contribute it soon.

• Multiple CPU ports. Some WiFi access points have two ports connected to CPU Ethernet controllers, in order to increase the bandwidth between the CPU and the switch. However, DSA is currently limited to a single CPU Ethernet controller. The vendor firmware configures one of the two CPU interfaces and the switch in a straight-through manner, to implement the WAN port of the device. Although simple, this potentially does not make the best use of the available bandwidth. The tagging headers already guarantee traffic segregation, so there is no need to dedicate a CPU Ethernet controller to the WAN port. DSA will be extended to allow multiple CPU ports to be defined, and where possible, implement basic load balancing across these CPU ports. Each CPU port will send traffic to a subset of the switch's ports.

• IGMP snooping. Currently, all multicast traffic is flooded to all interfaces within the switch. However, these switches have the ability to detect IGMP packets and direct them to the CPU. The Linux bridge already supports IGMP snooping, so feeding these IGMP packets to the bridge will allow the bridge to decide which interfaces multicast frames should egress, and which interfaces have no interest in the multicast frames and can be blocked. By implementing the needed switchdev callbacks, this knowledge can be pushed down into the switch to control the flooding. This is particularly important when the CPU is low powered, aimed at simply managing the switch. It has no interest in the multicast data itself, and a high volume of multicast traffic can overload it.

• Better D in DSA for Marvell switches. Currently, the distributed part of DSA is primitive. The support for VLANs spanning multiple switches is limited. Bridges spanning multiple switches may leak frames, etc. Work is in progress to improve this.

• Better support for Fiber interfaces. SFP modules are being seen on consumer devices, and industrial routers often have SFP modules.

• Improved automated testing using open source software (Ostinato) [3].

There are also some more long term goals:

• Team/Bonding support.

• TCAM support to offload parts of the firewall.

• Qualcomm hardware NAT.

• Metering, broadcast storm suppression.

• More TC support for QoS priorities and maps and other offloads.

It would also be good to have more vendor-endorsed development. We are already in a good position with 4 vendors supporting their own devices. But there are more vendors and devices out there. It does however seem that switch vendors are now realizing that to be part of the Linux kernel, they have to use switchdev, and where appropriate, DSA.
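As a rough illustration of the ops-table pattern used by switchdev and DSA as described above, a driver fills in a structure of function pointers and the core calls through it, refusing operations the driver leaves unimplemented. The names below (sketch_switch_ops, demo_vlan_add, etc.) are hypothetical stand-ins, not the kernel's actual dsa_switch_ops or switchdev_ops definitions.

```c
#include <stdio.h>

/* Illustrative stand-in for an ops table: the core framework defines
 * the abstract operations, and each switch driver supplies concrete
 * implementations. Names here are hypothetical. */
struct sketch_switch_ops {
    int (*port_vlan_add)(int port, int vid);
    int (*port_stp_state_set)(int port, int state);
};

struct sketch_switch {
    const char *name;
    const struct sketch_switch_ops *ops;
};

/* A "driver" providing one concrete implementation. */
static int demo_vlan_add(int port, int vid)
{
    printf("demo: add VLAN %d to port %d\n", vid, port);
    return 0;
}

static const struct sketch_switch_ops demo_ops = {
    .port_vlan_add = demo_vlan_add,
    /* .port_stp_state_set left NULL: operation not supported */
};

/* The "core": translate an abstract request into a driver call,
 * falling back to an error when the op is not implemented. */
static int core_vlan_add(struct sketch_switch *sw, int port, int vid)
{
    if (!sw->ops->port_vlan_add)
        return -1; /* the kernel would return -EOPNOTSUPP */
    return sw->ops->port_vlan_add(port, vid);
}

static int core_stp_state_set(struct sketch_switch *sw, int port, int state)
{
    if (!sw->ops->port_stp_state_set)
        return -1;
    return sw->ops->port_stp_state_set(port, state);
}
```

The core stays hardware-agnostic: it only ever manipulates the abstract model, leaving each driver to translate the request into its own register writes.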
Conclusions

DSA is now a mature and working subsystem which has received support from a fair number of contributors actively using it in existing products. Although there is still a long way to go in terms of feature completeness regarding what existing Ethernet switches can do, the fundamental paradigm that a switch port should be a Linux network device has been proven successful.

DSA benefits from working on a product space that is today largely mature and sees few radical changes that would require a complete redesign. The latest major change was in the device driver model aspect and has since opened the door to supporting many more devices. Having such devices to support allows developers to focus on bringing additional features into what Linux can already do, and therefore pushing for better integration of offloads.

Ultimately, the goal of getting a device supported in Linux is to gain finer and better control over what existing WiFi access points/routers and other Linux based network products can do. Better control allows building reliable, scalable and sustainable networks with equally scalable open source solutions, benefiting everyone.
References

[1] net: phy: add Generic Netlink switch configuration API,
    https://www.spinics.net/lists/netdev/msg254794.html
[2] Mellanox Technologies Switch ASICs support,
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlxsw
[3] Ostinato Network Traffic Generator,
    http://ostinato.org/