Linux Networking Explained
LinuxCon 2016, Toronto
Thomas Graf (@tgraf__)
Kernel, Cilium & Open vSwitch Team
Noiro Networks (Cisco)
Did you catch part I?
● Part II: LinuxCon, Toronto, 2016
Linux Networking Explained
Network devices, Namespaces, Routing, Veth, VLAN, IPVLAN, MACVLAN,
MACVTAP, Bonding, Team, OVS, Bridge, BPF, IPSec
● Part I: LinuxCon, Seattle, 2015
Kernel Networking Walkthrough
The protocol stack, sockets, offloads, TCP fast open, TCP small queues,
NAPI, busy polling, RSS, RPS, memory accounting
[Link]
Network Devices
● Real / Physical
– Backed by hardware
– Example: Ethernet card, WIFI, USB, ...
● Software / Virtual
– Simulation or virtual representation
– Example: Loopback (lo), Bridge (br), Virtual Ethernet (veth), ...
$ ip link
[...]
$ ip link show enp1s0f1
4: enp1s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state [...]
link/ether [Link] brd [Link]
Addresses
Do we need to consider a packet for local sockets?
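[Diagram: a received packet hits the "Local?" check. Local packets are handed to the sockets via ip_local_deliver(); all others go through ip_forward() to routing, which requires [Link] = 1. Packets sent by sockets leave via ip_output().]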
$ ip addr add [Link]/24 dev em1
$ ip address show dev em1
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP [...]
link/ether [Link] brd [Link]
inet [Link]/24 brd [Link] scope global em1
valid_lft forever preferred_lft forever
inet6 fe80::12c3:7bff:fe95:21da/64 scope link
valid_lft forever preferred_lft forever
Pro Tip: The Local Table
List all accepted local addresses:
$ ip route list table local type local
[Link]/8 dev lo proto kernel scope host src [Link]
[Link] dev lo proto kernel scope host src [Link]
[Link] dev em1 proto kernel scope host src [Link]
[Link] dev virbr0 proto kernel scope host src [Link]
H4x0r Tip: You can also modify this table after the generated
local routes have been inserted.
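The "Local?" decision can be sketched in a few lines of Python. This is a simplified model, not the kernel's code: the table below is a hypothetical stand-in for the output of `ip route list table local type local`, and its addresses are made-up examples rather than the values from the slide.

```python
import ipaddress

# Hypothetical local table (assumption): loopback plus one example host address.
LOCAL_TABLE = [
    ipaddress.ip_network("127.0.0.0/8"),
    ipaddress.ip_network("127.0.0.1/32"),
    ipaddress.ip_network("192.0.2.10/32"),
]


def classify(dst: str, forwarding_enabled: bool) -> str:
    """Return the path a received packet takes in this simplified model."""
    addr = ipaddress.ip_address(dst)
    if any(addr in net for net in LOCAL_TABLE):
        return "ip_local_deliver"   # destination is one of our own addresses
    if forwarding_enabled:
        return "ip_forward"         # not ours, but we act as a router
    return "drop"                   # not ours and forwarding is disabled
```

For example, `classify("192.0.2.10", False)` takes the local delivery path, while a foreign destination is only forwarded when forwarding is enabled.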
Routing
[Diagram: routing sits between the local sockets and several network devices]
Direct Route - endpoints are direct neighbours (L2)
$ ip route add [Link]/8 dev em1
$ ip route show
[Link]/8 dev em1 scope link
Nexthop Route - endpoints are behind another router (L3)
$ ip route add [Link]/16 via [Link]
$ ip route show
[Link]/16 via [Link] dev em1
Pro Trick: Simulating a Route Lookup
How will a packet to [Link] get routed?
$ ip route get [Link]
[Link] via [Link] dev em1 src [Link]
cache
NOTE: This is not just $(ip route show | grep). It performs an
actual route lookup on the specified destination address in the
kernel.
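Why a lookup differs from grepping the table: the kernel picks the most specific matching prefix. A minimal sketch of longest-prefix matching follows, assuming a hypothetical three-entry table (the prefixes, gateways and the `Route` type are illustrative, not taken from the slides). The real lookup also takes policy rules, multiple tables and metrics into account.

```python
import ipaddress
from typing import NamedTuple, Optional


class Route(NamedTuple):
    prefix: ipaddress.IPv4Network
    dev: str
    via: Optional[str] = None   # None means a direct (on-link) route


# Hypothetical routing table (assumption): one direct route, one nexthop
# route and a default route, all on em1.
ROUTES = [
    Route(ipaddress.ip_network("10.0.0.0/8"), "em1"),
    Route(ipaddress.ip_network("10.20.0.0/16"), "em1", via="10.0.0.254"),
    Route(ipaddress.ip_network("0.0.0.0/0"), "em1", via="10.0.0.1"),
]


def route_get(dst: str) -> Optional[Route]:
    """Return the matching route with the longest prefix, or None."""
    addr = ipaddress.ip_address(dst)
    matches = [r for r in ROUTES if addr in r.prefix]
    return max(matches, key=lambda r: r.prefix.prefixlen, default=None)
```

Here `route_get("10.20.3.4")` selects the /16 nexthop route even though the /8 direct route also matches.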
Network Namespaces
Linux maintains resources and data structures per namespace
[Diagram: Namespace 1 and Namespace 2 each hold their own addresses, sockets and routes; devices such as tap0 and eth0 belong to one namespace each]
NOTE: Not all data structures are namespace aware yet!
$ ip netns add blue
$ ip link set tap0 netns blue
$ ip netns exec blue ip address
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1
link/loopback [Link] brd [Link]
19: tap0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether [Link] brd [Link]
VLAN
Virtual Networks on Layer 2
[Diagram: Virtual Networks 1 to 3 are carried as VLAN1 to VLAN3 across a shared L2 segment]
Packet Headers:
Ethernet | VLAN | IP
$ ip link add link em1 vlan1 type vlan id 1
$ ip link set vlan1 up
$ ip link show vlan1
15: vlan1@em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP [...]
link/ether [Link] brd [Link]
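The "Ethernet | VLAN | IP" header layout can be illustrated with a short Python sketch that inserts and reads an 802.1Q tag. The function names are hypothetical; the tag format (TPID 0x8100 followed by a 16-bit field holding priority, DEI and the 12-bit VLAN ID) is the standard one.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that announces an 802.1Q tag


def add_vlan_tag(frame: bytes, vid: int, pcp: int = 0) -> bytes:
    """Insert an 802.1Q tag after the destination and source MAC (12 bytes)."""
    if not 0 <= vid <= 0xFFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    if not 0 <= pcp <= 7:
        raise ValueError("priority must fit in 3 bits")
    tci = (pcp << 13) | vid  # PCP (3 bits) | DEI (1 bit, left 0) | VID (12 bits)
    return frame[:12] + struct.pack("!HH", TPID_8021Q, tci) + frame[12:]


def parse_vlan_id(frame: bytes):
    """Return the VLAN ID of a tagged frame, or None if it is untagged."""
    if len(frame) < 16:
        return None
    tpid, tci = struct.unpack("!HH", frame[12:16])
    return tci & 0x0FFF if tpid == TPID_8021Q else None
```

Tagging grows the frame by 4 bytes and pushes the original EtherType behind the tag.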
Bonding / Team
Link Aggregation
● Uses:
– Redundant network cards (failover)
– Connect to multiple ToR (LB)
[Diagram: team0 device]
● Implementations:
– Team (new, user/kernel)
– Bonding (old, kernel only)
$ cp /usr/share/doc/teamd-*/example_configs/activebackup_ethtool_1.conf .
$ teamd -g -f activebackup_ethtool_1.conf -d
[...]
$ teamdctl team0 state
[...]
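The failover idea behind an active-backup setup can be sketched as a tiny selection function. This is only a model under stated assumptions: the function name, the "stay on the current port while its link is up" policy and the port names are illustrative and do not reproduce teamd's actual runner, which also supports priorities and other options.

```python
def select_active(ports, link_up, current=None):
    """Pick the port that should carry traffic in a simplified active-backup model.

    ports    -- member ports in preference order
    link_up  -- dict mapping port name to link state (True = carrier present)
    current  -- the currently active port, if any
    """
    # Keep the current port as long as its link is up (avoids needless flapping).
    if current is not None and link_up.get(current, False):
        return current
    # Otherwise fail over to the first member port that has link.
    for port in ports:
        if link_up.get(port, False):
            return port
    return None  # no usable port left
```

With ports `["eth0", "eth1"]`, losing the link on eth0 moves traffic to eth1.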
Veth
Virtual Ethernet Cable
● Bidirectional FIFO
● Often used to cross namespaces
[Diagram: veth0 in Namespace 1 connected to veth1 in Namespace 2]
$ ip link add veth1 type veth peer name veth2
$ ip link set veth1 netns ns1
$ ip link set veth2 netns ns2
Bridge
Virtual Switch
● Flooding: Clone packets and send to all ports.
● Learning: Learn who's behind which port to avoid flooding
● STP: Detect wiring loops and disable ports
● Native VLAN integration
● Offload: Program HW based on FDB table
[Diagram: br0 with three ports]
$ ip link add br0 type bridge
$ ip link set eth0 master br0
$ ip link set tap3 master br0
$ ip link set br0 up
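Flooding and learning fit in a few lines. The class below is a minimal sketch of a learning switch, assuming a hypothetical `LearningBridge` type with a plain dict as FDB; it leaves out ageing, STP, VLANs and offload.

```python
class LearningBridge:
    """Toy model of a bridge's flooding and learning behaviour."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.fdb = {}  # forwarding database: MAC address -> port

    def receive(self, in_port, src_mac, dst_mac):
        """Process one frame and return the list of ports it is sent out of."""
        self.fdb[src_mac] = in_port          # learning: remember where src lives
        out_port = self.fdb.get(dst_mac)
        if out_port is None:                 # unknown destination: flood
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:              # destination is behind the same port
            return []
        return [out_port]                    # known destination: single port
```

The first frame towards an unknown MAC is flooded; once the peer has answered, later frames go out of a single port.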
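Example
Bridge + Team + Veth
[Diagram: in the host namespace, br0 connects veth0, veth1 and team0. The veth devices lead to eth0 in the Container A and Container B namespaces; team0 sits on top of the physical devices eth0 and eth1.]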
MACVLAN
Simplified bridging for guests
● NOT 802.1Q VLANs
● Multiple MAC addresses on single interface
● KISS - no learning, no STP
● Modes:
– VEPA (default): Guest to guest done on ToR, L3 fallback possible
– Bridge: Guest to guest in software
– Private: Isolated, no guest to guest
– Passthrough: Attaches VF (SR-IOV)
[Diagram: slaves macvlan0 (MAC1) and macvlan1 (MAC2) on top of the master physical device]
$ ip link add link em1 name macvlan0 type macvlan mode bridge
$ ip -d link show macvlan0
23: macvlan0@em1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN [...]
link/ether [Link] brd [Link] promiscuity 0
macvlan mode bridge addrgenmode eui64
$ ip link set macvlan0 netns blue
Example
Team + MACVLAN
[Diagram: in the host namespace, team0 sits on top of the physical devices eth0 and eth1. The Container A and Container B namespaces each have an eth0 that is a macvlan device on top of team0.]
TUN/TAP
A gate to user space
● Character Device in user space
● Network device in kernel space
● L2 (TAP) or L3 (TUN)
● Uses: encryption, VPN, tunneling, virtual machines, ...
[Diagram: file descriptors in user space are backed by the kernel devices tun0 and tap0]
$ ip tuntap add tun0 mode tun
$ ip link set tun0 up
$ ip link show tun0
18: tun0: <NO-CARRIER,POINTOPOINT,MULTICAST,NOARP,UP> mtu 1500 qdisc fq_codel [...]
link/none
$ ip route add [Link]/24 dev tun0
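user.c:
/* needs <fcntl.h>, <string.h>, <sys/ioctl.h>, <net/if.h>, <linux/if_tun.h> */
struct ifreq ifr = {0};
ifr.ifr_flags = IFF_TAP | IFF_NO_PI;   /* L2 frames, no packet-info header */
int fd = open("/dev/net/tun", O_RDWR);
strncpy(ifr.ifr_name, "tap0", IFNAMSIZ - 1);
ioctl(fd, TUNSETIFF, (void *) &ifr);   /* create or attach to tap0 */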
MACVTAP
Bridge + TAP = MACVTAP
● A TAP with an integrated bridge
● Connects VM/container via L2
● Same modes as MACVLAN
[Diagram: the user-space character devices /dev/tap2 and /dev/tap3 map to the kernel devices macvtap2 (MAC1) and macvtap3 (MAC2) on top of the physical device]
$ ip link add link em1 name macvtap0 type macvtap mode vepa
$ ip -d link show macvtap0
20: macvtap0@em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP [...]
link/ether [Link] brd [Link]
macvtap mode vepa addrgenmode eui64
$ ls -l /dev/tap20
crw-------. 1 root root 241, 1 Aug 8 21:08 /dev/tap20
IPVLAN
MACVLAN for Layer 3 (L3)
● Can hide many containers behind a single MAC address.
● Shared L2 among slaves
● Modes:
– L2: Like MACVLAN w/ single MAC
– L3: L2 deferred to master namespace, no multicast/broadcast
[Diagram: slaves ipvlan0 (IP1) and ipvlan1 (IP2) on top of the master physical device]
$ ip netns add blue
$ ip link add link eth0 ipvl0 type ipvlan mode l3
$ ip link set dev ipvl0 netns blue
$ ip netns exec blue ip link set dev ipvl0 up
$ ip netns exec blue ip addr add [Link]/24 dev ipvl0
MACVLAN vs IPVLAN
● MACVLAN
– ToR or NIC may have maximum MAC address limit
– Doesn't work well with 802.11 (wireless)
● IPVLAN
– DHCP based on MAC doesn't work, must use client ID
– EUI-64 IPv6 address generation issues
– No broadcast/multicast in L3 mode
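The EUI-64 issue follows from how the address is derived: the interface identifier is computed from the MAC, so IPVLAN slaves that share the parent's MAC would all derive the same address. A minimal sketch, assuming a hypothetical helper name; the example MAC is not shown in the slides but inferred by inverting the fe80:: address in the earlier `ip address show dev em1` output.

```python
import ipaddress


def eui64_link_local(mac: str) -> ipaddress.IPv6Address:
    """Derive the EUI-64 based IPv6 link-local address for a MAC address."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    octets[0] ^= 0x02                                   # flip the universal/local bit
    iid = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])  # insert ff:fe in the middle
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + iid)  # fe80::/64 + interface ID


# Assumed MAC (inferred, see above); two IPVLAN slaves would share it.
PARENT_MAC = "10:c3:7b:95:21:da"
slave_a = eui64_link_local(PARENT_MAC)
slave_b = eui64_link_local(PARENT_MAC)
```

Both slaves end up with the same link-local address, which is why identical MACs and EUI-64 generation do not mix.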
Encapsulation (Tunnels)
Virtual Networks on Layer 3/4
[Diagram: Virtual Networks 1 to 3 are carried as vxlan1 to vxlan3 across a shared L3/L4 network]
VXLAN Headers example:
Ethernet | IP | UDP | VXLAN | Ethernet | IP | TCP
(outer headers: underlay; inner headers: overlay)
$ ip link add vxlan42 type vxlan id 42 group [Link] dev em1 dstport 4789
$ ip link set vxlan42 up
$ ip link show vxlan42
31: vxlan42: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN [...]
link/ether [Link] brd [Link]
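The mtu 1450 in the output above follows from the encapsulation overhead. The sketch below, with hypothetical function names, builds the 8-byte VXLAN header for a given VNI and subtracts the overhead from the underlay MTU, assuming an IPv4 underlay without IP options.

```python
import struct

# Per-packet overhead in bytes, assuming an IPv4 underlay without options.
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HDR = 8
INNER_ETHERNET = 14


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags byte 0x08 ("VNI present") + 3 reserved bytes.
    # Word 2: 24-bit VNI + 1 reserved byte.
    return struct.pack("!II", 0x08 << 24, vni << 8)


def overlay_mtu(underlay_mtu: int) -> int:
    """MTU left for the inner IP packet after VXLAN encapsulation."""
    return underlay_mtu - (OUTER_IPV4 + OUTER_UDP + VXLAN_HDR + INNER_ETHERNET)
```

With a 1500-byte underlay MTU, 50 bytes go to the outer IP, UDP, VXLAN and inner Ethernet headers.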
IPSec
Authenticated & Encrypted
● AH: Authentication
● ESP: Authentication + encryption
Transport Mode:
Ethernet | IP | ESP | TCP
Tunnel Mode:
Ethernet | IP | ESP | IP | TCP
[Diagram: sockets on two hosts, connected through their netdevices across L3]
$ ip xfrm state add src [Link] dst [Link] proto esp \
spi 0x53fa0fdd mode transport reqid 16386 replay-window 32 \
auth "hmac(sha1)" 0x55f01ac07e15e437115dde0aedd18a822ba9f81e \
enc "cbc(aes)" 0x6aed4975adf006d65c76f63923a6265b \
sel src [Link]/0 dst [Link]/0
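A quick sanity check on the keys passed to `ip xfrm state add`: the hex strings determine the key sizes. The sketch below only inspects the two keys from the command above and restates the two header layouts; the `esp_layout` helper is a hypothetical name used for illustration.

```python
# Keys copied from the `ip xfrm state add` command above (without the 0x prefix).
AUTH_KEY = bytes.fromhex("55f01ac07e15e437115dde0aedd18a822ba9f81e")
ENC_KEY = bytes.fromhex("6aed4975adf006d65c76f63923a6265b")

AUTH_KEY_BITS = len(AUTH_KEY) * 8   # key for hmac(sha1)
ENC_KEY_BITS = len(ENC_KEY) * 8     # key for cbc(aes), i.e. the AES variant in use


def esp_layout(mode: str) -> list:
    """Return the header order for ESP in the given mode, as shown on the slide."""
    layouts = {
        "transport": ["Ethernet", "IP", "ESP", "TCP"],
        "tunnel": ["Ethernet", "IP", "ESP", "IP", "TCP"],  # inner IP is protected too
    }
    return layouts[mode]
```

The authentication key is 160 bits and the encryption key is 128 bits, so the state uses AES-128 in CBC mode.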
Open vSwitch
● Fully programmable L2-L4 virtual switch with APIs: OpenFlow and OVSDB
● Split into a user and kernel component
● Multiple control plane integrations:
– OVN, ODL, Neutron, CNI, Docker, ...
[Diagram: ovs0 with three ports]
$ ovs-vsctl add-br ovs0
$ ovs-vsctl add-port ovs0 em1
$ ovs-ofctl add-flow ovs0 in_port=1,actions=drop
$ ovs-vsctl show
a425a102-c317-4743-b0ba-79d59ff04a74
Bridge "ovs0"
Port "em1"
Interface "em1"
[...]
BPF
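[Diagram: in user space, LLVM/clang compiles source code to byte code. In the kernel, the byte code passes the verifier and JIT (producing native instructions such as "add eax,edx; shl eax,2") and runs at sockets and at the TC ingress and egress hooks between the netdevices and the network stack.]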
Attaching a BPF program to eth0 at ingress:
$ clang -O2 -target bpf -c code.c -o code.o
$ tc qdisc add dev eth0 clsact
$ tc filter add dev eth0 ingress bpf da obj code.o sec my-section1
$ tc filter add dev eth0 egress bpf da obj code.o sec my-section2
BPF Features
(As of Aug 2016)
● Maps
– Arrays (per CPU), hashtables (per CPU)
● Packet mangling
● Redirect to other device
● Tunnel metadata (encapsulation)
● Cgroups integration
● Event notifications via perf ring buffer
XDP – Express Data Path
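[Diagram: in user space, LLVM/clang compiles source code to byte code. In the kernel, the byte code passes the verifier and JIT (producing native instructions such as "add eax,edx; shl eax,2") and runs in the netdevice driver with access to the DMA buffer, below the network stack and sockets.]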
Q&A
Learn more about networking with BPF:
Fast IPv6-only Networking for Containers Based on BPF and XDP
Wednesday August 24, 2016 4:35pm – 5:35pm, Queen's Quay
Contact:
● Twitter: @tgraf__
● Mail: tgraf@[Link]