Computer Networks
Release 1.0
Peter L Dordal
CONTENTS
0 Preface
0.1 Classroom Use
0.2 Progress Notes
0.3 Technical considerations
1 An Overview of Networks
1.1 Layers
1.2 Bandwidth and Throughput
1.3 Packets
1.4 Datagram Forwarding
1.5 Topology
1.6 Routing Loops
1.7 Congestion
1.8 Packets Again
1.9 LANs and Ethernet
1.10 IP - Internet Protocol
1.11 DNS
1.12 Transport
1.13 Firewalls
1.14 Network Address Translation
1.15 IETF and OSI
1.16 Berkeley Unix
1.17 Epilog
1.18 Exercises
2 Ethernet
2.1 10-Mbps classic Ethernet
2.2 100 Mbps (Fast) Ethernet
2.3 Gigabit Ethernet
2.4 Ethernet Switches
2.5 Spanning Tree Algorithm
2.6 Virtual LAN (VLAN)
2.7 Epilog
2.8 Exercises
3 Other LANs
4 Links
4.1 Encoding and Framing
4.2 Time-Division Multiplexing
4.3 Epilog
4.4 Exercises
5 Packets
5.1 Packet Delay
5.2 Packet Delay Variability
5.3 Packet Size
5.4 Error Detection
5.5 Epilog
5.6 Exercises
6 Abstract Sliding Windows
7 IP version 4
7.1 The IPv4 Header
7.2 Interfaces
7.3 Special Addresses
7.4 Fragmentation
7.5 The Classless IP Delivery Algorithm
7.6 IP Subnets
7.7 Address Resolution Protocol: ARP
7.8 Dynamic Host Configuration Protocol (DHCP)
7.9 Internet Control Message Protocol
7.10 Unnumbered Interfaces
7.11 Mobile IP
7.12 Epilog
7.13 Exercises
8 IP version 6
9 Routing-Update Algorithms
9.1 Distance-Vector Routing-Update Algorithm
9.2 Distance-Vector Slow-Convergence Problem
9.3 Observations on Minimizing Route Cost
9.4 Loop-Free Distance Vector Algorithms
9.5 Link-State Routing-Update Algorithm
9.6 Routing on Other Attributes
9.7 Epilog
9.8 Exercises
10 Large-Scale IP Routing
10.1 Classless Internet Domain Routing: CIDR
10.2 Hierarchical Routing
10.3 Legacy Routing
10.4 Provider-Based Routing
10.5 Geographical Routing
10.6 Border Gateway Protocol, BGP
10.7 Epilog
10.8 Exercises
11 UDP Transport
11.1 User Datagram Protocol UDP
11.2 Fundamental Transport Issues
11.3 Trivial File Transport Protocol, TFTP
11.4 Remote Procedure Call (RPC)
11.5 Epilog
11.6 Exercises
12 TCP Transport
12.1 The End-to-End Principle
12.2 TCP Header
12.3 TCP Connection Establishment
12.4 TCP and WireShark
12.5 TCP simplex-talk
13 TCP Reno and Congestion Management
15.2 RTTs
15.3 Highspeed TCP
15.4 TCP Vegas
15.5 FAST TCP
15.6 TCP Westwood
15.7 TCP Veno
15.8 TCP Hybla
15.9 TCP Illinois
15.10 H-TCP
15.11 TCP CUBIC
15.12 Epilog
15.13 Exercises
19 Quality of Service
19.1 Net Neutrality
Bibliography
Index
Peter L Dordal
Department of Computer Science
Loyola University Chicago
0 PREFACE
No man but a blockhead ever wrote, except for money. - Samuel Johnson
The textbook world is changing. On the one hand, open source software and creative-commons licensing
have been great successes; on the other hand, unauthorized PDFs of popular textbooks are widely available,
and it is time to consider flowing with rather than fighting the tide. Hence this open textbook, released for
free under the Creative Commons license described below. Mene, mene, tekel pharsin.
Perhaps the last straw, for me, was patent 8195571 for a roundabout method to force students to purchase
textbooks. (A simpler strategy might be to include the price of the book in the course.) At some point,
faculty have to be advocates for their students rather than, well, Hirudinea.
This is not to say that I have anything against for-profit publishing. It is just that this particular book does not
and will not belong to that category. In this it is in good company: there is Wikipedia, there is Gnu/Linux,
and there is an increasing number of other free online textbooks out there. The market inefficiencies of
traditional publishing are sobering: the return to authors of advanced textbooks is at best modest, and costs
to users are quite high.
This text is released under the Creative Commons license Attribution-NonCommercial-NoDerivs; this text
is like a conventional book, in other words, except that it is free. You may copy the work and distribute it
to others, but reuse requires attribution. Creation of derivative works (eg, modifying chapters or creating
additional chapters) and distributing them as part of this work also requires permission.
The work may not be used for commercial purposes without permission. Permission is likely to be granted
for use and distribution of all or part of the work in for-profit and commercial training programs, provided
there is no direct charge to recipients for the work and provided the free nature of the work is made clear to
recipients (eg by including this preface). However, such permission must always be requested. Alternatively,
participants in commercial programs may be instructed to download the work individually.
The official book website (potentially subject to change) is [Link]. The book is available
there as online html, as a zipped archive of html files, in .pdf format, and in other formats as may prove
useful.
The book can also be used as a networks supplement or companion to other resources for a variety of
other courses that overlap to some greater or lesser degree with networking. At Loyola, earlier versions of
this material have been used coupled with a second textbook in courses in computer security, network
management, telecommunications, and even introduction-to-computing courses for non-majors. Another
possibility is an alternative or nontraditional presentation of networking itself. It is when used in concert
with other works, in particular, that this book's being free is of marked advantage.
Finally, I hope the book may also be useful as a reference work. To this end, I have attempted to ensure that
the indexing and cross-referencing is sufficient to support the drop-in reader. Similarly, obscure notation is
kept to a minimum.
Much is sometimes made, in the world of networking textbooks, about top-down versus bottom-up sequencing. This book is not really either, although the chapters are mostly numbered in bottom-up fashion.
Instead, the first chapter provides a relatively complete overview of the LAN, IP and transport network layers
(along with a few other things), allowing subsequent chapters to refer to all network layers without forward
reference, and, more importantly, allowing the chapters to be covered in a variety of different orders. As a
practical matter, when I use this text to teach Loyola's Introduction to Computer Networks course, I cover
the IP and TCP material more or less in parallel.
A distinctive feature of the book is the extensive coverage of TCP: TCP dynamics, newer versions of TCP
such as TCP Cubic, and a chapter on using the ns-2 simulator to explore actual TCP behavior. This has
its roots in a longstanding goal to find better ways to present competition and congestion in the classroom.
Another feature is the detailed chapter on queuing disciplines.
One thing this book makes little attempt to cover in detail is the application layer; the token example included is SNMP. While SNMP actually makes a pretty good example of a self-contained application, my
recommendation to instructors who wish to cover more familiar examples is to combine this text with the
appropriate application documentation.
For those interested in using the book for a traditional networks course, I with some trepidation offer the
following set of core material. In solidarity with those who prefer alternatives to a bottom-up ordering, I
emphasize that this represents a set and not a sequence.
1 An Overview of Networks
Selected sections from 2 Ethernet, particularly switched Ethernet
Selected sections from 3.3 Wi-Fi
Selected sections from 5 Packets
6 Abstract Sliding Windows
7 IP version 4 and/or 8 IP version 6
Selected sections from 9 Routing-Update Algorithms and 10 Large-Scale IP Routing
11 UDP Transport
12 TCP Transport
13 TCP Reno and Congestion Management
With some care in the topic-selection details, the above can be covered in one semester along with a survey
of selected important network applications, or the basics of network programming, or the introductory
configuration of switches and routers, or coverage of additional material from this book, or some other set of
additional topics. Of course, non-traditional networks courses may focus on a quite different set of topics.
Peter Dordal
Shabbona, Illinois
0.3 Technical considerations

The text uses a number of special characters. If no available browser displays these properly, I recommend
the pdf or epub formats. Generally Firefox handles the necessary characters out of the box, as does Internet
Explorer, but Chrome does not.
The diagrams in the body of the text are now all in bitmap .png format, although a few diagrams rendered
with line-drawing characters still appear in the exercises. I would prefer to use the vector-graphics .svg
format, but as of January 2014 most browsers do not appear to support zooming in on .svg images, which is
really the whole point.
1 AN OVERVIEW OF NETWORKS
Somewhere there might be a field of interest in which the order of presentation of topics is well agreed upon.
Computer networking is not it.
There are many interconnections in the field of networking, as in most technical fields, and it is difficult
to find an order of presentation that does not involve endless forward references to future chapters; this
is true even if (as is done here) a largely bottom-up ordering is followed. I have therefore taken here a
different approach: this first chapter is a summary of the essentials (LANs, IP and TCP) across the board,
and later chapters expand on the material here.
Local Area Networks, or LANs, are the physical networks that provide the connection between machines
within, say, a home, school or corporation. LANs are, as the name says, local; it is the IP, or Internet
Protocol, layer that provides an abstraction for connecting multiple LANs into, well, the Internet. Finally,
TCP deals with transport and connections and actually sending user data.
This chapter also contains some important other material. The section on datagram forwarding, central
to packet-based switching and routing, is essential. This chapter also discusses packets generally, congestion, and sliding windows, but those topics are revisited in later chapters. Firewalls and network address
translation are also covered here and not elsewhere.
1.1 Layers
These three topics (LANs, IP and TCP) are often called layers; they constitute the Link layer, the Internetwork layer, and the Transport layer respectively. Together with the Application layer (the software you use),
these form the four-layer model for networks. A layer, in this context, corresponds strongly to the idea
of a programming interface or library (though some of the layers are not accessible to ordinary users): an
application hands off a chunk of data to the TCP library, which in turn makes calls to the IP library, which
in turn calls the LAN layer for actual delivery.
The LAN layer is in charge of actual delivery of packets, using LAN-layer-supplied addresses. It is often
conceptually subdivided into the physical layer dealing with, eg, the analog electrical, optical or radio
signaling mechanisms involved, and above that an abstracted logical LAN layer that describes all the
digital (that is, non-analog) operations on packets; see 2.1.2 The LAN Layer. The physical layer is
generally of direct concern only to those designing LAN hardware; the kernel software interface to the LAN
corresponds to the logical LAN layer. This LAN physical/logical division gives us the Internet five-layer
model. This is less a formal hierarchy than an ad hoc classification method. We will return to this below in
1.15 IETF and OSI.
transmission rate, taking into account things like transmission overhead, protocol inefficiencies and perhaps
even competing traffic. It is generally measured at a higher network layer than the data rate.
The term bandwidth can be used to refer to either of these, though we here try to use it mostly as a synonym
for data rate. The term comes from radio transmission, where the width of the frequency band available is
proportional, all else being equal, to the data rate that can be achieved.
In discussions about TCP, the term goodput is sometimes used to refer to what might also be called
application-layer throughput: the amount of usable data delivered to the receiving application. Specifically, retransmitted data is counted only once when calculating goodput but might be counted twice under
some interpretations of throughput.
Data rates are generally measured in kilobits per second (Kbps) or megabits per second (Mbps); in the
context of data rates, a kilobit is 10³ bits (not 2¹⁰) and a megabit is 10⁶ bits. The use of the lowercase b
means bits; data rates expressed in terms of bytes often use an upper-case B.
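As a concrete check on these conventions, the following Python sketch (the link speed and packet size are illustrative values, not taken from the text) computes how long one packet takes to transmit:

```python
# Transmission time of one packet, using decimal (SI) units:
# 1 Mbps = 10**6 bits/sec, not 2**20.

def transmit_time(packet_bytes, rate_bps):
    """Seconds needed to clock packet_bytes onto a link of rate_bps bits/sec."""
    return (packet_bytes * 8) / rate_bps

# A 1500-byte Ethernet packet at 100 Mbps:
t = transmit_time(1500, 100 * 10**6)
print(t)   # 0.00012 seconds, ie 120 microseconds
```

Note the factor of 8: packet sizes are quoted in bytes (B), while link rates are quoted in bits (b).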
1.3 Packets
Packets are modest-sized buffers of data, transmitted as a unit through some shared set of links. Of necessity,
packets need to be prefixed with a header containing delivery information. In the common case known as
datagram forwarding, the header contains a destination address; headers in networks using so-called
virtual-circuit forwarding contain instead an identifier for the connection. Almost all networking today
(and for the past 50 years) is packet-based, although we will later look briefly at some circuit-switched
options for voice telephony.
At the LAN layer, packets can be viewed as the imposition of a buffer (and addressing) structure on top
of low-level serial lines; additional layers then impose additional structure. Informally, packets are often
referred to as frames at the LAN layer, and as segments at the Transport layer.
The maximum packet size supported by a given LAN (eg Ethernet, Token Ring or ATM) is an intrinsic
attribute of that LAN. Ethernet allows a maximum of 1500 bytes of data. By comparison, TCP/IP packets
originally often held only 512 bytes of data, while early Token Ring packets could contain up to 4KB of
data. While there are proponents of very large packet sizes, larger even than 64KB, at the other extreme the
ATM (Asynchronous Transfer Mode) protocol uses 48 bytes of data per packet, and there are good reasons
for believing in modest packet sizes.
One potential issue is how to forward packets from a large-packet LAN to (or through) a small-packet LAN;
in later chapters we will look at how the IP (or Internet Protocol) layer addresses this.
Generally each layer adds its own header. Ethernet headers are typically 14 bytes, IP headers 20 bytes, and
TCP headers 20 bytes. If a TCP connection sends 512 bytes of data per packet, then the headers amount to
10% of the total, a not-unreasonable overhead. For one common Voice-over-IP option, packets contain 160
bytes of data and 54 bytes of headers, making the header about 25% of the total. Compressing the 160 bytes
of audio, however, may bring the data portion down to 20 bytes, meaning that the headers are now 73% of
the total; see 19.11.4 RTP and VoIP.
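The overhead figures above are easy to reproduce; this short sketch (illustrative, not from the text) recomputes the header share for each of the packets just mentioned:

```python
def header_fraction(data_bytes, header_bytes):
    """Fraction of the total packet occupied by headers."""
    return header_bytes / (data_bytes + header_bytes)

# 512 bytes of data behind 14+20+20 = 54 bytes of Ethernet+IP+TCP headers:
print(round(header_fraction(512, 54), 2))   # 0.1, ie about 10%
# VoIP: 160 bytes of audio data, 54 bytes of headers:
print(round(header_fraction(160, 54), 2))   # 0.25
# Compressed VoIP: the audio shrinks to 20 bytes, the headers do not:
print(round(header_fraction(20, 54), 2))    # 0.73
```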
In datagram-forwarding networks the appropriate header will contain the address of the destination and
perhaps other delivery information. Internal nodes of the network called routers or switches will then make
sure that the packet is delivered to the requested destination.
The concept of packets and packet switching was first introduced by Paul Baran in 1962 ([PB62]). Baran's
primary concern was with network survivability in the event of node failure; existing centrally switched
protocols were vulnerable to central failure. In 1964, Donald Davies independently developed many of the
same concepts; it was Davies who coined the term packet.
It is perhaps worth noting that packets are buffers built of 8-bit bytes, and all hardware today agrees what
a byte is (hardware agrees by convention on the order in which the bits of a byte are to be transmitted).
8-bit bytes are universal now, but it was not always so. Perhaps the last great non-byte-oriented hardware
platform, which did indeed overlap with the Internet era broadly construed, was the DEC-10, which had a
36-bit word size; a word could hold five 7-bit ASCII characters. The early Internet specifications introduced
the term octet (an 8-bit byte) and required that packets be sequences of octets; non-octet-oriented hosts had
to be able to convert. Thus was chaos averted. Note that there are still byte-oriented data issues; as one
example, binary integers can be represented as a sequence of bytes in either big-endian or little-endian byte
order. RFC 1700 specifies that Internet protocols use big-endian byte order, therefore sometimes called
network byte order.
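Python's struct module makes the two byte orders easy to see (a sketch; the 32-bit value is arbitrary):

```python
import struct

n = 0x01020304
big    = struct.pack(">I", n)   # network (big-endian) byte order
little = struct.pack("<I", n)   # little-endian byte order

print(big.hex())      # 01020304 -- most-significant byte first
print(little.hex())   # 04030201 -- least-significant byte first
```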
S1:
    destination   next_hop
    A             0
    C             1
    B             2
    D             2
    E             2
The table for S2 might be as follows, where we have consolidated destinations A and C for visual simplicity.
S2:
    destination   next_hop
    A,C           0
    D             1
    E             2
    B             3
Alternatively, we could replace the interface information with next-node, or neighbor, information, as all
the links above are point-to-point and so each interface connects to a unique neighbor. In that case, S1's
table might be written as follows (with consolidation of the entries for B, D and E):
S1:
    destination   next_hop
    A             A
    C             C
    B,D,E         S2
A central feature of datagram forwarding is that each packet is forwarded in isolation; the switches involved do not have any awareness of any higher-layer logical connections established between endpoints.
This is also called stateless forwarding, in that the forwarding tables have no per-connection state. RFC
1122 put it this way (in the context of IP-layer datagram forwarding):
To improve robustness of the communication system, gateways are designed to be stateless,
forwarding each IP datagram independently of other datagrams. As a result, redundant paths can
be exploited to provide robust service in spite of failures of intervening gateways and networks.
Datagram forwarding is sometimes allowed to use other information beyond the destination address. In
theory, IP routing can be done based on the destination address and some quality-of-service information,
allowing, for example, different routing to the same destination for high-bandwidth bulk traffic and for low-latency real-time traffic. In practice, many ISPs ignore quality-of-service information in the IP header, and
route only based on the destination.
By convention, switching devices acting at the LAN layer and forwarding packets based on the LAN address
are called switches (or, in earlier days, bridges), while such devices acting at the IP layer and forwarding
on the IP address are called routers. Datagram forwarding is used both by Ethernet switches and by IP
routers, though the destinations in Ethernet forwarding tables are individual nodes while the destinations in
IP routers are entire networks (that is, sets of nodes).
In IP routers within end-user sites it is common for a forwarding table to include a catchall default entry,
matching any IP address that is nonlocal and so needs to be routed out into the Internet at large. Unlike the
consolidated entries for B, D and E in the table above for S1, which likely would have to be implemented as
actual separate entries, a default entry is a single record representing where to forward the packet if no other
destination match is found. Here is a forwarding table for S1, above, with a default entry replacing the last
three entries:
S1:
    destination   next_hop
    A             0
    C             1
    default       2
Default entries make sense only when we can tell by looking at an address that it does not represent a
nearby node. This is common in IP networks because an IP address encodes the destination network, and
routers generally know all the local networks. It is however rare in Ethernets, because there is generally
no correlation between Ethernet addresses and locality. If S1 above were an Ethernet switch, and it had
some means of knowing that interfaces 0 and 1 connected directly to individual hosts (not switches), and S1
knew the addresses of these hosts, then making interface 2 a default route would make sense. In practice,
however, Ethernet switches do not know what kind of device connects to a given interface.
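As a sketch of datagram forwarding with a default entry, the S1 table above can be modeled as a simple dictionary lookup (the interface numbers are those from the table; everything else is illustrative):

```python
# S1's forwarding table with a catchall default: A -> 0, C -> 1,
# and any other destination is forwarded out interface 2.
forwarding_table = {"A": 0, "C": 1}
DEFAULT_INTERFACE = 2

def next_hop(destination):
    # Look for an explicit entry first; fall back to the default.
    return forwarding_table.get(destination, DEFAULT_INTERFACE)

print(next_hop("A"))   # 0
print(next_hop("E"))   # 2: no explicit entry, so the default is used
```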
1.5 Topology
In the network diagrammed in the previous section, there are no loops; graph theorists might describe this
by saying the network graph is acyclic, or is a tree. In a loop-free network there is a unique path between
any pair of nodes. The forwarding-table algorithm has only to make sure that every destination appears in
the forwarding tables; the issue of choosing between alternative paths does not arise.
However, if there are no loops then there is no redundancy: any broken link will result in partitioning the
network into two pieces that cannot communicate. All else being equal (which it is not, but never mind for
now), redundancy is a good thing. However, once we start including redundancy, we have to make decisions
among the multiple paths to a destination. Consider, for a moment, the following network:
D might feel that the best path to B is D–E–C–B (perhaps because it believes the A–D link is to be avoided).
If E similarly decides the best path to B is E–D–A–B, and if D and E both choose their next_hop for B
based on these best paths, then a linear routing loop is formed: D routes to B via E and E routes to B via D.
Although each of D and E has identified a usable path, that path is not in fact followed. Moral: successful
datagram routing requires cooperation and a consistent view of the network.
1.7 Congestion
Switches introduce the possibility of congestion: packets arriving faster than they can be sent out. This can
happen with just two interfaces, if the inbound interface has a higher bandwidth than the outbound interface;
another common source of congestion is traffic arriving on multiple inputs and all destined for the same
output.
Whatever the reason, if packets are arriving for a given outbound interface faster than they can be sent, a
queue will form for that interface. Once that queue is full, packets will be dropped. The most common
strategy (though not the only one) is to drop any packets that arrive when the queue is full.
The term congestion may refer either to the point where the queue is just beginning to build up, or to the
point where the queue is full and packets are lost. In their paper [CJ89], Chiu and Jain refer to the first point
as the knee; this is where the slope of the load vs. throughput graph flattens. They refer to the second point as
the cliff; this is where packet losses may lead to a precipitous decline in throughput. Other authors use the
term contention for knee-congestion.
In the Internet, most packet losses are due to congestion. This is not because congestion is especially bad
(though it can be, at times), but rather that other types of losses (eg due to packet corruption) are insignificant
by comparison.
When to Upgrade?
Deciding when a network really does have insufficient bandwidth is not a technical issue but an economic one. The number of customers may increase, the cost of bandwidth may decrease, or customers
may simply be willing to pay more to have data transfers complete in less time; customers here can
be external or in-house. Monitoring of links and routers for congestion can, however, help determine
exactly what parts of the network would most benefit from upgrade.
We emphasize that the presence of congestion does not mean that a network has a shortage of bandwidth.
Bulk-traffic senders (though not real-time senders) attempt to send as fast as possible, and congestion is
simply the network's feedback that the maximum transmission rate has been reached.
Congestion is a sign of a problem in real-time networks, which we will consider in 19 Quality of Service.
In these networks losses due to congestion must generally be kept to an absolute minimum; one way to
achieve this is to limit the acceptance of new connections unless sufficient resources are available.
Forwarding delay is hard to avoid (though some switches do implement cut-through switching to begin
forwarding a packet before it has fully arrived), but if one is sending a long train of packets then by keeping
multiple packets en route at the same time one can essentially eliminate the significance of the forwarding
delay; see 5.3 Packet Size.
Total packet delay from sender to receiver is the sum of the following:

• Bandwidth delay, ie sending 1000 Bytes at 20 Bytes/millisecond will take 50 ms. This is a per-link delay.

• Propagation delay due to the speed of light. For example, if you start sending a packet right now on a 5000-km cable across the US with a propagation speed of 200 m/µsec (= 200 km/ms, about 2/3 the speed of light in vacuum), the first bit will not arrive at the destination until 25 ms later. The bandwidth delay then determines how much after that the entire packet will take to arrive.

• Store-and-forward delay, equal to the sum of the bandwidth delays out of each router along the path.

• Queuing delay, or waiting in line at busy routers. At bad moments this can exceed 1 sec, though that is rare. Generally it is less than 10 ms and often is less than 1 ms. Queuing delay is the only delay component amenable to reduction through careful engineering.
See 5.1 Packet Delay for more details.
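The components above can be added up in a short sketch, using the numbers from the text; the router count and the queuing figure are assumptions chosen for illustration:

```python
# Total one-way delay for a single packet, summing the four components.
packet_size   = 1000      # bytes
bandwidth     = 20        # bytes per millisecond, per link
distance      = 5000      # km, e.g. a cable across the US
prop_speed    = 200       # km per millisecond (about 2/3 the speed of light)
num_routers   = 2         # store-and-forward hops along the path (assumed)
queuing_delay = 1.0       # ms; a typical light-load figure (assumed)

bandwidth_delay   = packet_size / bandwidth                   # 50 ms, first link
propagation_delay = distance / prop_speed                     # 25 ms end to end
store_and_forward = num_routers * (packet_size / bandwidth)   # 50 ms per router

total = bandwidth_delay + propagation_delay + store_and_forward + queuing_delay
print(total)   # 176.0 ms
```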
Time and Collisions. While Ethernet collisions definitely reduce throughput, in the larger view they should
perhaps be thought of as a part of a remarkably inexpensive shared-access mediation protocol.
In unswitched Ethernets every packet is received by every host, and it is up to the network card in each host
to determine if the arriving packet is addressed to that host. It is almost always possible to configure the
card to forward all arriving packets to the attached host; this poses a security threat, and password sniffers
that surreptitiously collected passwords via such eavesdropping used to be common.
Password Sniffing
In the fall of 1994 at Loyola University I remotely changed the root password on several CS-department
unix machines at the other end of campus, using telnet. I told no one. Within two hours, someone else
logged into one of these machines, using the new password, from a host in Europe. Password sniffing
was the likely culprit.
Two months later was the so-called Christmas Day Attack (12.9.1 ISNs and spoofing). One of the
hosts used to launch this attack was Loyola's hacked [Link]. It is unclear to what degree
password sniffing played a role in that exploit.
Due to both privacy and efficiency concerns, almost all Ethernets today are fully switched; this ensures that
each packet is delivered only to the host to which it is addressed. One advantage of switching is that it
effectively eliminates most Ethernet collisions; while in principle it replaces them with a queuing issue, in
practice Ethernet switch queues so seldom fill up that they are almost invisible even to network managers
(unlike IP router queues). Switching also prevents host-based eavesdropping, though arguably a better
solution to this problem is encryption. Perhaps the more significant tradeoff with switches, historically, was
that Once Upon A Time they were expensive and unreliable; tapping directly into a common cable was dirt
cheap.
Ethernet addresses are six bytes long. Each Ethernet card (or network interface) is assigned a (supposedly)
unique address at the time of manufacture; this address is burned into the card's ROM and is called the card's
physical address or hardware address or MAC (Media Access Control) address. The first three bytes of
the physical address have been assigned to the manufacturer; the subsequent three bytes are a serial number
assigned by that manufacturer.
By comparison, IP addresses are assigned administratively by the local site. The basic advantage of having
addresses in hardware is that hosts automatically know their own addresses on startup; no manual configuration or server query is necessary. It is not unusual for a site to have a large number of identically configured
workstations, for which all network differences derive ultimately from each workstation's unique Ethernet
address.
The network interface continually monitors all arriving packets; if it sees any packet containing a destination
address that matches its own physical address, it grabs the packet and forwards it to the attached CPU (via a
CPU interrupt).
Ethernet also has a designated broadcast address. A host sending to the broadcast address has its packet
received by every other host on the network; if a switch receives a broadcast packet on one port, it forwards
the packet out every other port. This broadcast mechanism allows host A to contact host B when A does
not yet know B's physical address; typical broadcast queries have forms such as "Will the designated server
please answer" or (from the ARP protocol) "will the host with the given IP address please tell me your
physical address".
Traffic addressed to a particular host (that is, not broadcast) is said to be unicast.
Because Ethernet addresses are assigned by the hardware, knowing an address does not provide any direct
indication of where that address is located on the network. In switched Ethernet, the switches must thus have
a forwarding-table record for each individual Ethernet address on the network; for extremely large networks
this ultimately becomes unwieldy. Consider the analogous situation with postal addresses: Ethernet is
somewhat like attempting to deliver mail using social-security numbers as addresses, where each postal
worker is provided with a large catalog listing each person's SSN together with their physical location. Real
postal mail is, of course, addressed hierarchically using ever-more-precise specifiers: state, city, zipcode,
street address, and name / room#. Ethernet, in other words, does not scale well to large sizes.
Switched Ethernet works quite well, however, for networks with up to 10,000-100,000 nodes. Forwarding
tables with size in that range are straightforward to manage.
To forward packets correctly, switches must know where all active destination addresses in the LAN are
located; Ethernet switches do this by a passive learning algorithm. (IP routers, by comparison, use active
protocols.) Typically a host physical address is entered into a switch's forwarding table when a packet from
that host is first received; the switch notes the packet's arrival interface and source address and assumes
that the same interface is to be used to deliver packets back to that sender. If a given destination address
has not yet been seen, and thus is not in the forwarding table, Ethernet switches still have the backup
delivery option of forwarding to everyone, by treating the destination address like the broadcast address,
and allowing the host Ethernet cards to sort it out. Since this broadcast-like process is not generally used
for more than one packet (after that, the switches will have learned the correct forwarding-table entries), the
risk of eavesdropping is minimal.
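The passive learning algorithm just described can be sketched as follows (the port numbers and addresses are hypothetical; real switches also age out stale entries, which is omitted here):

```python
# Sketch of an Ethernet switch's passive learning algorithm.
forwarding_table = {}   # maps source address -> arrival port

def handle_packet(src, dst, arrival_port, all_ports):
    # Learn: the sender must be reachable via the arrival port.
    forwarding_table[src] = arrival_port
    if dst in forwarding_table:
        return [forwarding_table[dst]]          # known: forward to one port
    # Unknown destination: fall back to flooding out every other port,
    # treating the destination like the broadcast address.
    return [p for p in all_ports if p != arrival_port]

ports = [0, 1, 2, 3]
print(handle_packet("A", "B", 0, ports))   # B unknown: flood to [1, 2, 3]
print(handle_packet("B", "A", 1, ports))   # A was learned: forward to [0]
```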
The ⟨host,interface⟩ forwarding table is often easier to think of as ⟨host,next_hop⟩, where the next_hop node
is whatever switch or host is at the immediate other end of the link connecting to the given interface. In a
fully switched network where each link connects only two interfaces, the two perspectives are equivalent.
    first byte   network bits   host bits   name      application
    0-127        8              24          class A   a few very large networks
    128-191     16              16          class B   institution-sized networks
    192-223     24               8          class C   sized for smaller entities
For example, the original IP address allocation for Loyola University Chicago was [Link], a class B.
In binary, 147 is 10010011. The network/host division point is not carried within the IP header; in fact,
nowadays the division into network and host is dynamic, and can be made at different positions in the
address at different levels of the network.
IP addresses, unlike Ethernet addresses, are administratively assigned. Once upon a time, you would get
your Class B network prefix from the Internet Assigned Numbers Authority, or IANA (they now delegate
this task), and then you would in turn assign the host portion in a way that was appropriate for your local
site. As a result of this administrative assignment, an IP address usually serves not just as an endpoint
identifier but also as a locator, containing embedded location information.
The Class A/B/C definition above was spelled out in 1981 in RFC 791, which introduced IP. Class D was
added in 1986 by RFC 988; class D addresses must begin with the bits 1110. These addresses are for
multicast, that is, sending an IP packet to every member of a set of recipients (ideally without actually
transmitting it more than once on any one link).
The network portion of an IP address is sometimes called the network number or network address or
network prefix; as we shall see below, most forwarding decisions are made using only the network portion.
It is commonly denoted by setting the host bits to zero and ending the resultant address with a slash followed
by the number of network bits in the address: eg [Link]/8 or [Link]/16. Note that [Link]/8 and
[Link]/9 represent different things; in the latter, the second byte of any host address extending the network
address is constrained to begin with a 0-bit. An anonymous block of IP addresses might be referred to only
by the slash and following digit, eg "we need a /22 block to accommodate all our customers".
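Python's ipaddress module can illustrate the slash notation; the example address below is arbitrary, not one from the text:

```python
import ipaddress

# A /n suffix gives the number of network bits; the remaining host
# bits are zeroed out in the network address.
net24 = ipaddress.ip_network("200.0.1.37/24", strict=False)
net16 = ipaddress.ip_network("200.0.1.37/16", strict=False)

print(net24)   # 200.0.1.0/24 -- host bits (last byte) zeroed
print(net16)   # 200.0.0.0/16 -- host bits (last two bytes) zeroed
print(ipaddress.ip_address("200.0.1.37") in net24)   # True
```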
All hosts with the same network address (same network bits) must be located together on the same LAN; as
we shall see below, if two hosts share the same network address then they will assume they can reach each
other directly via the underlying LAN, and if they cannot then connectivity fails. A consequence of this rule
is that outside of the site only the network bits need to be looked at to route a packet to the site.
Each individual LAN technology has a maximum packet size it supports; for example, Ethernet has a maximum packet size of about 1500 bytes but the once-competing Token Ring had a maximum of 4 KB. Today
the world has largely standardized on Ethernet and almost entirely standardized on Ethernet packet-size limits, but this was not the case when IP was introduced and there was real concern that two hosts on separate
large-packet networks might try to exchange packets too large for some small-packet intermediate network
to carry.
Therefore, in addition to routing and addressing, the decision was made that IP must also support fragmentation: the division of large packets into multiple smaller ones (in other contexts this may also be called
segmentation). The IP approach is not very efficient, and IP hosts go to considerable lengths to avoid fragmentation. IP does require that packets of up to 576 bytes be supported, and so a common legacy strategy
was for a host to limit a packet to at most 512 user-data bytes whenever the packet was to be sent via a
router; packets addressed to another host on the same LAN could of course use a larger packet size. Despite
its limited use, however, fragmentation is essential conceptually, in order for IP to be able to support large
packets without knowing anything about the intervening networks.
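As a rough sketch of the fragmentation arithmetic (ignoring IP's actual rule that non-final fragment payloads be multiples of 8 bytes), the number of fragments is the data size divided by what fits under the small network's packet-size limit:

```python
import math

def num_fragments(data_size, mtu, header_size=20):
    # Each fragment carries its own IP header, so the data carried
    # per fragment is the packet-size limit minus the header size.
    payload = mtu - header_size
    return math.ceil(data_size / payload)

# A 1480-byte data payload crossing a network with a 576-byte limit:
print(num_fragments(1480, 576))   # 3 fragments
```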
IP is a best effort system; there are no IP-layer acknowledgments or retransmissions. We ship the packet
off, and hope it gets there. Most of the time, it does.
Architecturally, this best-effort model represents what is known as connectionless networking: the IP layer
does not maintain information about endpoint-to-endpoint connections, and simply forwards packets like a
giant LAN. Responsibility for creating and maintaining connections is left for the next layer up, the TCP
layer. Connectionless networking is not the only way to do things: the alternative could have been some
form of connection-oriented internetworking, in which routers do maintain state information about individual
connections. Later, in 3.7 Virtual Circuits, we will examine how virtual-circuit networking can be used to
implement a connection-oriented approach; virtual-circuit switching is the primary alternative to datagram
switching.
Connectionless (IP-style) and connection-oriented networking each have advantages. Connectionless networking is conceptually more reliable: if routers do not hold connection state, then they cannot lose connection state. The path taken by the packets in some higher-level connection can easily be dynamically rerouted.
Finally, connectionless networking makes it hard for providers to bill by the connection; once upon a time
(in the era of dollar-a-minute phone calls) this was a source of mild astonishment to many new users. The
primary advantage of connection-oriented networking, however, is that the routers are then much better positioned to accept reservations and to make quality-of-service guarantees. This remains something of a
sore point in the current Internet: if you want to use Voice-over-IP, or VoIP, telephones, or if you want to
engage in video conferencing, your packets will be treated by the Internet core just the same as if they were
low-priority file transfers. There is no priority service option.
Perhaps the most common form of IP packet loss is router queue overflows, representing network congestion.
Packet losses due to packet corruption are rare (eg less than one in 10^4; perhaps much less). But in a
connectionless world a large number of hosts can simultaneously decide to send traffic through one router,
in which case queue overflows are hard to avoid.
1.10.1 IP Forwarding
IP routers use datagram forwarding, described in 1.4 Datagram Forwarding above, to deliver packets, but
the destination values listed in the forwarding tables are network prefixes (representing entire LANs)
instead of individual hosts. The goal of IP forwarding, then, becomes delivery to the correct LAN; a separate
process is used to deliver to the final host once the final LAN has been reached.
The entire point, in fact, of having a network/host division within IP addresses is so that routers need to list
only the network prefixes of the destination addresses in their IP forwarding tables. This strategy is the key
to IP scalability: it saves large amounts of forwarding-table space, it saves time as smaller tables allow faster
lookup, and it saves the bandwidth that would be needed for routers to keep track of individual addresses.
To get an idea of the forwarding-table space savings, there are currently (2013) around a billion hosts on the
Internet, but only 300,000 or so networks listed in top-level forwarding tables. When network prefixes are
used as forwarding-table destinations, matching an actual packet address to a forwarding-table entry is no
longer a matter of simple equality comparison; routers must compare appropriate prefixes.
IP forwarding tables are sometimes also referred to as routing tables; in this book, however, we make
at least a token effort to use forwarding to refer to the packet forwarding process, and routing to refer
to mechanisms by which the forwarding tables are maintained and updated. (If we were to be completely
consistent here, we would use the term forwarding loop rather than routing loop.)
Now let us look at a simple example of how IP forwarding (or routing) works. We will assume that all
network nodes are either hosts (user machines, with a single network connection) or routers, which do
packet-forwarding only. Routers are not directly visible to users, and always have at least two different
network interfaces representing different networks that the router is connecting. (Machines can be both
hosts and routers, but this introduces complications.)
Suppose A is the sending host, sending a packet to a destination host D. The IP header of the packet will
contain D's IP address in the destination address field (it will also contain A's own address as the source
address). The first step is for A to determine whether D is on the same LAN as itself or not; that is, whether
D is local. This is done by looking at the network part of the destination address, which we will denote by
Dnet. If this net address is the same as A's (that is, if it is equal numerically to Anet), then A figures D is on
the same LAN as itself, and can use direct LAN delivery. It looks up the appropriate physical address for D
(probably with the ARP protocol, 7.7 Address Resolution Protocol: ARP), attaches a LAN header to the
packet in front of the IP header, and sends the packet straight to D via the LAN.
If, however, Anet and Dnet do not match (D is non-local), then A looks up a router to use. Most ordinary
hosts use only one router for all non-local packet deliveries, making this choice very simple. A then forwards
the packet to the router, again using direct delivery over the LAN. The IP destination address in the packet
remains D in this case, although the LAN destination address will be that of the router.
When the router receives the packet, it strips off the LAN header but leaves the IP header with the IP
destination address. It extracts the destination D, and then looks at Dnet. The router first checks to see
if any of its network interfaces are on the same LAN as D; recall that the router connects to at least one
additional network besides the one for A. If the answer is yes, then the router uses direct LAN delivery to the
destination, as above. If, on the other hand, Dnet is not a LAN to which the router is connected directly, then
the router consults its internal forwarding table. This consists of a list of networks each with an associated
next_hop address. These ⟨net,next_hop⟩ tables compare with switched Ethernet's ⟨host,next_hop⟩ tables;
the former type will be smaller because there are many fewer nets than hosts. The next_hop addresses in the
table are chosen so that the router can always reach them via direct LAN delivery via one of its interfaces;
generally they are other routers. The router looks up Dnet in the table, finds the next_hop address, and uses
direct LAN delivery to get the packet to that next_hop machine. The packet's IP header remains essentially
unchanged, although the router most likely attaches an entirely new LAN header.
The packet continues being forwarded like this, from router to router, until it finally arrives at a router that
is connected to Dnet; it is then delivered by that final router directly to D, using the LAN.
To make this concrete, consider the following diagram:
With Ethernet-style forwarding, R2 would have to maintain entries for each of A,B,C,D,E,F. With IP forwarding, R2 has just two entries to maintain in its forwarding table: 200.0.0/24 and 200.0.1/24. If A sends
to D, at [Link], it puts this address into the IP header, notes that 200.0.0 ≠ 200.0.1, and thus concludes
D is not a local delivery. A therefore sends the packet to its router R1, using LAN delivery. R1 looks up the
destination network 200.0.1 in its forwarding table and forwards the packet to R2, which in turn forwards it
to R3. R3 now sees that it is connected directly to the destination network 200.0.1, and delivers the packet
via the LAN to D, by looking up D's physical address.
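The forwarding decisions in this walkthrough can be sketched as a prefix lookup in R2's table; the next_hop labels are router names from the example, and the linear prefix-matching loop is illustrative rather than how real routers are implemented:

```python
import ipaddress

# R2's forwarding table: the destinations are network prefixes,
# not individual hosts.
r2_table = {
    ipaddress.ip_network("200.0.0.0/24"): "R1",
    ipaddress.ip_network("200.0.1.0/24"): "R3",
}

def forward(dest_addr):
    dest = ipaddress.ip_address(dest_addr)
    for net, nh in r2_table.items():
        if dest in net:     # prefix match, not exact equality
            return nh
    return None             # no match; a real router might use a default

print(forward("200.0.1.37"))   # R3: toward D's network, 200.0.1/24
```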
In this diagram, IP addresses for the ends of the R1–R2 and R2–R3 links are not shown. They could be
assigned global IP addresses, but they could also use private IP addresses. Assuming these links are
point-to-point links, they might not actually need IP addresses at all; we return to this in 7.10 Unnumbered
Interfaces.
One can think of the network-prefix bits as analogous to the zip code on postal mail, and the host bits
as analogous to the street address. The internal parts of the post office get a letter to the right zip code,
and then an individual letter carrier gets it to the right address. Alternatively, one can think of the network
bits as like the area code of a phone number, and the host bits as like the rest of the digits. Newer protocols that support different net/host division points at different places in the network (sometimes called
hierarchical routing) allow support for addressing schemes that correspond to, say, zip/street/user, or
areacode/exchange/subscriber.
The Invertebrate Internet
Once upon a time, each leaf node connected through its provider to the backbone, and traffic between
any two nodes (or at least any two nodes not sharing a provider) passed through the backbone. The
backbone still carries a lot of traffic, but it is now also common for large providers such as Google
to connect (or peer) directly with large residential ISPs such as Comcast. See, for example, [Link].
We will refer to the Internet backbone as those IP routers that specialize in large-scale routing on the
commercial Internet, and which generally have forwarding-table entries covering all public IP addresses;
note that this is essentially a business definition rather than a technical one. We can revise the table-size
claim of the previous paragraph to state that, while there are many private IP networks, there are about
300,000 visible to the backbone. A forwarding table of 300,000 entries is quite feasible; a table a hundred
times larger is not, let alone a thousand times larger.
IP routers at non-backbone sites generally know all locally assigned network prefixes, eg 200.0.0/24 and
200.0.1/24 above. If a destination does not match any locally assigned network prefix, the packet needs
to be routed out into the Internet at large; for typical non-backbone sites this almost always means
the packet is sent to the ISP that provides Internet connectivity. Generally the local routers will contain a
catchall default entry covering all nonlocal networks; this means that the router needs an explicit entry only
for locally assigned networks. This greatly reduces the forwarding-table size. The Internet backbone can be
approximately described, in fact, as those routers that do not have a default entry.
For most purposes, the Internet can be seen as a combination of end-user LANs together with point-to-point
links joining these LANs to the backbone; point-to-point links also tie the backbone together. Both LANs
and point-to-point links appear in the diagram above.
Just how routers build their ⟨destnet,next_hop⟩ forwarding tables is a major topic itself, which we cover in
9 Routing-Update Algorithms. Unlike Ethernet, IP routers do not have a broadcast delivery mechanism
as a fallback, so the tables must be constructed in advance. (There is a limited form of IP broadcast, but it
is basically intended for reaching the local LAN only, and does not help at all with delivery in the event that
the network is unknown.)
Most forwarding-table-construction algorithms used on a set of routers under common management fall into
either the distance-vector or the link-state category. In the distance-vector approach, often used at smaller
sites, routers exchange information with their immediately neighboring routers; tables are built up this
way through a sequence of such periodic exchanges. In the link-state approach, routers rapidly propagate
information about the state of each link; all routers in the organization receive this link-state information and
each one uses it to build and maintain a map of the entire network. The forwarding table is then constructed from this map.
1.11 DNS
IP addresses are hard to remember (nearly impossible in IPv6). The domain name system, or DNS, comes
to the rescue by creating a way to convert hierarchical text names to IP addresses. Thus, for example, one can
type [Link] instead of [Link]. Virtually all Internet software uses the same basic library
calls to convert DNS names to actual addresses.
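Those library calls look roughly like this in Python (a sketch; "localhost" is used so the example resolves without network access):

```python
import socket

# getaddrinfo is the standard call for converting a hostname or DNS
# name to addresses; "localhost" resolves via the local hosts file.
results = socket.getaddrinfo("localhost", None)
addresses = {r[4][0] for r in results}
print(addresses)   # typically includes 127.0.0.1 and/or ::1
```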
One thing DNS makes possible is changing a website's IP address while leaving the name alone. This
allows moving a site to a new provider, for example, without requiring users to learn anything new. It
is also possible to have several different DNS names resolve to the same IP address, and through some
modest trickery have the http (web) server at that IP address handle the different DNS names as completely
different websites.
DNS is hierarchical and distributed; indeed, it is the classic example of a widely distributed database.
In looking up [Link] three different DNS servers may be queried: for [Link], for
[Link], and for .edu. Searching a hierarchy can be cumbersome, so DNS search results are normally
cached locally. If a name is not found in the cache, the lookup may take a couple seconds. The DNS
hierarchy need have nothing to do with the IP-address hierarchy.
Besides address lookups, DNS also supports a few other kinds of searches. The best known is probably
reverse DNS, which takes an IP address and returns a name. This is slightly complicated by the fact that
one IP address may be associated with multiple DNS names, so DNS must either return a list, or return one
name that has been designated the canonical name.
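The forward and reverse lookups just described correspond to standard library calls; here is a minimal Python sketch, using localhost so that no external DNS query is needed:

```python
import socket

# Forward lookup: name to IP address, via the same basic library calls
# virtually all Internet software uses.
addr = socket.gethostbyname("localhost")
print(addr)    # typically 127.0.0.1

# Reverse DNS: IP address back to a (canonical) name plus any aliases.
name, aliases, addresses = socket.gethostbyaddr(addr)
print(name)
```

Note that `gethostbyaddr()` returns the canonical name first, matching the discussion above of one address having multiple names.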
1.12 Transport
Think about what types of communications one might want over the Internet:

• Interactive communications such as via ssh or telnet, with long idle times between short bursts
• Bulk file transfers
• Request/reply operations, eg to query a database or to make DNS requests
• Real-time voice traffic, at (without compression) 8KB/sec, with constraints on the variation in delivery time (known as jitter; see 19.11.3 RTP Control Protocol for a specific numeric interpretation)
• Real-time video traffic. Even with substantial compression, video generally requires much more bandwidth than voice
While separate protocols might be used for each of these, the Internet has standardized on the Transmission
Control Protocol, or TCP, for the first three (though there are periodic calls for a new protocol addressing
the third item above), and TCP is sometimes pressed into service for the last two. TCP is thus the most
common transport layer for application data.
The IP layer is not well-suited to transport. IP routing is a best-effort mechanism, which means packets
can and do get lost sometimes. Data that does arrive can arrive out of order. The sender has to manage
division into packets; that is, buffering. Finally, IP only supports sending to a specific host; normally, one
wants to send to a given application running on that host. Email and web traffic, or two different web
sessions, should not be commingled!
TCP extends IP with the following features:
• reliability: TCP numbers each packet, keeps track of which are lost and retransmits them after a timeout, and holds early-arriving out-of-order packets for delivery at the correct time. Every arriving data packet is acknowledged by the receiver; timeout and retransmission occurs when an acknowledgment isn't received by the sender within a given time.
• connection-orientation: Once a TCP connection is made, an application sends data simply by writing to that connection. No further application-level addressing is needed.
• stream-orientation: The application can write 1 byte at a time, or 100KB at a time; TCP will buffer and/or divide up the data into appropriate sized packets.
• port numbers: these provide a way to specify the receiving application for the data, and also to identify the sending application.
• throughput management: TCP attempts to maximize throughput, while at the same time not contributing unnecessarily to network congestion.
TCP endpoints are of the form ⟨host,port⟩; these pairs are known as socket addresses, or sometimes as just "sockets", though the latter refers more properly to the operating-system objects that receive the data sent to the socket addresses. Servers (or, more precisely, server applications) listen for connections to sockets they have opened; the client is then any endpoint that initiates a connection to a server.
When you enter a host name in a web browser, it opens a TCP connection to the server's port 80 (the standard web-traffic port), that is, to the server socket with socket-address ⟨server,80⟩. If you have several browser tabs open, each might connect to the same server socket, but the connections are distinguishable by virtue of using separate ports (and thus having separate socket addresses) on the client end (that is, your end).
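The browser-tabs point can be demonstrated directly. In this sketch a loopback listener plays the role of the server's port 80 (a local stand-in, not a real web server); two client connections to the same server socket get distinct client-side ports:

```python
import socket

# A listening TCP socket standing in for the web server's port 80.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))      # kernel assigns a free port
server.listen(5)
server_port = server.getsockname()[1]

# Two "browser tabs": both connect to the same server socket address.
c1 = socket.create_connection(("127.0.0.1", server_port))
c2 = socket.create_connection(("127.0.0.1", server_port))

# The two connections are distinguished by their client-side ports.
p1, p2 = c1.getsockname()[1], c2.getsockname()[1]
print(p1 != p2)    # True: separate socket addresses on the client end
```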
A busy server may have thousands of connections to its port 80 (the web port) and hundreds of connections to port 25 (the email port). Web and email traffic are kept separate by virtue of the different ports used. All those clients to the same port, though, are kept separate because each comes from a unique ⟨host,port⟩ pair. A TCP connection is determined by the ⟨host,port⟩ socket address at each end; traffic on different connections does not intermingle. That is, there may be multiple independent connections to ⟨[Link],80⟩. This is somewhat analogous to certain business telephone numbers of the "operators are standing by" type, which support multiple callers at the same time to the same number. Each call is answered by a different operator (corresponding to a different cpu process), and different calls do not overhear each other.
TCP uses the sliding-windows algorithm, 6 Abstract Sliding Windows, to keep multiple packets en route
at any one time. The window size represents the number of packets simultaneously en route; if the window
size is 10, for example, then at any one time 10 packets are out there (perhaps 5 data packets and 5 returning
acknowledgments). As each acknowledgment arrives, the window slides forward and the data packet 10
packets ahead is sent. For example, consider the moment when the ten packets 20-29 are in transit. When
ACK[20] is received, Data[30] is sent, and so now packets 21-30 are in transit. When ACK[21] is received,
Data[31] is sent, so packets 22-31 are in transit.
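The bookkeeping in this example can be sketched in a few lines of Python; this is an illustration of the numbers above, not an implementation of TCP:

```python
WINDOW = 10
in_flight = list(range(20, 30))    # packets 20-29 currently en route

def ack_received(n):
    """ACK[n] arrives: the window slides and Data[n+WINDOW] is sent."""
    in_flight.remove(n)            # packet n is now acknowledged
    in_flight.append(n + WINDOW)   # the packet WINDOW ahead departs

ack_received(20)
print(in_flight)    # packets 21-30 in transit
ack_received(21)
print(in_flight)    # packets 22-31 in transit
```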
Sliding windows minimizes the effect of store-and-forward delays, and propagation delays, as these then
only count once for the entire windowful and not once per packet. Sliding windows also provides an automatic, if partial, brake on congestion: the queue at any switch or router along the way cannot exceed the
window size. In this it compares favorably with constant-rate transmission, which, if the available bandwidth falls below the transmission rate, always leads to a significant percentage of dropped packets. Of
course, if the window size is too large, a sliding-windows sender may also experience dropped packets.
The ideal window size, at least from a throughput perspective, is such that it takes one round-trip time to send
an entire window, so that the next ACK will always be arriving just as the sender has finished transmitting the
window. Determining this ideal size, however, is difficult; for one thing, the ideal size varies with network
load. As a result, TCP approximates the ideal size. The most common TCP strategy, that of so-called TCP Reno, is that the window size is slowly raised until packet loss occurs, which TCP takes as a sign that it
has reached the limit of available network resources. At that point the window size is reduced to half its
previous value, and the slow climb resumes. The effect is a sawtooth graph of window size with time,
which oscillates (more or less) around the optimal window size. For an idealized sawtooth graph, see
13.1.1 The Steady State; for some real (simulation-created) sawtooth graphs see 16.4.1 Some TCP Reno
cwnd graphs.
While this window-size-optimization strategy has its roots in attempting to maximize the available bandwidth, it also has the effect of greatly limiting the number of packet-loss events. As a result, TCP has come to be the Internet protocol charged with reducing (or at least managing) congestion on the Internet, and relatedly with ensuring fairness of bandwidth allocations to competing connections. Core Internet routers, at least in the classical case, essentially have no role in enforcing congestion or fairness restrictions at all.
The Internet, in other words, places responsibility for congestion avoidance cooperatively into the hands of
end users. While cheating is possible, this cooperative approach has worked remarkably well.
While TCP is ubiquitous, the real-time performance of TCP is not always consistent: if a packet is lost,
the receiving TCP host will not turn over anything further to the receiving application until the lost packet
has been retransmitted successfully; this is often called head-of-line blocking. This is a serious problem
for sound and video applications, which can discretely handle modest losses but which have much more
difficulty with sudden large delays. A few lost packets ideally should mean just a few brief voice dropouts
(pretty common on cell phones) or flicker/snow on the video screen (or just reuse of the previous frame);
both of these are better than pausing completely.
The basic alternative to TCP is known as UDP, for User Datagram Protocol. UDP, like TCP, provides port numbers to support delivery to multiple endpoints within the receiving host, in effect to a specific process on the host. As with TCP, a UDP socket consists of a ⟨host,port⟩ pair. UDP also includes, like TCP, a checksum
over the data. However, UDP omits the other TCP features: there is no connection setup, no lost-packet
detection, no automatic timeout/retransmission, and the application must manage its own packetization.
The Real-time Transport Protocol, or RTP, sits above UDP and adds some additional support for voice and
video applications.
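A minimal UDP exchange in Python illustrates the contrast with TCP: there is no connection setup, each sendto() is one datagram, and port numbers alone identify the endpoints. A loopback address is used here so the sketch is self-contained:

```python
import socket

# A receiving UDP socket; the kernel assigns a free port number.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(5)                  # don't block forever in a demo
host, port = receiver.getsockname()     # the receiver's <host,port> address

# The sender needs no connection setup: each sendto() is one datagram.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"hello", (host, port))

data, src = receiver.recvfrom(1500)     # src is the sender's <host,port>
print(data)    # b'hello'
```

If the datagram were lost, nothing here would retransmit it; that is exactly the behavior UDP leaves to the application (or to a protocol like RTP above it).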
1.13 Firewalls
One problem with having a program on your machine listening on an open TCP port is that someone may
connect and then, using some flaw in the software on your end, do something malicious to your machine.
Damage can range from the unintended downloading of personal data to compromise and takeover of your
entire machine, making it a distributor of viruses and worms or a steppingstone in later break-ins of other
machines.
A strategy known as buffer overflow has been the basis for a great many total-compromise attacks. The idea
is to identify a point in a server program where it fills a memory buffer with network-supplied data without
careful length checking; almost any call to the C library function gets(buf) will suffice. The attacker
then crafts an oversized input string which, when read by the server and stored in memory, overflows the
buffer and overwrites subsequent portions of memory, typically containing the stack-frame pointers. The
usual goal is to arrange things so that when the server reaches the end of the currently executing function,
control is returned not to the calling function but instead to the attacker's own payload code located within
the string.
A firewall is a program to block connections deemed potentially risky, eg those originating from outside
the site. Generally ordinary workstations do not ever need to accept connections from the Internet; client
machines instead initiate connections to (better-protected) servers. So blocking incoming connections works
pretty well; when necessary (eg for games) certain ports can be selectively unblocked.
The original firewalls were routers. Incoming traffic to servers was often blocked unless it was sent to one
of a modest number of open ports; for non-servers, typically all inbound connections were blocked. This
allowed internal machines to operate reasonably safely, though being unable to accept incoming connections
is sometimes inconvenient. Nowadays per-machine firewalls in addition to router-based firewalls are
common: you can configure your machine not to accept inbound connections to most (or all) ports regardless
of whether software on your machine requests such a connection. Outbound connections can, in many cases,
also be prevented.
1.14 Network Address Translation

remote host    remote port    inside host    inside port
C              80             A              3000
D              80             B              3000
A packet to C from ⟨A,3000⟩ would be rewritten by NR so that the source was ⟨NR,3000⟩. A packet from ⟨C,80⟩ addressed to ⟨NR,3000⟩ would be rewritten and forwarded to ⟨A,3000⟩. Similarly, a packet from ⟨D,80⟩ addressed to ⟨NR,3000⟩ would be rewritten and forwarded to ⟨B,3000⟩; the NAT table takes into account the sending socket address as well as the destination.
Now suppose B opens a connection to ⟨C,80⟩, also from inside port 3000. This time NR must remap the port number, because that is the only way to distinguish between packets from ⟨C,80⟩ to A and to B. The new table is
remote host    remote port    inside host    inside port
C              80             A              3000
D              80             B              3000
C              80             B              3000
Typically NR would not create TCP connections between itself and ⟨C,80⟩ and ⟨D,80⟩; the NAT table does forwarding but the endpoints of the connection are still at the inside hosts. However, NR might very well monitor the TCP connections to know when they have closed, so that the corresponding table entries can be removed.
It is common for Voice-over-IP (VoIP) telephony using the SIP protocol (RFC 3261) to prefer to use UDP
port 5060 at both ends. If a VoIP server is outside the NAT router (which must be the case as the server
must generally be publicly visible) and a telephone is inside, likely port 5060 will pass through without
remapping, though the telephone will have to initiate the connection. But if there are two phones inside, one
of them will appear to be connecting to the server from an alternative port.
VoIP systems run into a much more serious problem with NAT, however. A call ultimately between two phones is typically first negotiated between the phones' respective VoIP servers. Once the call is set up, the servers would prefer to step out of the loop, and have the phones exchange voice packets directly. The SIP protocol was designed to handle this by having each phone report to its respective server the UDP socket (⟨IP address,port⟩ pair) it intends to use for the voice exchange; the servers then report these phone sockets to each other, and from there to the opposite phones. This socket information is rendered incorrect by NAT, however: certainly the IP address, and quite likely the port as well. If only one of the phones is behind a
NAT firewall, it can initiate the voice connection to the other phone, but the other phone will see the voice
packets arriving from a different socket than promised and will likely not recognize them as part of the call.
If both phones are behind NAT firewalls, they will not be able to connect to one another at all. The common
solution is for the VoIP server of a phone behind a NAT firewall to remain in the communications path,
forwarding packets to its hidden partner.
If a site wants to make it possible to allow connections to hosts behind a NAT router or other firewall, one option is tunneling. This is the creation of a "virtual LAN link" that runs on top of a TCP connection between the end user and one of the site's servers; the end user can thus appear to be on one of the organization's internal LANs; see 3.1 Virtual Private Network. Another option is to open up a specific port: in
essence, a static NAT-table entry is made connecting a specific port on the NAT router to a specific internal
host and port (usually the same port). For example, all UDP packets to port 5060 on the NAT router might
be forwarded to port 5060 on internal host A, even in the absence of any prior packet exchange.
NAT routers work very well when the communications model is of client-side TCP connections, originating
from the inside and with public outside servers as destination. The NAT model works less well for peer-to-peer networking, where your computer and a friend's, each behind a different NAT router, wish to
establish a connection. NAT routers also often have trouble with UDP protocols, due to the tendency for
such protocols to have the public server reply from a different port than the one originally contacted. For
example, if host A behind a NAT router attempts to use TFTP (11.3 Trivial File Transport Protocol, TFTP),
and sends a packet to port 69 of public server C, then C is likely to reply from some new port, say 3000, and this reply is likely to be dropped by the NAT router as there will be no entry there yet for traffic from ⟨C,3000⟩.
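The NAT behavior of this section amounts to a table lookup keyed on both the remote socket address and the router-side port. The sketch below is illustrative only: the remapped router-side port 3001 for B's second connection, and the dropped-reply case at the end, are assumptions following the remapping and TFTP discussions above, not entries from the text's tables:

```python
# Sketch of a NAT router's inbound lookup. Keys are
# (remote_host, remote_port, router_port); values are the inside socket.
nat_table = {
    ("C", 80, 3000): ("A", 3000),
    ("D", 80, 3000): ("B", 3000),
    ("C", 80, 3001): ("B", 3000),   # B's connection, port remapped to 3001
}

def inbound(src_host, src_port, router_port):
    """Rewrite an inbound packet's destination, or None if no entry (drop)."""
    return nat_table.get((src_host, src_port, router_port))

print(inbound("C", 80, 3000))     # ('A', 3000)
print(inbound("C", 80, 3001))     # ('B', 3000)
print(inbound("C", 3000, 3000))   # None: a reply from a new port is dropped
```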
1.15 IETF and OSI
It seems clear that the primary reasons the OSI protocols failed in the marketplace were their ponderous
bureaucracy for protocol management, their principle that protocols be completed before implementation
began, and their insistence on rigid adherence to the specifications to the point of non-interoperability. In
contrast, the IETF had (and still has) a "two working implementations" rule for a protocol to become a
Draft Standard. From RFC 2026:
A specification from which at least two independent and interoperable implementations from different
code bases have been developed, and for which sufficient successful operational experience has been
obtained, may be elevated to the Draft Standard level. [emphasis added]
This rule has often facilitated the discovery of protocol design weaknesses early enough that the problems
could be fixed. The OSI approach is a striking failure for the waterfall design model, when competing
with the IETF's cyclic prototyping model. However, it is worth noting that the IETF has similarly been
unable to keep up with rapid changes in html, particularly at the browser end; the OSI mistakes were mostly
evident only in retrospect.
Trying to fit protocols into specific layers is often both futile and irrelevant. By one perspective, the Real-Time Protocol RTP lives at the Transport layer, but just above the UDP layer; others have put RTP into the Application layer. Parts of the RTP protocol resemble the Session and Presentation layers. A key component
of the IP protocol is the set of various router-update protocols; some of these freely use higher-level layers.
Similarly, tunneling might be considered to be a Link-layer protocol, but tunnels are often created and
maintained at the Application layer.
A sometimes-more-successful approach to understanding layers is to view them instead as parts of a
protocol graph. Thus, in the following diagram we have two protocols at the transport layer (UDP and
RTP), and one protocol (ARP) not easily assigned to a layer.
1.17 Epilog
This completes our tour of the basics. In the remaining chapters we will expand on the material here.
1.18 Exercises
1. Give forwarding tables for each of the switches S1-S4 in the following network with destinations A, B,
C, D. For the next_hop column, give the neighbor on the appropriate link rather than the interface number.
[Diagram: network of host A and switches S1, S2, S3, S4; figure not reproduced]
2. Give forwarding tables for each of the switches S1-S4 in the following network with destinations A, B,
C, D. Again, use the neighbor form of next_hop rather than the interface form. Try to keep the route to
each destination as short as possible. What decision has to be made in this exercise that did not arise in the
preceding exercise?
[Diagram: network of host A and switches S1, S2, S4, S3; figure not reproduced]
3. Consider the following arrangement of switches and destinations. Give forwarding tables (in neighbor
form) for S1-S4 that include default forwarding entries; the default entries should point toward S5. Eliminate all table entries that are implied by the default entry (that is, if the default entry is to S3, eliminate all
other entries for which the next hop is S3).
[Diagram: hosts A, D and switches S1, S2, S3, S4, S5; figure not reproduced]
4. Four switches are arranged as below. The destinations are S1 through S4 themselves.
[Diagram: switches S1, S2, S3, S4 arranged in a square; figure not reproduced]
(a). Give the forwarding tables for S1 through S4 assuming packets to adjacent nodes are sent along the
connecting link, and packets to diagonally opposite nodes are sent clockwise.
(b). Give the forwarding tables for S1 through S4 assuming the S1–S4 link is not used at all, not even for S1–S4 traffic.
5. Suppose we have switches S1 through S4; the forwarding-table destinations are the switches themselves. The tables for S2 and S3 are as follows, where each entry is a ⟨destination,next_hop⟩ pair:

S2: ⟨S1,S1⟩ ⟨S3,S3⟩ ⟨S4,S3⟩
S3: ⟨S1,S2⟩ ⟨S2,S2⟩ ⟨S4,S4⟩

From the above we can conclude that S2 must be directly connected to both S1 and S3 as its table lists them as next_hops; similarly, S3 must be directly connected to S2 and S4.
(a). Must S1 and S4 be directly connected? If so, explain; if not, give a network in which there is no direct
link between them, consistent with the tables above.
(b). Now suppose S3's table is changed to the following. In this case must S1 and S4 be directly connected? Why or why not?

S3: ⟨S1,S4⟩ ⟨S2,S2⟩ ⟨S4,S4⟩
While the table for S4 is not given, you may assume that forwarding does work correctly. However, you
should not assume that paths are the shortest possible; in particular, you should not assume that each switch
will always reach its directly connected neighbors by using the direct connection.
6. (a) Suppose a network is as follows, with the only path from A to C passing through B:
...
...
[Diagram: switches S1-S6 with S10, S11, S12 attached; figure not reproduced]
7. Suppose S1-S6 have the forwarding tables below. For each destination A,B,C,D,E,F, suppose a packet is sent to the destination from S1. Give the switches it passes through, including the initial switch S1, up until the final switch S10-S12.
S1: (A,S4), (B,S2), (C,S4), (D,S2), (E,S2), (F,S4)
S2: (A,S5), (B,S5), (D,S5), (E,S3), (F,S3)
S3: (B,S6), (C,S2), (E,S6), (F,S6)
S4: (A,S10), (C,S5), (E,S10), (F,S5)
S5: (A,S6), (B,S11), (C,S6), (D,S6), (E,S4), (F,S2)
S6: (A,S3), (B,S12), (C,S12), (D,S12), (E,S5), (F,S12)
8. In the previous exercise, the routes taken by packets A-D are reasonably direct, but the routes for E and F
are rather circuitous.
Some routing applications assign weights to different links, and attempt to choose a path with the lowest
total link weight.
(a). Assign weights to the seven links S1–S2, S2–S3, S1–S4, S2–S5, S3–S6, S4–S5 and S5–S6 so that destination E's route in the previous exercise becomes the optimum (lowest total link weight) path.
(b). Assign (different!) weights to the seven links that make destination F's route in the previous exercise optimal.
Hint: you can do this by assigning a weight of 1 to all links except for one or two "bad" links; the bad links get a weight of 10. In each of (a) and (b) above, the route taken will be the route that avoids all the bad
links. You must treat (a) entirely differently from (b); there is no assignment of weights that can account for
both routes.
9. Suppose we have the following three Class C IP networks, joined by routers R1–R4. Give the forwarding
table for each router. For networks directly connected to a router (eg 200.0.1/24 and R1), include the network
in the table but list the next hop as direct.
[Diagram: routers R1, R2, R3, R4 joining networks 200.0.1/24, 200.0.2/24 and 200.0.3/24; figure not reproduced]
2 ETHERNET
We now turn to a deeper analysis of the ubiquitous Ethernet LAN protocol. User-level Ethernet today (2013) is usually 100 Mbps, with Gigabit Ethernet standard in server rooms and backbones, but because
Ethernet speed scales in odd ways, we will start with the 10 Mbps formulation. While the 10 Mbps speed is
obsolete, and while even the Ethernet collision mechanism is largely obsolete, collision management itself
continues to play a significant role in wireless networks.
2.1 10-Mbps classic Ethernet
Classic Ethernet came in version 1 [1980, DEC-Intel-Xerox], version 2 [1982, DIX], and IEEE 802.3. There
are some minor electrical differences between these, and one rather substantial packet-format difference. In
addition to these, the Berkeley Unix trailing-headers packet format was used for a while.
There were three physical formats for 10 Mbps Ethernet cable: thick coax (10BASE-5), thin coax (10BASE-2), and, last to arrive, twisted pair (10BASE-T). Thick coax was the original; economics drove the successive development of the later two. The cheaper twisted-pair cabling eventually almost entirely displaced coax, at least for host connections.
The original specification included support for repeaters, which were in effect signal amplifiers although
they might attempt to clean up a noisy signal. Repeaters processed each bit individually and did no buffering.
In the telecom world, a repeater might be called a digital regenerator. A repeater with more than two ports
was commonly called a hub; hubs allowed branching and thus much more complex topologies.
Bridges, later known as switches, came along a short time later. While repeaters act at the bit layer, a switch reads in and forwards an entire packet as a unit, and the destination address is likely consulted to determine to where the packet is forwarded. Originally, switches were seen as providing interconnection ("bridging") between separate Ethernets, but later a switched Ethernet was seen as one large "virtual Ethernet". We return to switching below in 2.4 Ethernet Switches.
Hubs propagate collisions; switches do not. If the signal representing a collision were to arrive at one port of a hub, it would, like any other signal, be retransmitted out all other ports. If a switch were to detect a collision on one port, no other ports would be involved; only packets received successfully are ever retransmitted
out other ports.
In coaxial-cable installations, one long run of coax snaked around the computer room or suite of offices;
each computer connected somewhere along the cable. Thin coax allowed the use of T-connectors to attach
hosts; connections were made to thick coax via taps, often literally drilled into the coax central conductor.
In a standalone installation one run of coax might be the entire Ethernet; otherwise, somewhere a repeater
would be attached to allow connection to somewhere else.
Twisted-pair does not allow mid-cable attachment; it is only used for point-to-point links between hosts,
switches and hubs. In a twisted-pair installation, each cable runs between the computer location and a
central wiring closet (generally much more convenient than trying to snake coax all around the building).
Originally each cable in the wiring closet plugged into a hub; nowadays the hub has likely been replaced by
a switch.
There is still a role for hubs today when one wants to monitor the Ethernet signal from A to B (eg for
intrusion detection analysis), although some switches now also support a form of monitoring.
All three cable formats could interconnect, although only through repeaters and hubs, and all used the same
10 Mbps transmission speed. While twisted-pair cable is still used by 100 Mbps Ethernet, it generally needs
to be a higher-performance version known as Category 5, versus the 10 Mbps Category 3.
Here is the format of a typical Ethernet packet (DIX specification):
The destination and source addresses are 48-bit quantities; the type is 16 bits, the data length is variable up
to a maximum of 1500 bytes, and the final CRC checksum is 32 bits. The checksum is added by the Ethernet
34
2 Ethernet
hardware, never by the host software. There is also a preamble, not shown: a block of 1 bits followed by a
0, in the front of the packet, for synchronization. The type field identifies the next higher protocol layer; a
few common type values are 0x0800 = IP, 0x8137 = IPX, 0x0806 = ARP.
The IEEE 802.3 specification replaced the type field by the length field, though this change never caught on.
The two formats can be distinguished as long as the type values used are larger than the maximum Ethernet
length of 1500 (or 0x05dc); the type values given in the previous paragraph all meet this condition.
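The DIX field layout, and the type-vs-length disambiguation rule, can be checked with a short Python sketch; the sample addresses here are made up for illustration:

```python
import struct

dst = bytes.fromhex("ffffffffffff")            # broadcast: all 1s
src = bytes.fromhex("00a0cc123456")            # a made-up 48-bit address
frame = dst + src + struct.pack("!H", 0x0800) + b"some-payload"

# Unpack destination (6 bytes), source (6 bytes), type (2 bytes),
# all in network (big-endian) byte order.
d, s, ethertype = struct.unpack("!6s6sH", frame[:14])
print(hex(ethertype))              # 0x800 = IP

# DIX type vs. 802.3 length: values above 1500 (0x05dc) must be types.
print(ethertype > 0x05dc)          # True
```

The preamble and trailing CRC are added by the hardware and would not normally be visible to software parsing the frame this way.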
Each Ethernet card has a (hopefully unique) physical address in ROM; by default any packet sent to this
address will be received by the board and passed up to the host system. Packets addressed to other physical
addresses will be seen by the card, but ignored (by default). All Ethernet devices also agree on a broadcast
address of all 1s: a packet sent to the broadcast address will be delivered to all attached hosts.
It is sometimes possible to change the physical address of a given card in software. It is almost universally
possible to put a given card into promiscuous mode, meaning that all packets on the network, no matter
what the destination address, are delivered to the attached host. This mode was originally intended for
diagnostic purposes but became best known for the security breach it opens: it was once not unusual to find
a host with network board in promiscuous mode and with a process collecting the first 100 bytes (presumably
including userid and password) of every telnet connection.
As long as the manufacturer involved is diligent in assigning the second three bytes, every manufacturer-provided Ethernet address should be globally unique. Lapses, however, are not unheard of.
If we need to send less than 46 bytes of data (for example, a 40-byte TCP ACK packet), the Ethernet packet
must be padded out to the minimum length. As a result, all protocols running on top of Ethernet need to
provide some way to specify the actual data length, as it cannot be inferred from the received packet size.
As a specific example of a collision occurring as late as possible, consider the diagram below. A and B are
5 units apart, and the bandwidth is 1 byte/unit. A begins sending "helloworld" at T=0; B starts sending just as A's message arrives, at T=5. B has listened before transmitting, but A's signal was not yet evident. A doesn't discover the collision until 10 units have elapsed, which is twice the distance.
Here are typical maximum values for the delay in 10 Mbps Ethernet due to various components. These
are taken from the Digital-Intel-Xerox (DIX) standard of 1982, except that point-to-point link cable is
replaced by standard cable. The DIX specification allows 1500m of coax with two repeaters and 1000m
of point-to-point cable; the table below shows 2500m of coax and four repeaters, following the later IEEE
802.3 Ethernet specification. Some of the more obscure delays have been eliminated. Entries are one-way
delay times, in bits. The maximum path may have four repeaters, and ten transceivers (simple electronic
devices between the coax cable and the NI cards), each with its drop cable (two transceivers per repeater,
plus one at each endpoint).
Ethernet delay budget:

item                 length    delay, in bits
coax                 2500M     110 bits
transceiver cables   500M      25 bits
transceivers                   40 bits, max 10 units
repeaters                      25 bits, max 4 units
encoders                       20 bits, max 10 units
The total here is 220 bits; in a full accounting it would be 232. Some of the numbers shown are a little high,
but there are also signal rise time delays, sense delays, and timer delays that have been omitted. It works out
fairly closely.
Implicit in the delay budget table above is the "length" of a bit. The speed of propagation in copper is about 0.77c, where c = 3×10⁸ m/sec = 300 m/µsec is the speed of light in vacuum. So, in 0.1 microseconds (the time to send one bit at 10 Mbps), the signal propagates approximately 0.77×c×10⁻⁷ = 23 meters.
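The arithmetic checks out directly:

```python
c = 3e8                  # speed of light in vacuum, m/sec
bit_time = 1e-7          # one bit time at 10 Mbps, in seconds
length_of_bit = 0.77 * c * bit_time   # propagation distance per bit
print(round(length_of_bit, 1))        # 23.1 meters
```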
Ethernet packets also have a maximum packet size of 1500 bytes. This limit is primarily for the sake
of fairness, so one station cannot unduly monopolize the cable (and also so stations can reserve buffers
guaranteed to hold an entire packet). At one time hardware vendors often marketed their own incompatible
extensions to Ethernet which enlarged the maximum packet size to as much as 4KB. There is no technical
reason, actually, not to do this, except compatibility.
The signal loss in any single segment of cable is limited to 8.5 dB, or about 14% of original strength.
Repeaters will restore the signal to its original strength. The reason for the per-segment length restriction
is that Ethernet collision detection requires a strict limit on how much the remote signal can be allowed to
lose strength. It is possible for a station to detect and reliably read very weak remote signals, but not at the
same time that it is transmitting locally. This is exactly what must be done, though, for collision detection
to work: remote signals must arrive with sufficient strength to be heard even while the receiving station is
itself transmitting. The per-segment limit, then, has nothing to do with the overall length limit; the latter is
set only to ensure that a sender is guaranteed of detecting a collision, even if it sends the minimum-sized
packet.
Assume that collision detection always takes one slot time (it will take much less for nodes closer together)
and that the slot start-times for each station are synchronized; this allows us to measure time in slots. A solid
arrow at the start of a slot means that sender began transmission in that slot; a red X signifies a collision. If
a collision occurs, the backoff value k is shown underneath. A dashed line shows the station waiting k slots
for its next attempt.
At T=0 we assume the transmitting station finishes, and all the Ai transmit and collide. At T=1, then, each
of the Ai has discovered the collision; each chooses a random k<2. Let us assume that A1 chooses k=1, A2
chooses k=1, A3 chooses k=0, A4 chooses k=0, and A5 chooses k=1.
Those stations choosing k=0 will retransmit immediately, at T=1. This means A3 and A4 collide again, and
at T=2 they now choose random k<4. We will assume A3 chooses k=3 and A4 chooses k=0; A3 will try
again at T=2+3=5 while A4 will try again at T=2, that is, now.
At T=2, we now have the original A1, A2, and A5 transmitting for the second time, while A4 is trying again
for the third time. They collide. Let us suppose A1 chooses k=2, A2 chooses k=1, A5 chooses k=3, and A4
chooses k=6 (A4 is choosing k<8 at random). Their scheduled transmission attempt times are now A1 at
T=3+2=5, A2 at T=4, A5 at T=6, and A4 at T=9.
At T=3, nobody attempts to transmit. But at T=4, A2 is the only station to transmit, and so successfully
seizes the channel. By the time T=5 rolls around, A1 and A3 will check the channel, that is, listen first, and
wait for A2 to finish. At T=9, A4 will check the channel again, and also begin waiting for A2 to finish.
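The walkthrough above can be replayed in a few lines; the backoff values below are the ones assumed in the text, not chosen at random as a real station would:

```python
# Replay of the exponential-backoff scenario above. next_try[s] is the slot at
# which station s will transmit next; chosen_k maps (station, slot at which the
# collision is discovered) to the backoff value assumed in the text.
next_try = {"A1": 0, "A2": 0, "A3": 0, "A4": 0, "A5": 0}
chosen_k = {
    ("A1", 1): 1, ("A2", 1): 1, ("A3", 1): 0, ("A4", 1): 0, ("A5", 1): 1,  # k < 2
    ("A3", 2): 3, ("A4", 2): 0,                                            # k < 4
    ("A1", 3): 2, ("A2", 3): 1, ("A5", 3): 3, ("A4", 3): 6,                # A4: k < 8
}

winner = None
for t in range(10):
    senders = sorted(s for s, when in next_try.items() if when == t)
    if len(senders) == 1:
        winner = (senders[0], t)
        break
    if len(senders) > 1:            # collision, discovered one slot later
        for s in senders:
            next_try[s] = (t + 1) + chosen_k[(s, t + 1)]

print(winner)   # ('A2', 4): A2 alone seizes the channel at T=4
```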
A maximum of 1024 hosts is allowed on an Ethernet. This number apparently comes from the maximum
range for the backoff time as 0 ≤ k < 1024. If there are 1024 hosts simultaneously trying to send, then,
once the backoff range has reached k<1024 (N=10), we have a good chance that one station will succeed in
seizing the channel, that is, the minimum value of all the random k's chosen will be unique.
This backoff algorithm is not fair, in the sense that the longer a station has been waiting to send, the lower
its priority sinks. Newly transmitting stations with N=0 need not delay at all. The Ethernet capture effect,
below, illustrates this unfairness.
2.1.7 Errors
Packets can have bits flipped or garbled by electrical noise on the cable; estimates of the frequency with
which this occurs range from 1 in 10⁴ to 1 in 10⁶. Bit errors are not uniformly likely; when they occur,
they are likely to occur in bursts. Packets can also be lost in hubs, although this appears less likely. Packets
can be lost due to collisions only if the sending host makes 16 unsuccessful transmission attempts and gives
up. Ethernet packets contain a 32-bit CRC error-detecting code (see 5.4.1 Cyclical Redundancy Check:
CRC) to detect bit errors. Packets can also be misaddressed by the sending host, or, most likely of all, they
can arrive at the receiving host at a point when the receiver has no free buffers and thus be dropped by a
higher-layer protocol.
As a first look at contention intervals, assume that there are N stations waiting to transmit at the start of the
interval. It turns out that, if all follow the exponential backoff algorithm, we can expect O(N) slot times
before one station successfully acquires the channel; thus, Ethernets are happiest when N is small and there
are only a few stations simultaneously transmitting. However, multiple stations are not necessarily a severe
problem. Often the number of slot times needed turns out to be about N/2, and slot times are short. If N=20,
then N/2 is 10 slot times, or 640 bytes. However, one packet time might be 1500 bytes. If packet intervals
are 1500 bytes and contention intervals are 640 bytes, this gives an overall throughput of 1500/(640+1500)
= 70% of capacity. In practice, this seems to be a reasonable upper limit for the throughput of classic
shared-media Ethernet.
separately but not to the aggregated whole. In a fully switched (that is, no hubs) 100BASE-TX LAN, each
collision domain is simply a single twisted-pair link, subject to the 100-meter maximum length.
Fast Ethernet also introduced the concept of full-duplex Ethernet: two twisted pairs could be used, one
for each direction. Full-duplex Ethernet is limited to paths not involving hubs, that is, to single station-to-station links, where a station is either a host or a switch. Because such a link has only two potential senders,
and each sender has its own transmit line, full-duplex Ethernet is collision-free.
Fast Ethernet uses 4B/5B encoding, covered in 4.1.4 4B/5B.
Fast Ethernet 100BASE-TX does not particularly support links between buildings, due to the network-diameter limitation. However, fiber-optic point-to-point links are quite effective here, provided full-duplex
is used to avoid collisions. We mentioned above that the fiber-optic 100BASE-FX standard allowed a
maximum half-duplex run of 400 meters, but 100BASE-FX is much more likely to use full duplex, where
the maximum cable length rises to 2,000 meters.
In developing faster Ethernet speeds, economics plays at least as important a role as technology. As new
speeds reach the market, the earliest adopters often must take pains to buy cards, switches and cable known
to work together; this in effect amounts to installing a proprietary LAN. The real benefit of Ethernet,
however, is arguably that it is standardized, at least eventually, and thus a site can mix and match its cards
and devices. Having a given Ethernet standard support existing cable is even more important economically;
the costs of replacing cable often dwarf the costs of the electronics.
If the destination address D is the broadcast address, or, for many switches, a multicast address, broadcast
is required.
In the diagram above, each switch's tables are indicated by listing near each interface the destinations known
to be reachable by that interface. The entries shown are the result of the following packets:
• A sends to B; all switches learn where A is
• B sends to A; this packet goes directly to A; only S3, S2 and S1 learn where B is
• C sends to B; S4 does not know where B is so this packet goes to S5; S2 does know where B is so the
packet does not go to S1.
Switches do not automatically discover directly connected neighbors; S1 does not learn about A until A
transmits a packet.
Once all the switches have learned where all (or most of) the hosts are, packet routing becomes optimal. At
this point packets are never sent on links unnecessarily; a packet from A to B only travels those links that
lie along the (unique) path from A to B. (Paths must be unique because switched Ethernet networks cannot
have loops, at least not active ones. If a loop existed, then a packet sent to an unknown destination would be
forwarded around the loop endlessly.)
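The learning-and-forwarding rule just described can be sketched as follows; this is a minimal single-switch model with made-up port numbers, not a real switch implementation:

```python
# Minimal sketch of a single learning switch. The table maps a host (source)
# address to the port on which that host was last seen.
class LearningSwitch:
    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.table = {}

    def handle(self, src, dst, arrival_port):
        """Return the list of ports the packet is forwarded out."""
        self.table[src] = arrival_port          # learn where src is
        if dst in self.table:
            return [self.table[dst]]            # known destination: one port
        # unknown destination: fall back to broadcast (all non-arrival ports)
        return [p for p in range(self.num_ports) if p != arrival_port]

sw = LearningSwitch(4)
print(sw.handle("A", "B", 0))   # B unknown: flood out ports 1, 2, 3
print(sw.handle("B", "A", 2))   # A was learned on port 0: [0]
```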
Switches have an additional advantage in that traffic that does not flow where it does not need to flow is
much harder to eavesdrop on. On an unswitched Ethernet, one host configured to receive all packets can
eavesdrop on all traffic. Early Ethernets were notorious for allowing one unscrupulous station to capture,
for instance, all passwords in use on the network. On a fully switched Ethernet, a host physically only sees
the traffic actually addressed to it; other traffic remains inaccessible.
Typical switches have room for a table with 10⁴ - 10⁶ entries, though maxing out at 10⁵ entries may be more
common; this is usually enough to learn about all hosts in even a relatively large organization. A switched
Ethernet can fail when total traffic becomes excessive, but excessive total traffic would drown any network
(although other network mechanisms might support higher bandwidth). The main limitations specific to
switching are the requirement that the topology must be loop-free (thus disallowing duplicate paths which
might otherwise provide redundancy), and that all broadcast traffic must always be forwarded everywhere.
As a switched Ethernet grows, broadcast traffic comprises a larger and larger percentage of the total traffic,
and the organization must at some point move to a routing architecture (eg as in 7.6 IP Subnets).
One of the differences between an inexpensive Ethernet switch and a pricier one is the degree of internal
parallelism it can support. If three packets arrive simultaneously on ports 1, 2 and 3, and are destined for
respective ports 4, 5 and 6, can the switch actually transmit the packets simultaneously? A simple switch
likely has a single CPU and a single memory bus, both of which can introduce transmission bottlenecks.
For commodity five-port switches, at most two simultaneous transmissions can occur; such switches can
generally handle that degree of parallelism. It becomes harder as the number of ports increases, but at some
point the need to support full parallel operation can be questioned; in many settings the majority of traffic
involves one or two server or router ports. If a high degree of parallelism is in fact required, there are various
architectures known as switch fabrics that can be used; these typically involve multiple simple processor
elements.
When a switch sees a new root candidate, it sends BPDUs on all interfaces, indicating the distance. The
switch includes the interface leading towards the root.
Once this process is complete, each switch knows
• its own path to the root
• which of its ports any further-out switches will be using to reach the root
• for each port, its directly connected neighboring switches
Now the switch can prune some (or all!) of its interfaces. It disables all interfaces that are not enabled by
the following rules:
1. It enables the port via which it reaches the root
2. It enables any of its ports that further-out switches use to reach the root
3. If a remaining port connects to a segment to which other segment-neighbor switches connect as well,
the port is enabled if the switch has the minimum cost to the root among those segment-neighbors, or,
if a tie, the smallest ID among those neighbors, or, if two ports are tied, the port with the smaller ID.
4. If a port has no directly connected switch-neighbors, it presumably connects to a host or segment, and
the port is enabled.
Rules 1 and 2 construct the spanning tree; if S3 reaches the root via S2, then Rule 1 makes sure S3's port
towards S2 is open, and Rule 2 makes sure S2's corresponding port towards S3 is open. Rule 3 ensures that
each network segment that connects to multiple switches gets a unique path to the root: if S2 and S3 are
segment-neighbors each connected to segment N, then S2 enables its port to N and S3 does not (because
2<3). The primary concern here is to create a path for any host nodes on segment N; S2 and S3 will create
their own paths via Rules 1 and 2. Rule 4 ensures that any stub segments retain connectivity; these would
include all hosts directly connected to switch ports.
S1 has the lowest ID, and so becomes the root. S2 and S4 are directly connected, so they will enable the
interfaces by which they reach S1 (Rule 1) while S1 will enable its interfaces by which S2 and S4 reach it
(Rule 2).
S3 has a unique lowest-cost route to S1, and so again by Rule 1 it will enable its interface to S2, while by
Rule 2 S2 will enable its interface to S3.
S5 has two choices; it hears of equal-cost paths to the root from both S2 and S4. It picks the lower-numbered
neighbor S2; the interface to S4 will never be enabled. Similarly, S4 will never enable its interface to S5.
Similarly, S6 has two choices; it selects S3.
After these links are enabled (strictly speaking it is interfaces that are enabled, not links, but in all cases here
either both interfaces of a link will be enabled or neither), the network in effect becomes:
Eventually, all switches discover S1 is the root (because 1 is the smallest of {1,2,3,4,5,6}). S2, S3 and S4
are one (unique) hop away; S5, S6 and S7 are two hops away.
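The root election and hop counts can be computed mechanically; the adjacency list below is reconstructed from the description in the text (network segments and hosts omitted), so treat it as an assumption:

```python
# Root election by lowest switch ID, and hop counts to the root by
# breadth-first search, for the switch topology described above.
from collections import deque

adj = {
    1: [2, 3, 4], 2: [1, 6], 3: [1, 5, 7],
    4: [1, 5], 5: [3, 4], 6: [2], 7: [3],
}

root = min(adj)            # lowest switch ID wins the election
dist = {root: 0}
q = deque([root])
while q:                   # BFS gives the hop count from each switch to the root
    u = q.popleft()
    for v in adj[u]:
        if v not in dist:
            dist[v] = dist[u] + 1
            q.append(v)

print(root, dist)   # S1 is root; S2, S3, S4 one hop away; S5, S6, S7 two hops
```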
Algorhyme
I think that I shall never see
a graph more lovely than a tree.
A tree whose crucial property
is loop-free connectivity.
A tree that must be sure to span
so packets can reach every LAN.
First, the root must be selected.
By ID, it is elected.
Least-cost paths from root are traced.
In the tree, these paths are placed.
A mesh is made by folks like me,
then bridges find a spanning tree.
Radia Perlman
For the switches one hop from the root, Rule 1 enables S2's port 1, S3's port 1, and S4's port 1. Rule 2
enables the corresponding ports on S1: ports 1, 5 and 4 respectively. Without the spanning-tree algorithm
S2 could reach S1 via port 2 as well as port 1, but port 1 has a smaller number.
S5 has two equal-cost paths to the root: S5→S4→S1 and S5→S3→S1. S3 is the switch with the
lower ID; its port 2 is enabled and S5 port 2 is enabled.
S6 and S7 reach the root through S2 and S3 respectively; we enable S6 port 1, S2 port 3, S7 port 2 and S3
port 3.
The ports still disabled at this point are S1 ports 2 and 3, S2 port 2, S4 ports 2 and 3, S5 port 1, S6 port 2
and S7 port 1.
Now we get to Rule 3, dealing with how segments (and thus their hosts) connect to the root. Applying Rule
3,
• We do not enable S2 port 2, because the network (B) has a direct connection to the root, S1
• We do enable S4 port 3, because S4 and S5 connect that way and S4 is closer to the root. This enables
connectivity of network D. We do not enable S5 port 1.
• S6 and S7 are tied for the path-length to the root. But S6 has smaller ID, so it enables port 2. S7's
port 1 is not enabled.
Finally, Rule 4 enables S4 port 2, and thus connectivity for host J. It also enables S1 port 2; network F has
two connections to S1 and port 2 is the lower-numbered connection.
All this port-enabling is done using only the data collected during the root-discovery phase; there is no
additional negotiation. The BPDU exchanges continue, however, so as to detect any changes in the topology.
If a link is disabled, it is not used even in cases where it would be more efficient to use it. That is, traffic
from F to B is sent via B1, D, and B5; it never goes through B7. IP routing, on the other hand, uses the
shortest path. To put it another way, all spanning-tree Ethernet traffic goes through the root node, or along
a path to or from the root node.
The traditional (IEEE 802.1D) spanning-tree protocol is relatively slow; the need to go through the tree-building phase means that after switches are first turned on no normal traffic can be forwarded for ~30
seconds. Faster, revised protocols have been proposed to reduce this problem.
Another issue with the spanning-tree algorithm is that a rogue switch can announce an ID of 0, thus likely
becoming the new root; this leaves that switch well-positioned to eavesdrop on a considerable fraction of
the traffic. One of the goals of the Cisco Root Guard feature is to prevent this; another goal of this and
related features is to put the spanning-tree topology under some degree of administrative control. One likely
wants the root switch, for example, to be geographically at least somewhat centered.
In the diagram above, S1 and S3 each have both red and blue ports. The switch network S1-S4 will deliver
traffic only when the source and destination ports are the same color. Red packets can be forwarded to the
blue VLAN only by passing through the router R, entering Rs red port and leaving its blue port. R may
apply firewall rules to restrict red→blue traffic.
When the source and destination ports are on the same switch, nothing needs to be added to the packet; the
switch can keep track of the color of each of its ports. However, switch-to-switch traffic must be additionally
tagged to indicate the source. Consider, for example, switch S1 above sending packets to S3 which has nodes
R3 (red) and B3 (blue). Traffic between S1 and S3 must be tagged with the color, so that S3 will know to
what ports it may be delivered. The IEEE 802.1Q protocol is typically used for this packet-tagging; a 32-bit
color tag is inserted into the Ethernet header after the source address and before the type field. The first
16 bits of this field are 0x8100, which becomes the new Ethernet type field and which identifies the frame as
tagged.
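As a sketch of the tag insertion just described (the VLAN ID 42 is a made-up example value, and the frame layout is simplified to addresses, type and payload):

```python
# Sketch of 802.1Q tag insertion: the 32-bit tag goes after the 6-byte
# destination and 6-byte source addresses, before the original type field.
import struct

def tag_frame(frame: bytes, vlan_id: int) -> bytes:
    # The first 16 bits of the tag are 0x8100 (the new Ethernet type field);
    # the second 16 bits carry priority bits and the 12-bit VLAN ID.
    tag = struct.pack("!HH", 0x8100, vlan_id & 0x0FFF)
    return frame[:12] + tag + frame[12:]

untagged = bytes(6) + bytes(6) + b"\x08\x00" + b"payload"   # dst, src, type, data
tagged = tag_frame(untagged, vlan_id=42)
print(tagged[12:14].hex())   # 8100
```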
Double-tagging is possible; this would allow an ISP to have one level of tagging and its customers to have
another level.
2.7 Epilog
Ethernet dominates the LAN layer, but is not one single LAN protocol: it comes in a variety of speeds
and flavors. Higher-speed Ethernet seems to be moving towards fragmenting into a range of physical-layer
options for different types of cable, but all based on switches and point-to-point linking; different Ethernet
types can be interconnected only with switches. Once Ethernet finally abandons physical links that are
bi-directional (half-duplex links), it will be collision-free and thus will no longer need a minimum packet
size.
Other wired networks have largely disappeared (or have been renamed Ethernet). Wireless networks,
however, are here to stay, and for the time being at least have inherited the original Ethernet's collision-management concerns.
2.8 Exercises
1. Simulate the contention period of five Ethernet stations that all attempt to transmit at T=0 (presumably
when some sixth station has finished transmitting), in the style of the diagram in 2.1.4 Exponential Backoff
Algorithm. Assume that time is measured in slot times, and that exactly one slot time is needed to detect a
collision (so that if two stations transmit at T=1 and collide, and one of them chooses a backoff time k=0,
then that station will transmit again at T=2). Use coin flips or some other source of randomness.
2. Suppose we have Ethernet switches S1 through S3 arranged as below. All forwarding tables are initially
empty.
[diagram: switches S1, S2, S3]
3. Suppose we have the Ethernet switches S1 through S4 arranged as below. All forwarding tables are
empty; each switch uses the learning algorithm of 2.4 Ethernet Switches.
[diagram: hosts A and B; switches S1, S2, S3, S4]
[diagram: switches S1, S2, S3]
Hint: Destination D must be in S3's forwarding table, but must not be in S2's.
5. Given the Ethernet network with learning switches below, with (disjoint) unspecified parts represented
by ?, explain why it is impossible for a packet sent from A to B to be forwarded by S1 only to S2, but to be
forwarded by S2 out all of S2's other ports.
?         ?
|         |
S1--------S2
6. In the diagram of 2.4 Ethernet Switches, suppose node D is connected to S5, and, with the tables as
shown below the diagram, D sends to B.
(a). Which switches will see this packet, and thus learn about D?
(b). Which of the switches in part (a) do not already know where B is and will use fallback-to-broadcast
(ie, will forward the packet out all non-arrival interfaces)?
7. Suppose two Ethernet switches are connected in a loop as follows; S1 and S2 have their interfaces 1 and
2 labeled. These switches do not use the spanning-tree algorithm.
Suppose A attempts to send a packet to destination B, which is unknown. S1 will therefore forward the
packet out interfaces 1 and 2. What happens then? How long will A's packet circulate?
8. The following network is like that of 2.5.1 Example 1: Switches Only, except that the switches are
numbered differently. Again, the ID of switch Sn is n, so S1 will be the root. Which links end up pruned
by the spanning-tree algorithm, and why?
[diagram: switches S1, S4, S6, S3, S5, S2]
9. Suppose you want to develop a new protocol so that Ethernet switches participating in a VLAN all keep
track of the VLAN color associated with every destination. Assume that each switch knows which of its
ports (interfaces) connect to other switches and which may connect to hosts, and in the latter case knows the
color assigned to that port.
(a). Suggest a way by which switches might propagate this destination-color information to other switches.
(b). What happens if a port formerly reserved for connection to another switch is now used for a host?
3 OTHER LANS
In the wired era, one could get along quite well with nothing but Ethernet and the occasional long-haul point-to-point link joining different sites. However, there are important alternatives out there. Some, like token
ring, are mostly of historical importance; others, like virtual circuits, are of great conceptual importance but
so far of only modest day-to-day significance.
And then there is wireless. It would be difficult to imagine contemporary laptop networking, let alone mobile
devices, without it. In both homes and offices, Wi-Fi connectivity is the norm. A return to being tethered by
wires is almost unthinkable.
After the VPN is set up, the home host's tun0 interface appears to be locally connected to Site A, and thus
the home host is allowed to connect to the private area within Site A. The home host's forwarding table will
be configured so that traffic to Site A's private addresses is routed via interface tun0.
VPNs are also commonly used to connect entire remote offices to headquarters. In this case the remote-office
end of the tunnel will be at that office's local router, and the tunnel will carry traffic for all the workstations
in the remote office.
To improve security, it is common for the residential (or remote-office) end of the VPN connection to use
the VPN connection as the default route for all traffic except that needed to maintain the VPN itself. This
may require a so-called host-specific forwarding-table entry at the residential end to allow the packets that
carry the VPN tunnel traffic to be routed correctly via eth0. This routing strategy means that potential
intruders cannot access the residential host and thus the workplace internal network through the original
residential Internet access. A consequence is that if the home worker downloads a large file from a non-workplace site, it will travel first to the workplace, then back out to the Internet via the VPN connection, and
finally arrive at the home.
3.3 Wi-Fi
Wi-Fi is a trademark denoting any of several IEEE wireless-networking protocols in the 802.11 family,
specifically 802.11a, 802.11b, 802.11g, 802.11n, and 802.11ac. Like classic Ethernet, Wi-Fi must deal
with collisions; unlike Ethernet, however, Wi-Fi is unable to detect collisions in progress, complicating the
backoff and retransmission algorithms. Wi-Fi is designed to interoperate freely with Ethernet at the logical
LAN layer; that is, Ethernet and Wi-Fi traffic can be freely switched from the wired side to the wireless side.
Band Width
To radio engineers, band width means the frequency range used by a signal, not data rate; in keeping
with this we will in this section and 3.4 WiMAX use the term data rate instead of bandwidth. We
will use the terms channel width or width of the frequency band for the frequency range. All else
being equal, the data rate achievable with a radio signal is proportional to the channel width.
Generally, Wi-Fi uses the 2.4 GHz ISM (Industrial, Scientific and Medical) band used also by microwave
ovens, though 802.11a uses a 5 GHz band, 802.11n supports that as an option and the new 802.11ac has
returned to using 5 GHz exclusively. The 5 GHz band has reduced ability to penetrate walls, often resulting
in a lower effective range. Wi-Fi radio spectrum is usually unlicensed, meaning that no special permission
is needed to transmit but also that others may be trying to use the same frequency band simultaneously; the
availability of unlicensed channels in the 5 GHz band continues to evolve.
The table below summarizes the different Wi-Fi versions. All bit rates assume a single spatial stream;
channel widths are nominal.

IEEE name    frequency    channel width
802.11a      5 GHz        20 MHz
802.11b      2.4 GHz      20 MHz
802.11g      2.4 GHz      20 MHz
802.11n      2.4/5 GHz    20-40 MHz
802.11ac     5 GHz        20-160 MHz
The maximum bit rate is seldom achieved in practice. The effective bit rate must take into account, at a
minimum, the time spent in the collision-handling mechanism. More significantly, all the Wi-Fi variants
above use dynamic rate scaling, below; the bit rate is reduced up to tenfold (or more) in environments with
higher error rates, which can be due to distance, obstructions, competing transmissions or radio noise. All
this means that, as a practical matter, getting 150 Mbps out of 802.11n requires optimum circumstances; in
particular, no competing senders and unimpeded line-of-sight transmission. 802.11n lower-end performance
can be as little as 10 Mbps, though 40-100 Mbps (for a 40 MHz channel) may be more typical.
The 2.4 GHz ISM band is divided by international agreement into up to 14 officially designated channels,
each about 5 MHz wide, though in the United States use may be limited to the first 11 channels. The 5 GHz
band is similarly divided into 5 MHz channels. One Wi-Fi sender, however, needs several of these official
channels; the typical 2.4 GHz 802.11g transmitter uses an actual frequency range of up to 22 MHz, or up
to five channels. As a result, to avoid signal overlap Wi-Fi use in the 2.4 GHz band is often restricted to
official channels 1, 6 and 11. The end result is that unrelated Wi-Fi transmitters can and do interact with and
interfere with each other.
The United States requires users of the 5 GHz band to avoid interfering with weather and military applications in the same frequency range. Once that is implemented, however, there are more 5 MHz channels at
this frequency than in the 2.4 GHz ISM band, which is one of the reasons 802.11ac can run faster (below).
Wi-Fi designers can improve speed through a variety of techniques, including
• improved radio modulation techniques
• improved error-correcting codes
• smaller guard intervals between symbols
• increasing the channel width
• allowing multiple spatial streams via multiple antennas
The first two in this list seem by now to be largely tapped out; the third reduces the range but may increase
the data rate by 11%.
The largest speed increases are obtained by increasing the number of 5 MHz channels used. For example,
the 65 Mbps bit rate above for 802.11n is for a nominal frequency range of 20 MHz, comparable to that
of 802.11g. However, in areas with minimal competition from other signals, 802.11n supports using a 40
MHz frequency band; the bit rate then goes up to 135 Mbps (150 Mbps with a smaller guard interval). This
amounts to using two of the three available 2.4 GHz Wi-Fi bands. Similarly, the wide range in 802.11ac
bit rates reflects support for using channel widths ranging from 20 MHz up to 160 MHz (32 5-MHz official
channels).
For all the categories in the table above, additional bits are used for error-correcting codes. For 802.11g
operating at 54 Mbps, for example, the actual raw bit rate is (4/3)×54 = 72 Mbps, sent in symbols consisting
of six bits as a unit.
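The coding-overhead arithmetic above, together with the 11% guard-interval figure mentioned earlier (135 Mbps rising to 150 Mbps for 802.11n), checks out as follows:

```python
# Check of the error-correction overhead: 802.11g at 54 Mbps uses a rate-3/4
# code, so the raw (coded) bit rate on the air is 4/3 as large.
data_rate = 54                 # Mbps delivered to the user
raw_rate = (4 / 3) * data_rate

# The smaller guard interval takes 802.11n from 135 to 150 Mbps: about 11%.
gi_gain = 150 / 135 - 1
print(round(raw_rate, 1), f"{gi_gain:.1%}")
```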
it waits for time SIFS and sends the ACK; at the instant when the end of the SIFS interval is reached, the
receiver will be the only station authorized to send. Any other stations waiting the longer IFS period will see
the ACK before the IFS time has elapsed and will thus not interfere with the ACK; similarly, any stations
with a running backoff-wait clock will continue to have that clock suspended.
Wi-Fi RTS/CTS
Wi-Fi stations optionally also use a request-to-send/clear-to-send (RTS/CTS) protocol. Usually this is
used only for larger packets; often, the RTS/CTS threshold (the size of the largest packet not sent using RTS/CTS) is set (as part of the Access Point configuration) to be the maximum packet size, effectively
disabling this feature. The idea here is that a large packet that is involved in a collision represents a significant waste of potential throughput; for large packets, we should ask first.
The RTS packet, which is small, is sent through the normal procedure outlined above; this packet includes
the identity of the destination and the size of the data packet the station desires to transmit. The destination
station then replies with CTS after the SIFS wait period, effectively preventing any other transmission after
the RTS. The CTS packet also contains the data-packet size. The original sender then waits for SIFS after
receiving the CTS, and sends the packet. If all other stations can hear both the RTS and CTS messages, then
once the RTS and CTS are sent successfully no collisions should occur during packet transmission, again
because the only idle times are of length SIFS and other stations should be waiting for time IFS.
Hidden-Node Problem
Consider the diagram below. Each station has a 100-meter range. Stations A and B are 150 meters apart and
so cannot hear one another at all; each is 75 meters from C. If A is transmitting and B senses the medium in
preparation for its own transmission, as part of collision avoidance, then B will conclude that the medium is
idle and will go ahead and send.
However, C is within range of both A and B. If A and B transmit simultaneously, then from C's perspective
a collision occurs. C receives nothing usable. We will call this a hidden-node collision as the senders A
and B are hidden from one another; the general scenario is known as the hidden-node problem.
Note that node D receives only As signal, and so no collision occurs at D.
The hidden-node problem can also occur if A and B cannot receive one another's transmissions due to a
physical obstruction such as a radio-impermeable wall:
One of the rationales for the RTS/CTS protocol is the prevention of hidden-node collisions. Imagine that,
instead of transmitting its data packet, A sends an RTS packet, and C responds with CTS. B has not heard
the RTS packet from A, but does hear the CTS from C. A will begin transmitting after a SIFS interval, but B
will not hear A's transmission. However, B will still wait, because the CTS packet contained the data-packet
size and thus, implicitly, the length of time all other stations should remain idle. Because RTS packets are
quite short, they are much less likely to be involved in collisions themselves than data packets.
Wi-Fi Fragmentation
Conceptually related to RTS/CTS is Wi-Fi fragmentation. If error rates or collision rates are high, a sender
can send a large packet as multiple fragments, each receiving its own link-layer ACK. As we shall see in
5.3.1 Error Rates and Packet Size, if bit-error rates are high then sending several smaller packets often
leads to fewer total transmitted bytes than sending the same data as one large packet.
Wi-Fi packet fragments are reassembled by the receiving node, which may or may not be the final destination.
As with the RTS/CTS threshold, the fragmentation threshold is often set to the size of the maximum packet.
Adjusting the values of these thresholds is seldom necessary, though might be appropriate if monitoring
revealed high collision or error rates. Unfortunately, it is essentially impossible for an individual station
to distinguish between reception errors caused by collisions and reception errors caused by other forms of
noise, and so it is hard to use reception statistics to distinguish between a need for RTS/CTS and a need for
fragmentation.
sender may fall back to the next lower bit rate. The actual bit-rate-selection algorithm lives in the particular
Wi-Fi driver in use; different nodes in a network may use different algorithms.
The earliest rate-scaling algorithm was Automatic Rate Fallback, or ARF, [KM97]. The rate decreases after
two consecutive transmission failures (that is, the link-layer ACK is not received), and increases after ten
transmission successes.
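The ARF rule just described can be sketched as a small state machine; the rate list is 802.11g's OFDM rate set, and the start-at-the-top policy is an assumption of this sketch:

```python
# Sketch of Automatic Rate Fallback (ARF): drop to the next lower bit rate
# after two consecutive transmission failures (no link-layer ACK), move up
# after ten consecutive successes.
RATES = [6, 9, 12, 18, 24, 36, 48, 54]   # 802.11g OFDM bit rates, Mbps

class ARF:
    def __init__(self):
        self.i = len(RATES) - 1   # start at the top rate (an assumption here)
        self.fails = 0
        self.successes = 0

    def rate(self):
        return RATES[self.i]

    def report(self, acked: bool):
        if acked:
            self.successes += 1
            self.fails = 0
            if self.successes >= 10 and self.i < len(RATES) - 1:
                self.i += 1           # ten successes in a row: try a faster rate
                self.successes = 0
        else:
            self.fails += 1
            self.successes = 0
            if self.fails >= 2 and self.i > 0:
                self.i -= 1           # two failures in a row: fall back
                self.fails = 0

arf = ARF()
arf.report(False); arf.report(False)   # two consecutive failures
print(arf.rate())                      # fell back from 54 to 48 Mbps
```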
A significant problem for rate scaling is that a packet loss may be due either to low-level random noise
(white noise, or thermal noise) or to a collision (which is also a form of noise, but less random); only in the
first case is a lower transmission rate likely to be helpful. If a larger number of collisions is experienced, the
longer packet-transmission times caused by the lower bit rate may increase the frequency of hidden-node
collisions. In fact, a higher transmission rate (leading to shorter transmission times) may help; enabling the
RTS/CTS protocol may also help.
Signal Strength
Most Wi-Fi drivers report the received signal strength. Newer drivers use the IEEE Received Channel
Power Indicator convention; the RCPI is an 8-bit integer proportional to the absolute power received
by the antenna as measured in decibel-milliwatts (dBm). Wi-Fi values range from -10 dBm to -90 dBm
and below. For comparison, the light from the star Polaris delivers about -97 dBm to one eye on a good
night; Venus typically delivers about -73 dBm. A GPS satellite might deliver -127 dBm to your phone.
(Inspired by Wikipedia on dBm.)
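As a footnote to the sidebar, dBm is a logarithmic measure of absolute power: x dBm corresponds to 10^(x/10) milliwatts. A pair of one-line converters (the function names are our own):

```python
import math

def dbm_to_mw(dbm):
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (dbm / 10)

def mw_to_dbm(mw):
    """Convert a power level in milliwatts to dBm."""
    return 10 * math.log10(mw)
```

By this rule, the -73 dBm of Venus works out to about 5×10⁻⁸ mW, or 50 picowatts.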
A variety of newer rate-scaling algorithms have been proposed; see [JB05] for a summary. One, Receiver-Based Auto Rate (RBAR, [HVB01]), attempts to incorporate the signal-to-noise ratio into the calculation of
the transmission rate. This avoids the confusion introduced by collisions. Unfortunately, while the signal-to-noise ratio has a strong theoretical correlation with the transmission bit-error rate, most Wi-Fi radios
report to the host system only the received signal strength. This is not the same as the signal-to-noise ratio,
which is harder to measure. As a result, the RBAR approach has not been quite as effective in practice as
might be hoped.
The Collision-Aware Rate Adaptation algorithm (CARA, [KKCQ06]) attempts (among other things) to infer
that a packet was lost to a collision rather than noise if, after one SIFS interval following the end of the packet
transmission, no link-layer ACK has been received and the channel is still busy. This will detect collisions,
of course, only with longer packets.
Because the actual data in a Wi-Fi packet may be sent at a rate not every participant is close enough to
receive correctly, every Wi-Fi transmission begins with a brief preamble at the minimum bit rate. Link-layer
ACKs, too, are sent at the minimum bit rate.
The association process, carried out by an exchange of special management packets, may be restricted to stations with hardware (LAN)
addresses on a predetermined list, or to stations with valid cryptographic credentials. Stations may regularly
re-associate to their Access Point, especially if they wish to communicate some status update.
Access Points
Generally, a Wi-Fi access point has special features; Wi-Fi-enabled station devices like phones and
workstations do not act as access points. However, it may be possible for a station device to become
an access point if the access-point mode is supported by the underlying radio hardware and if suitable
drivers can be found. Under Linux, the hostapd package is one option.
Stations in an infrastructure network communicate directly only with their access point. If B and C share
access point A, and B wishes to send a packet to C, then B first forwards the packet to A and A then forwards
it to C. While this introduces a degree of inefficiency, it does mean that the access point and its associated
nodes automatically act as a true LAN: every node can reach every other node. In an ad hoc network, by
comparison, it is quite common for two nodes to be able to reach each other only by forwarding through an
intermediate third node; this is in fact exactly the hidden-node scenario.
Finally, Wi-Fi is by design completely interoperable with Ethernet; if station A is associated with access
point AP, and AP also connects via (cabled) Ethernet to station B, then if A wants to send a packet to B it
sends it using AP as the Wi-Fi destination but with B also included in the header as the actual destination.
Once it receives the packet by wireless, AP acts as an Ethernet switch and forwards the packet to B.
While this forwarding is transparent to senders, the Ethernet and Wi-Fi LAN header formats are entirely
different.
The above diagram illustrates an Ethernet header and the Wi-Fi header for a typical data packet not using
Wi-Fi quality-of-service features. The Ethernet type field usually moves to an IEEE Logical Link Control
header in the Wi-Fi region labeled data. The receiver and transmitter addresses are the MAC addresses
of the nodes receiving and transmitting the (unicast) packet; these may each be different from the ultimate
destination and source addresses. In infrastructure mode one of the receiver or transmitter addresses is the
access point; in typical situations either the receiver is the destination or the sender is the transmitter.
(For pseudo-security reasons, beacon packets can be suppressed.) Large installations can create roaming access among multiple access points by assigning all the access points the same SSID. An individual station
will stay with the access point with which it originally associated until the signal strength falls below a certain level, at which point it will seek out other access points with the same SSID and with a stronger signal.
In this way, a large area can be carpeted with multiple Wi-Fi access points, so as to look like one large Wi-Fi
domain.
In order for this to work, traffic to wireless node B must find B's current access point AP. This is done in
much the same way as, in a wired Ethernet, traffic finds a laptop that has been unplugged, carried to a new
building, and plugged in again. The distribution network is the underlying wired network (eg Ethernet) to
which all the access points connect. If the distribution network is a switched Ethernet supporting the usual
learning mechanism (2.4 Ethernet Switches), then Wi-Fi location update is straightforward. Suppose B
is a wireless node that has been exchanging packets via the distribution network with C (perhaps a router
connecting B to the Internet). When B moves to a new access point, all it has to do is send any packet over
the LAN to C, and the Ethernet switches involved will then learn the route through the switched Ethernet
from C to B's current AP, and thus to B.
This process may leave other switches not currently communicating with B still holding in their forwarding tables the old location for B. This is not terribly serious, but can be avoided entirely if, after moving, B
sends out an Ethernet broadcast packet.
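The learning rule at work here can be sketched as follows; the class and the single-switch setting are simplifications of our own, but the update is the one described: every arriving packet refreshes the sender's table entry.

```python
# A sketch of the Ethernet learning mechanism that relocates a roaming
# Wi-Fi node: each switch records, for every source address it sees,
# the port the packet arrived on. Names here are illustrative.
class LearningSwitch:
    def __init__(self):
        self.table = {}                  # MAC address -> port

    def receive(self, src, dst, port):
        self.table[src] = port           # learn/refresh the sender's port
        return self.table.get(dst)       # None means flood

switch = LearningSwitch()
switch.receive(src="B", dst="C", port=1)   # B reaches C via its old AP on port 1
switch.receive(src="C", dst="B", port=2)
assert switch.receive(src="C", dst="B", port=2) == 1
switch.receive(src="B", dst="C", port=3)   # B moved; any packet updates the table
assert switch.receive(src="C", dst="B", port=2) == 3
```

Once B sends any packet from its new location, traffic from C immediately follows the refreshed entry.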
Ad hoc networks also have SSIDs; these are generated pseudorandomly at startup. Ad hoc networks have
beacon packets as well; all nodes participate in the regular transmission of these via a distributed algorithm.
To begin the association process, the supplicant contacts the authenticator using the Extensible Authentication Protocol, or EAP, with what amounts to a request to associate to that access point. EAP is a generic
message framework meant to support multiple specific types of authentication; see RFC 3748 and RFC
5247. The EAP request is forwarded to an authentication server, which may exchange (via the authenticator) several challenge/response messages with the supplicant. EAP is usually used in conjunction with
the RADIUS (Remote Authentication Dial-In User Service) protocol (RFC 2865), which is a specific (but
flexible) authentication-server protocol. WPA-Enterprise is sometimes known as 802.1X mode, EAP mode
or RADIUS mode.
One peculiarity of EAP is that EAP communication takes place before the supplicant is given an IP address
(in fact before the supplicant has completed associating itself to the access point); thus, a mechanism must
be provided to support exchange of EAP packets between supplicant and authenticator. This mechanism
is known as EAPOL, for EAP Over LAN. EAP messages between the authenticator and the authentication
server, on the other hand, can travel via IP; in fact, sites may choose to have the authentication server hosted
remotely.
Once the authentication server (eg RADIUS server) is set up, specific per-user authentication methods can
be entered. This can amount to ⟨username,password⟩ pairs, or some form of security certificate, or often
both. The authentication server will generally allow different encryption protocols to be used for different
supplicants, thus allowing for the possibility that there is not a common protocol supported by all stations.
When this authentication strategy is used, the access point no longer needs to know anything about what
authentication protocol is actually used; it is simply the middleman forwarding EAP packets between the
supplicant and the authentication server. The access point allows the supplicant to associate into the network
once it receives permission to do so from the authentication server.
Stations receiving data from the Access Point send the usual ACK after a SIFS interval. A data packet from
the Access Point addressed to station B may also carry, piggybacked in the Wi-Fi header, a Poll request to
another station C; this saves a transmission. Polled stations that send data will receive an ACK from the
Access Point; this ACK may be combined in the same packet with the Poll request to the next station.
At the end of the CFP, the regular contention period or CP resumes, with the usual CSMA/CA strategy.
The time interval between the start times of consecutive CFP periods is typically 100 ms, short enough to
allow some real-time traffic to be supported.
During the CFP, all stations normally wait only the Short IFS, SIFS, between transmissions. This works
because normally there is only one station designated to respond: the Access Point or the polled station.
However, if a station is polled and has nothing to send, the Access Point waits for time interval PIFS (PCF
Inter-Frame Spacing), of length midway between SIFS and IFS above (our previous IFS should now really
be known as DIFS, for DCF IFS). At the expiration of the PIFS, any non-Access-Point station that happens
to be unaware of the CFP will continue to wait the full DIFS, and thus will not transmit. An example of
such a CFP-unaware station might be one that is part of an entirely different but overlapping Wi-Fi network.
The Access Point generally maintains a polling list of stations that wish to be polled during the CFP. Stations
request inclusion on this list by an indication when they associate or (more likely) reassociate to the Access
Point. A polled station with nothing to send simply remains quiet.
PCF mode is not supported by many lower-end Wi-Fi routers, and often goes unused even when it is available. Note that PCF mode is collision-free, so long as no other Wi-Fi access points are active and within
range. While the standard has some provisions for attempting to deal with the presence of other Wi-Fi
networks, these provisions are somewhat imperfect; at a minimum, they are not always supported by other
access points. The end result is that polling is not quite as useful as it might be.
3.3.8 MANETs
The MANET acronym stands for mobile ad hoc network; in practice, the term generally applies to ad hoc
wireless networks of sufficient complexity that some internal routing mechanism is needed to enable full
connectivity. The term mesh network is also used. While MANETs can use any wireless mechanism, we
will assume here that Wi-Fi is used.
MANET nodes communicate by radio signals with a finite range, as in the diagram below.
Each node's radio range is represented by a circle centered about that node. In general, two MANET nodes
may be able to communicate only by relaying packets through intermediate nodes, as is the case for nodes
A and G in the diagram above.
In the field, the radio range of each node may not be very circular, due to among other things signal reflection and blocking from obstructions. An additional complication arises when the nodes (or even just
obstructions) are moving in real time (hence the mobile of MANET); this means that a working route may
stop working a short time later. For this reason, and others, routing within MANETs is a good deal more
complex than routing in an Ethernet. A switched Ethernet, for example, is required to be loop-free, so there
is never a choice among multiple alternative routes.
Note that, without successful LAN-layer routing, a MANET does not have full node-to-node connectivity
and thus does not meet the definition of a LAN given in 1.9 LANs and Ethernet. With either LAN-layer or
IP-layer routing, one or more MANET nodes may serve as gateways to the Internet.
Note also that MANETs in general do not support broadcast, unless the forwarding of broadcast messages
throughout the MANET is built in to the routing mechanism. This can complicate the assignment of IP
addresses; the common IPv4 mechanism we will describe in 7.8 Dynamic Host Configuration Protocol
(DHCP) relies on broadcast and so usually needs some adaptation.
Finally, we observe that while MANETs are of great theoretical interest, their practical impact has been
modest; they are almost unknown, for example, in corporate environments. They appear most useful in
emergency situations, rural settings, and settings where the conventional infrastructure network has failed
or been disabled.
3.3.8.1 Routing in MANETs
Routing in MANETs can be done either at the LAN layer, using physical addresses, or at the IP layer with
some minor bending (below) of the rules.
Either way, nodes must find out about the existence of other nodes, and appropriate routes must then be
selected. Route selection can use any of the mechanisms we describe later in 9 Routing-Update Algorithms.
Routing at the LAN layer is much like routing by Ethernet switches; each node will construct an appropriate
forwarding table. Unlike Ethernet, however, there may be multiple paths to a destination, direct connectivity
between any particular pair of nodes may come and go, and negotiation may be required even to determine
which MANET nodes will serve as forwarders.
Routing at the IP layer involves the same issues, but at least IP-layer routing-update algorithms have always
been able to handle multiple paths. There are some minor issues, however. When we initially presented
IP forwarding in 1.10 IP - Internet Protocol, we assumed that routers made their decisions by looking
only at the network prefix of the address; if another node had the same network prefix it was assumed to be
reachable directly via the LAN. This model usually fails badly in MANETs, where direct reachability has
nothing to do with addresses. At least within the MANET, then, a modified forwarding algorithm must be
used where every address is looked up in the forwarding table. One simple way to implement this is to have
the forwarding tables contain only host-specific entries as were discussed in 3.1 Virtual Private Network.
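Such a host-specific table might be sketched as below; the addresses and next-hop names are invented for illustration. The essential point is that the lookup key is the complete host address, never a network prefix:

```python
import ipaddress

# A sketch of MANET-style forwarding: every destination gets its own
# host-specific entry, since sharing a network prefix says nothing
# about direct reachability. Addresses and next hops are made up.
forwarding_table = {
    ipaddress.ip_address("10.0.0.7"):  "node D",   # reachable via D
    ipaddress.ip_address("10.0.0.12"): "node B",   # reachable via B
    ipaddress.ip_address("10.0.0.3"):  None,       # directly reachable
}

def next_hop(dst):
    # Look up the full address, not its network prefix.
    if dst not in forwarding_table:
        raise KeyError("no route to %s" % dst)
    return forwarding_table[dst]
```

Two nodes with the same /24 prefix may still need a multi-hop route between them, which is exactly what this per-host table records.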
Multiple routing algorithms have been proposed for MANETs. Performance of a given algorithm may
depend on the following factors:
The size of the network
Whether some nodes have agreed to serve as routers
The degree of node mobility, especially of routing-node mobility if applicable
Whether the nodes are under common administration, and thus may agree to defer their own transmission interests to the common good
Per-node storage and power availability
3.4 WiMAX
WiMAX is a wireless network technology standardized by IEEE 802.16. It supports both stationary subscribers (802.16d) and mobile subscribers (802.16e). The stationary-subscriber version is often used to
provide residential Internet connectivity, in both urban and rural areas. The mobile version is sometimes
referred to as a fourth generation or 4G networking technology; its primary competitor, a similar technology, is known
as LTE. WiMAX is used in many mobile devices, from smartphones to traditional laptops with wireless
cards installed.
As in the sidebar at the start of 3.3 Wi-Fi, we will use the term data rate for what is commonly called
bandwidth, to avoid confusion with the radio-specific meaning of the latter term.
WiMAX can use unlicensed frequencies, like Wi-Fi, but its primary use is over licensed radio spectrum.
WiMAX also supports a number of options for the width of its frequency band; the wider the band, the
higher the data rate. Wider bands also allow the opportunity for multiple independent frequency channels.
Downlink (base station to subscriber) data rates can be well over 100 Mbps (uplink rates are usually smaller).
Like Wi-Fi, WiMAX subscriber stations connect to a central access point, though the WiMAX standard
prefers the term base station, which we will use henceforth. Stationary-subscriber WiMAX, however, operates on a much larger scale. The coverage radius of a WiMAX base station can be tens of kilometers
if larger antennas are provided, versus less (sometimes much less) than 100 meters for Wi-Fi; mobile-subscriber WiMAX might have a radius of one or two kilometers. Large-radius base stations are typically
mounted in towers. Subscriber stations are not generally expected to be able to hear other stations; they
interact only with the base station. As WiMAX distances increase, the data rate is reduced.
As with Wi-Fi, the central contention problem is how to schedule transmissions of subscriber stations so
they do not overlap; that is, collide. The base station has no difficulty broadcasting transmissions to multiple
different stations sequentially; it is the transmissions of those stations that must be coordinated. Once a
station completes the network entry process to connect to a base station (below), it is assigned regular
(though not necessarily periodic) transmission slots. These transmission slots may vary in size over time;
the base station may regularly issue new transmission schedules.
The centralized assignment of transmission intervals superficially resembles Wi-Fi PCF mode (3.3.7 Wi-Fi Polling Mode); however, assignment is not done through polling, as propagation delays are too large
(below). Instead, each WiMAX subscriber station is told in effect that it may transmit starting at an assigned
time T and for an assigned length L. The station synchronizes its clock with that of the base station as part
of the network entry process.
Because of the long distances involved, synchronization and transmission protocols must take account of
speed-of-light delays. The round-trip delay across 30 km is 200 µs, which is ten times larger than the basic
Wi-Fi SIFS interval; at 160 Mbps, this is the time needed to send 4 KB. If a station is to transmit so that its
message arrives at the base station at a certain time, it must actually begin transmission early by an amount
equal to the one-way station-to-base propagation delay; a special ranging mechanism allows stations to
figure out this delay.
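The numbers in this paragraph, and the early-transmission rule, can be checked directly (the helper function is our own):

```python
# Worked numbers from the text: round-trip propagation over 30 km, and
# how many bits a 160 Mbps link sends in that time.
C = 3.0e8                      # speed of light, m/s (approximate)
distance = 30_000              # station-to-base distance, meters
one_way = distance / C         # one-way propagation delay, seconds
round_trip = 2 * one_way       # 2.0e-4 s, i.e. 200 microseconds

bits = 160e6 * round_trip      # bits in flight at 160 Mbps: 32,000 = 4 KB

# A transmission that must arrive at the base station at time T must
# begin early, at T minus the one-way station-to-base delay.
def transmit_start(arrival_time, one_way_delay):
    return arrival_time - one_way_delay
```

At 30 km the station begins sending a full 100 µs before its scheduled arrival time at the base station.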
A subscriber station begins the network-entry connection process to a base station by listening for the base
station's transmissions (which may be organized into multiple channels); these message streams contain
regular management messages containing, among other things, information about available data rates in
each direction.
Also included in the base station's message stream is information about start times for ranging intervals.
The station waits for one of these intervals and sends a range-request message to the base station. These
ranging intervals are open to all stations attempting network entry, and if another station transmits at the
same time there will be a collision. However, network entry is only done once (for a given base station) and
so the likelihood of a collision in any one ranging interval is small. An Ethernet/Wi-Fi-like exponential-backoff process is used if a collision does occur. Ranging intervals are the only times when collisions can
occur; afterwards, all station transmissions are scheduled by the base station.
If there is no collision, the base station responds, and the station now knows the propagation delay and thus
can determine when to transmit so that its data arrives at the base station exactly at a specified time. The
station also determines its transmission signal strength from this ranging process.
Finally, and perhaps most importantly, the station receives from the base station its first timeslot for a
scheduled transmission. These timeslot assignments are included in regular uplink-map packets broadcast
by the base station. Each station's timeslot includes both a start time and a total length; lengths are in the
range of 2 to 20 ms. Future timeslots will be allocated as necessary by the base station, in future uplink-map
packets. Scheduled timeslots may be periodic (as would be appropriate for voice) or may occur at varying
intervals. WiMAX stations may request any of several quality-of-service levels and the base station may
take these requests into account when determining the schedule. The base station also creates a downlink
schedule, but this does not need to be communicated to the subscriber stations; the base station simply uses
it to decide what to broadcast when to the stations. When scheduling the timeslots, the base station may also
take into account availability of multiple transmission channels and of directional antennas.
Through the uplink-map schedules and individual ranging, each station transmits so that one transmission
finishes arriving just before the next transmission begins arriving, as seen from the perspective of the base
station. Only minimal guard intervals need be included between consecutive transmissions. Two (or
more) consecutive transmissions may in fact be in the air simultaneously, as far-away stations need to
begin transmitting early so their signals will arrive at the base station at the expected time. The following
diagram illustrates this for stations separated by relatively large physical distances.
Mobile stations will need to update their ranging information regularly, but this can be done through future
scheduled transmissions. The distance to the base station is used not only for the mobile station's transmission timing, but also to determine its power level; signals from each mobile station, no matter where located,
should arrive at the base station with about the same power.
When a station has data to send, it includes in its next scheduled transmission a request for a longer transmission interval; if the request is granted, the station may send the data (or at least some of the data) in its
next scheduled transmission slot. When a station is done transmitting, its timeslot shrinks to the minimum,
and may be scheduled less frequently as well, but it does not disappear. Stations without data to send remain
connected to the base station by sending empty messages during these slots.
Trees vs Signal
Photo of the author attempting to improve his 2.4 GHz terrestrial-wireless signal via tree trimming.
Terrestrial fixed wireless was originally popularized for rural areas, where residential density is too low for
economical cable connections. However, some fixed-wireless ISPs now operate in urban areas, often using
WiMAX. One advantage of terrestrial fixed-wireless in remote areas is that the antennas cover a much
smaller geographical area than a satellite, generally meaning that there is more data bandwidth available per
user and the cost per megabyte is much lower.
Outdoor subscriber antennas often use a parabolic dish to improve reception; sizes range from 10 to 50 cm
in diameter. The size of the dish may depend on the distance to the central tower.
While there are standardized fixed-wireless systems, such as WiMAX, there are also a number of proprietary alternatives, including systems from Trango and Canopy. Fixed-wireless systems might, in fact, be
considered one of the last bastions of proprietary LAN protocols. This lack of standardization is due to a
variety of factors; two primary ones are the relatively modest overall demand for this service and the fact
that most antennas need to be professionally installed by the ISP to ensure that they are properly mounted,
aligned, grounded and protected from lightning.
Packets will be transmitted in one direction (clockwise in the ring above). Stations in effect forward most
packets around the ring, although they can also remove a packet. (It is perhaps more accurate to think
of the forwarding as representing the default cable connectivity; non-forwarding represents the stations
momentarily breaking that connectivity.)
When the network is idle, all stations agree to forward a special, small packet known as a token. When a
station, say A, wishes to transmit, it must first wait for the token to arrive at A. Instead of forwarding the
token, A then transmits its own packet; this travels around the network and is then removed by A. At that
point (or in some cases at the point when A finishes transmitting its data packet) A then forwards the token.
In a small ring network, the ring circumference may be a small fraction of one packet. Ring networks
become large at the point when some packets may be entirely in transit on the ring. Slightly different
solutions apply in each case. (It is also possible that the physical ring exists only within the token-ring
switch, and that stations are connected to that switch using the usual point-to-point wiring.)
If all stations have packets to send, then we will have something like the following:
A waits for the token
A sends a packet
A sends the token to B
B sends a packet
B sends the token to C
C sends a packet
C sends the token to D
...
All stations get an equal number of chances to transmit, and no bandwidth is wasted on collisions.
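This round-robin service pattern can be sketched as a toy model (purely illustrative, not any real MAC implementation):

```python
# A sketch of token-ring turn-taking: a station transmits only while
# holding the token, then passes the token to the next station.
def token_ring_rounds(stations, rounds):
    """Return the order in which stations get to transmit."""
    order = []
    for _ in range(rounds):
        for s in stations:        # the token circulates A, B, C, ...
            order.append(s)       # s holds the token and may send one packet
    return order

# Two full token rotations around a three-station ring:
assert token_ring_rounds(["A", "B", "C"], 2) == ["A", "B", "C", "A", "B", "C"]
```

Every station appears exactly once per rotation, which is the fairness property noted above.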
One problem with token ring is that when stations are powered off it is essential that the packets continue
forwarding; this is usually addressed by having the default circuit configuration be to keep the loop closed.
Another issue is that some station has to watch out in case the token disappears, or in case a duplicate token
appears.
Because of fairness and the lack of collisions, IBM Token Ring was once considered to be the premium LAN
mechanism. As such, a premium price was charged (there was also the matter of licensing fees). But due
to a combination of lower hardware costs and higher bitrates (even taking collisions into account), Ethernet
eventually won out.
There was also a much earlier collision-free hybrid of 10 Mbps Ethernet and Token Ring known as Token
Bus: an Ethernet physical network (often linear) was used with a token-ring-like protocol layer above
that. Stations were physically connected to the (linear) Ethernet but were assigned identifiers that logically
arranged them in a (virtual) ring. Each station had to wait for the token and only then could transmit a
packet; after that it would send the token on to the next station in the virtual ring. As with real Token
Ring, some mechanisms need to be in place to monitor for token loss.
Token Bus Ethernet never caught on. The additional software complexity was no doubt part of the problem,
but perhaps the real issue was that it was not necessary.
Because connections are taken to be bidirectional, a VCI used from S1 to S3 cannot be reused from S3 to S1 until the first connection
closes.
A to F #1: A ──4── S1 ──6── S2 ──4── S4 ──8── S5 ──2── F
A to E: A ──5── S1 ──6── S3 ──3── S4 ──8── E
A to C: A ──6── S1 ──7── S3 ──3── C
B to D: B ──4── S3 ──8── S1 ──7── S2 ──8── D
A to F #2: A ──7── S1 ──8── S2 ──5── S4 ──9── S5 ──── F
One may verify that on any one link no two different paths use the same VCI.
We now construct the actual ⟨VCI,port⟩ tables for the switches S1-S4 from the above; the table for S5 is left
as an exercise. Note that either the ⟨VCIin,portin⟩ pair or the ⟨VCIout,portout⟩ pair can be used as the key; we cannot
have the same pair in both the in columns and the out columns. It may help to display the port numbers for
each switch, as in the upper numbers in the following diagram of the above red connection from A to F (lower
numbers are the VCIs):
Switch S1:

VCIin   portin   VCIout   portout   connection
  4       0        6        2       A→F #1
  5       0        6        1       A→E
  6       0        7        1       A→C
  8       1        7        2       B→D
  7       0        8        2       A→F #2

Switch S2:

VCIin   portin   VCIout   portout   connection
  6       0        4        1       A→F #1
  7       0        8        2       B→D
  8       0        5        1       A→F #2

Switch S3:

VCIin   portin   VCIout   portout   connection
  6       3        3        2       A→E
  7       3        3        1       A→C
  4       0        8        3       B→D

Switch S4:

VCIin   portin   VCIout   portout   connection
  4       3        8        2       A→F #1
  3       0        8        1       A→E
  5       3        9        2       A→F #2
The namespace for VCIs is small, and compact (eg contiguous). Typically the VCI and port bitfields can be
concatenated to produce a ⟨VCI,port⟩ composite value small enough that it is suitable for use as an array
index. VCIs work best as local identifiers. IP addresses, on the other hand, need to be globally unique, and
thus are often rather sparsely distributed.
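The per-switch forwarding these tables support can be sketched in a few lines, here using Switch S1's entries; each arriving ⟨port,VCI⟩ pair maps to an outbound ⟨port,VCI⟩ pair:

```python
# A sketch of per-switch virtual-circuit forwarding, using Switch S1's
# table above: the key is the (port, VCI) pair on which a cell arrives,
# the value the (port, VCI) with which it is sent out.
S1 = {
    (0, 4): (2, 6),   # A->F #1
    (0, 5): (1, 6),   # A->E
    (0, 6): (1, 7),   # A->C
    (1, 8): (2, 7),   # B->D
    (0, 7): (2, 8),   # A->F #2
}

def forward(table, port_in, vci_in):
    port_out, vci_out = table[(port_in, vci_in)]
    return port_out, vci_out

# A packet on the first A-to-F connection arrives at S1 on port 0
# with VCI 4; it leaves on port 2, rewritten to carry VCI 6:
assert forward(S1, 0, 4) == (2, 6)
```

Because the keys are small and dense, a real switch could replace the dictionary with a direct array lookup, which is the fast-lookup advantage noted above.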
Virtual-circuit switching offers the following advantages:
connections can get quality-of-service guarantees, because the switches are aware of connections and
can reserve capacity at the time the connection is made
headers are smaller, allowing faster throughput
headers are small enough to allow efficient support for the very small packet sizes that are optimal for
voice connections. ATM packets, for instance, have 48 bytes of data; see below.
Datagram forwarding, on the other hand, offers these advantages:
Routers have less state information to manage.
Router crashes and partial connection state loss are not a problem.
If a router or link is disabled, rerouting is easy and does not affect any connection state. (As mentioned
in Chapter 1, this was Paul Baran's primary concern in his 1962 paper introducing packet switching.)
Per-connection billing is very difficult.
The last point above may once have been quite important; in the era when the ARPANET was being developed, typical daytime long-distance rates were on the order of $1/minute. It is unlikely that early TCP/IP
protocol development would have been as fertile as it was had participants needed to justify per-minute
billing costs for every project.
It is certainly possible to do virtual-circuit switching with globally unique VCIs (say, the concatenation of
source and destination IP addresses and port numbers). The IP-based RSVP protocol (19.6 RSVP) does
exactly this. However, the fast-lookup and small-header advantages of a compact namespace are then lost.
Note that virtual-circuit switching does not suffer from the problem of idle channels still consuming resources, which is an issue with circuits using time-division multiplexing (eg shared T1 lines).
An IP packet is segmented into ATM cells as it enters the network; reassembly is done at exit from the ATM path. IPv4 fragmentation, on the other hand,
applies conceptually to IP packets, and may be performed by routers within the network.
For AAL 3/4, we first define a high-level wrapper for an IP packet, called the CS-PDU (Convergence
Sublayer - Protocol Data Unit). This prefixes 32 bits on the front and another 32 bits (plus padding) on the
rear. We then chop this into as many 44-byte chunks as are needed; each chunk goes into a 48-byte ATM
payload, along with the following 32 bits worth of additional header/trailer:
2-bit type field:
10: begin new CS-PDU
00: continue CS-PDU
01: end of CS-PDU
11: single-segment CS-PDU
4-bit sequence number, 0-15, good for catching up to 15 dropped cells
10-bit MessageID field
CRC-10 checksum.
We now have a total of 9 bytes of header for 44 bytes of data; this is more than 20% overhead. This did not
sit well with the IP-over-ATM community (such as it was), and so AAL 5 was developed.
AAL 5 moved the checksum to the CS-PDU and increased it to 32 bits from 10 bits. The MID field was
discarded, as no one used it, anyway (if you wanted to send several different types of messages, you simply
created several virtual circuits). A bit from the ATM header was taken over and used to indicate:
1: start of new CS-PDU
0: continuation of an existing CS-PDU
The CS-PDU is now chopped into 48-byte chunks, which are then used as the entire body of each ATM
cell. With 5 bytes of header for 48 bytes of data, overhead is down to 10%. Errors are detected by the
CS-PDU CRC-32. This also detects lost cells (impossible with a per-cell CRC!), as we no longer have any
cell sequence number.
For both AAL3/4 and AAL5, reassembly is simply a matter of stringing together consecutive cells in order
of arrival, starting a new CS-PDU whenever the appropriate bits indicate this. For AAL3/4 the receiver
has to strip off the 4-byte AAL3/4 headers; for AAL5 the receiver has to verify the CRC-32 checksum once
all cells are received. Different cells from different virtual circuits can be jumbled together on the ATM
backbone, but on any one virtual circuit the cells from one higher-level packet must be sent one right after
the other.
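The segmentation-and-reassembly process just described can be sketched as follows. This is a simplified illustration in the spirit of AAL 5: the real CS-PDU trailer (length field, CRC-32) is omitted, so the original length is carried out-of-band here, and cells are represented as (end_bit, payload) pairs rather than in the actual ATM cell format.

```python
CELL_PAYLOAD = 48   # AAL 5 uses the entire 48-byte ATM payload for data

def segment(cs_pdu: bytes):
    """Chop a CS-PDU into 48-byte cell payloads; the final cell is padded
    and marked with end_bit=1 (the bit AAL 5 borrows from the ATM header)."""
    cells = []
    for i in range(0, len(cs_pdu), CELL_PAYLOAD):
        chunk = cs_pdu[i:i + CELL_PAYLOAD]
        last = i + CELL_PAYLOAD >= len(cs_pdu)
        cells.append((1 if last else 0, chunk.ljust(CELL_PAYLOAD, b'\x00')))
    return cells

def reassemble(cells, length: int) -> bytes:
    """String consecutive cells together until one with end_bit=1 appears.
    (Real AAL 5 recovers the length from the CS-PDU trailer instead.)"""
    data = b''
    for end_bit, payload in cells:
        data += payload
        if end_bit:
            break
    return data[:length]

packet = bytes(range(100))              # a 100-byte "IP packet"
cells = segment(packet)
assert len(cells) == 3                  # 48 + 48 + 4 bytes (plus padding)
assert reassemble(cells, len(packet)) == packet
```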
A typical IP packet divides into about 20 cells. For AAL 3/4, this means a total of 200 bits devoted to CRC
codes, versus only 32 bits for AAL 5. It might seem that AAL 3/4 would be more reliable because of this,
but, paradoxically, it was not! The reason for this is that errors are rare, and so we typically have one or at
most two per CS-PDU. Suppose we have only a single error, ie a single cluster of corrupted bits small enough
that it is likely confined to a single cell. In AAL 3/4 the CRC-10 checksum will fail to detect that error (that
is, the checksum of the corrupted packet will by chance happen to equal the checksum of the original packet)
with probability 1/2^10. The AAL 5 CRC-32 checksum, however, will fail to detect the error with probability
1/2^32. Even if there are enough errors that two cells are corrupted, the two CRC-10s together will fail to
detect the error with probability 1/2^20; the CRC-32 is better. AAL 3/4 is more reliable only when we have
errors in at least four cells, at which point we might do better to switch to an error-correcting code.
Moral: one checksum over the entire message is often better than multiple shorter checksums over parts of
the message.
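The arithmetic behind this comparison is quick to verify:

```python
# Probability that a corrupted CS-PDU nonetheless passes its checksum(s).
p_one_crc10 = 2**-10   # one corrupted cell, AAL 3/4: one CRC-10 must fail
p_two_crc10 = 2**-20   # two corrupted cells, AAL 3/4: both CRC-10s must fail
p_crc32     = 2**-32   # AAL 5: the single CRC-32 must fail either way

assert p_one_crc10 > p_crc32    # one error: the CRC-32 is far more reliable
assert p_two_crc10 > p_crc32    # even with two errors, the CRC-32 still wins
print(f"CRC-10: {p_one_crc10:.1e}  two CRC-10s: {p_two_crc10:.1e}  CRC-32: {p_crc32:.1e}")
```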
3.9 Epilog
Along with a few niche protocols, we have focused primarily here on wireless and on virtual circuits. Wireless, of course, is enormously important: it is the enabler for mobile devices, and has largely replaced
traditional Ethernet for home and office workstations.
While it is sometimes tempting (in the IP world at least) to write off ATM as a niche technology, virtual
circuits are a serious conceptual alternative to datagram forwarding. As we shall see in 19 Quality of
Service, IP has problems handling real-time traffic, and virtual circuits offer a solution. The Internet has
so far embraced only small steps towards virtual circuits (such as MPLS, 19.12 Multi-Protocol Label
Switching (MPLS)), but they remain a tantalizing strategy.
3.10 Exercises
1. Suppose remote host A uses a VPN connection to connect to host B, with IP address [Link]. A's
normal Internet connection is via device eth0 with IP address [Link]; A's VPN connection is via device
ppp0 with IP address [Link]. Whenever A wants to send a packet via ppp0, it is encapsulated and
forwarded over the connection to B at [Link].
(a). Suppose A's IP forwarding table is set up so that all traffic to [Link] uses eth0 and all traffic to
anywhere else uses ppp0. What happens if an intruder M attempts to open a connection to A at [Link]?
What route will packets from A to M take?
(b). Suppose A's IP forwarding table is (mis)configured so that all outbound traffic uses ppp0. Describe
what will happen when A tries to send a packet.
2. Suppose remote host A wishes to use a TCP-based VPN connection to connect to host B, with IP address
[Link]. However, the VPN software is not available for host A. Host A is, however, able to run that
software on a virtual machine V hosted by A; A and V have respective IP addresses [Link] and [Link]
on the virtual network connecting them. V reaches the outside world through network address translation
(1.14 Network Address Translation), with A acting as V's NAT router. When V runs the VPN software,
it forwards packets addressed to B the usual way, through A using NAT. Traffic to any other destination it
encapsulates over the VPN.
Can A configure its IP forwarding table so that it can make use of the VPN? If not, why not? If so, how? (If
you prefer, you may assume V is a physical host connecting to a second interface on A; A still acts as V's
NAT router.)
3. Token Bus was a proprietary Ethernet-based network. It worked like Token Ring in that a small token
packet was sent from one station to the next in agreed-upon order, and a station could transmit only when it
had just received the token.
(a). If the data rate is 10 Mbps and the token is 64 bytes long (the 10-Mbps Ethernet minimum packet size),
how long does it take on average to send a packet on an idle network with 40 stations? Ignore the
propagation delay and the gap Ethernet requires between packets.
(b). Repeat part (a) assuming the tokens are only 16 bytes long.
(c). Sketch a protocol by which stations can sort themselves out to decide the order of token transmission;
that is, an order of the stations S0 ... Sn-1 where station Si sends the token to station S(i+1) mod n .
4. The IEEE 802.11 standard states that transmission of the ACK frame shall commence after a SIFS period,
without regard to the busy/idle state of the medium; that is, the ACK sender does not listen first for an idle
network. Give a scenario in which the Wi-Fi ACK frame would fail to be delivered in the absence of this
rule, but succeed with it. Hint: this is another example of the hidden-node problem, [Link] Hidden-Node
Problem.
5. Suppose the average contention interval in a Wi-Fi network (802.11g) is 64 SlotTimes. The average
packet size is 1 KB, and the data rate is 54 Mbps. At that data rate, it takes about (8×1000)/54 ≈ 148 µsec
to transmit a packet.
6. WiMAX subscriber stations are not expected to hear one another at all. For Wi-Fi non-access-point
stations in an infrastructure (access-point) setting, on the other hand, listening to other non-access-point
transmissions is encouraged.
(a). List some ways in which Wi-Fi non-access-point stations in an infrastructure (access-point) network do
sometimes respond to packets sent by other non-access-point stations. The responses need not be in the
form of transmissions.
(b). Explain why Wi-Fi stations cannot be required to respond as in part (a).
7. Suppose WiMAX subscriber stations can be moving, at speeds of up to 33 meters/sec (the maximum
allowed under 802.16e).
(a). How much earlier (or later) can one subscriber packet arrive? Assume that the ranging process updates
the station's propagation delay once a minute. The speed of light is about 300 meters/µsec.
(b). With 5000 senders per second, how much time out of each second must be spent on guard intervals
accommodating the early/late arrivals above? You will need to double the time from part (a), as the base
station cannot tell whether the signal from a moving subscriber will arrive earlier or later.
8. [SM90] contained a proposal for sending IP packets over ATM as N cells as in AAL-5, followed by one
cell containing the XOR of all the previous cells. This way, the receiver can recover from the loss of any
one cell. Suppose N=20 here; with the SM90 mechanism, each packet would require 21 cells to transmit;
that is, we always send 5% more. Suppose the cell loss-rate is p (presumably very small). If we send 20
cells without the SM90 mechanism, we have a probability of about 20p that any one cell will be lost, and we
will have to retransmit the entire 20 again. This gives an average retransmission amount of about 20p extra
packets. For what value of p do the with-SM90 and the without-SM90 approaches involve about the same
total number of cell transmissions?
9. In the example in 3.7 Virtual Circuits, give the VCI table for switch S5.
10. Suppose we have the following network:
(Diagram: switches S1, S2, S3, S4 form a square, with links S1–S2, S1–S3, S2–S4 and S3–S4; host A attaches to S1, B to S2, C to S3 and D to S4.)
The virtual-circuit switching tables are below. Ports are identified by the node at the other end. Identify all
the connections. Give the path for each connection and the VCI on each link of the path.
Switch S1:

VCIin   portin   VCIout   portout
1       A        2        S3
2       A        2        S2
3       A        3        S2

Switch S2:

VCIin   portin   VCIout   portout
2       S4       1        B
2       S1       3        S4
3       S1       4        S4

Switch S3:

VCIin   portin   VCIout   portout
2       S1       2        S4
3       S4       2        C

Switch S4:

VCIin   portin   VCIout   portout
2       S3       2        S2
3       S2       3        S3
4       S2       1        D
11. In the same network as in the previous exercise (hosts A, B, C and D attached to switches S1, S2, S3
and S4), give virtual-circuit switching tables for the following connections. Route via a shortest path.
(a). AD
(b). CB, via S4
(c). BD
(d). AD, via whichever of S2 or S3 was not used in part (a)
12. Below is a set of switches S1 through S4. Define VCI-table entries so the virtual circuit from A to B
follows the path
A S1 S2 S4 S3 S1 S2 S4 S3 B
That is, each switch is visited twice.
(Diagram: host A attaches to S1 and host B to S3; the switches form a square, with links S1–S2, S2–S4, S4–S3 and S3–S1.)
4 LINKS
At the lowest (logical) level, network links look like serial lines. In this chapter we address how packet
structures are built on top of serial lines, via encoding and framing. Encoding determines how bits and
bytes are represented on a serial line; framing allows the receiver to identify the beginnings and endings of
packets.
We then conclude with the high-speed serial lines offered by the telecommunications industry, T-carrier and
SONET, upon which almost all long-haul point-to-point links that tie the Internet together are based.
4.1 Encoding and Framing

4.1.1 NRZ
NRZ (Non-Return to Zero) is perhaps the simplest encoding; it corresponds to direct bit-by-bit transmission
of the 0s and 1s in the data. We have two signal levels, lo and hi; we set the signal to one or the other
of these depending on whether the data bit is 0 or 1, as in the diagram below. Note that in the diagram the
signal bits have been aligned with the start of the pulse representing that signal value.
NRZ replaces an earlier RZ (Return to Zero) encoding, in which hi and lo corresponded to +1 and -1, and
between each pair of pulses corresponding to consecutive bits there was a brief return to the 0 level.
One drawback to NRZ is that we cannot distinguish between 0-bits and a signal that is simply idle. However,
the more serious problem is the lack of synchronization: during long runs of 0s or long runs of 1s, the
receiver can lose count, eg if the receiver's clock is running a little fast or slow. The receiver's clock can
and does resynchronize whenever there is a transition from one level to the other. However, suppose bits
are sent at one per µsec, the sender sends 5 1-bits in a row, and the receiver's clock is running 10% fast. The
signal sent is a 5-µsec hi pulse, but when the pulse ends the receiver's clock reads 5.5 µsec due to the clock
speedup. Should this represent 5 1-bits or 6 1-bits?
4.1.2 NRZI
An alternative that helps here (though not obviously at first) is NRZI, or NRZ Inverted. In this encoding,
we represent a 0-bit as no change, and a 1-bit as a transition from lo to hi or hi to lo:
Now there is a signal transition aligned above every 1-bit; a 0-bit is represented by the lack of a transition.
This solves the synchronization problem for runs of 1-bits, but does nothing to address runs of 0-bits.
However, NRZI can be combined with techniques to minimize runs of 0-bits, such as 4B/5B (below).
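A minimal sketch of NRZI in code, representing the lo and hi signal levels as 0 and 1:

```python
def nrzi_encode(bits, level=0):
    """NRZI: a 1-bit toggles the signal level; a 0-bit leaves it unchanged."""
    out = []
    for b in bits:
        if b == 1:
            level ^= 1          # transition
        out.append(level)
    return out

def nrzi_decode(levels, prev=0):
    """A transition decodes to 1; no transition decodes to 0."""
    bits = []
    for lv in levels:
        bits.append(1 if lv != prev else 0)
        prev = lv
    return bits

data = [0, 1, 1, 0, 1, 0, 0, 1]
assert nrzi_decode(nrzi_encode(data)) == data
```

Note that runs of 0-bits decode correctly here only because a list of samples has no clock drift; on a real line, the lack of transitions during a 0-run is exactly the synchronization problem described above.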
4.1.3 Manchester
Manchester encoding sends the data stream using NRZI, with the addition of a clock transition between
each pair of consecutive data bits. This means that the signaling rate is now double the data rate, eg 20
MHz for 10Mbps Ethernet (which does use Manchester encoding). The signaling is as if we doubled the
bandwidth and inserted a 1-bit between each pair of consecutive data bits, removing this extra bit at the
receiver:
All these transitions mean that the longest the clock has to count is 1 bit-time; clock synchronization is
essentially solved, at the expense of the doubled signaling rate.
4.1.4 4B/5B
In 4B/5B encoding, for each 4-bit nybble of data we actually transmit a designated 5-bit symbol, or code,
selected to have enough 1-bits. A symbol in this sense is a digital or analog transmission unit that decodes
to a set of data bits; the data bits are not transmitted individually.
Specifically, every 5-bit symbol used by 4B/5B has at most one leading 0-bit and at most two trailing 0-bits.
The 5-bit symbols corresponding to the data are then sent with NRZI, where runs of 1s are safe. Note that
the worst-case run of 0-bits has length three. Note also that the signaling rate here is 1.25 times the data
rate. 4B/5B is used in 100-Mbps Ethernet, 2.2 100 Mbps (Fast) Ethernet. The mapping between 4-bit data
values and 5-bit symbols is fixed by the 4B/5B standard:
data    symbol        data      symbol
0000    11110         1011      10111
0001    01001         1100      11010
0010    10100         1101      11011
0011    10101         1110      11100
0100    01010         1111      11101
0101    01011         IDLE      11111
0110    01110         HALT      00100
0111    01111         START     10001
1000    10010         END       01101
1001    10011         RESET     00111
1010    10110         DEAD      00000
There are more than sixteen possible symbols; this allows for some symbols to be used for signaling rather
than data. IDLE, HALT, START, END and RESET are shown above, though there are others. These can be
used to include control and status information without fear of confusion with the data. Some combinations
of control symbols do lead to up to four 0-bits in sequence; HALT and RESET have two leading 0-bits.
10-Mbps and 100-Mbps Ethernet pads short packets up to the minimum packet size with 0-bytes, meaning
that the next protocol layer has to be able to distinguish between padding and actual 0-byte data. Although
100-Mbps Ethernet uses 4B/5B encoding, it does not make use of special non-data symbols for packet
padding. Gigabit Ethernet uses PAM-5 encoding (2.3 Gigabit Ethernet), and does use special non-data
symbols (inserted by the hardware) to pad packets; there is thus no ambiguity at the receiving end as to
where the data bytes ended.
The choice of 5-bit symbols for 4B/5B is in principle arbitrary; note however that for data from 0100 to
1101 we simply insert a 1 in the fourth position, and in the last two we insert a 0 in the fourth position. The
first four symbols (those with the most zeroes) follow no obvious pattern, though.
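For encoding, the table above can be used directly. The sketch below handles data symbols only, sending the high nybble of each byte first (an assumption of this sketch), and double-checks the leading/trailing-zero property:

```python
# 4B/5B data symbols, from the table above.
FOUR_B_FIVE_B = {
    0b0000: '11110', 0b0001: '01001', 0b0010: '10100', 0b0011: '10101',
    0b0100: '01010', 0b0101: '01011', 0b0110: '01110', 0b0111: '01111',
    0b1000: '10010', 0b1001: '10011', 0b1010: '10110', 0b1011: '10111',
    0b1100: '11010', 0b1101: '11011', 0b1110: '11100', 0b1111: '11101',
}

def encode_4b5b(data: bytes) -> str:
    """Encode each byte as two 5-bit symbols, high nybble first."""
    out = []
    for byte in data:
        out.append(FOUR_B_FIVE_B[byte >> 4])
        out.append(FOUR_B_FIVE_B[byte & 0x0F])
    return ''.join(out)

# Every data symbol has at most one leading and two trailing 0-bits,
# so no concatenation of data symbols contains four consecutive 0-bits.
for sym in FOUR_B_FIVE_B.values():
    assert not sym.startswith('00') and not sym.endswith('000')
assert '0000' not in encode_4b5b(bytes(range(256)))
```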
4.1.5 Framing
How does a receiver tell when one packet stops and the next one begins, to keep them from running together?
We have already seen the following techniques for addressing this framing problem of determining where
packets end:
Interpacket gaps (as in Ethernet)
4B/5B and special bit patterns
Putting a length field in the header would also work, in principle, but seems not to be widely used. One
problem with this technique is that restoring order after desynchronization can be difficult.
There is considerable overlap of framing with encoding; for example, the existence of non-data bit patterns
in 4B/5B is due to an attempt to solve the encoding problem; these special patterns can also be used as
unambiguous frame delimiters.
[Link] HDLC
HDLC (High-level Data Link Control) is a general link-level packet format used for a number of applications, including Point-to-Point Protocol (PPP) (which in turn is used for PPPoE, "PPP over Ethernet", which
is how a great many Internet subscribers connect to their ISP), and Frame Relay, still used as the low-level
protocol for delivering IP packets to many sites via telecommunications lines. HDLC supports the following
two methods for frame separation:
HDLC over asynchronous links: byte stuffing
HDLC over synchronous links: bit stuffing
The basic encapsulation format for HDLC packets is to begin and end each frame with the byte 0x7E, or, in
binary, 0111 1110. The problem is that this byte may occur in the data as well; we must make sure we don't
misinterpret such a data byte as the end of the frame.
Asynchronous serial lines are those with some sort of start/stop indication, typically between bytes; such
lines tend to be slower. Over this kind of line, HDLC uses the byte 0x7D as an escape character. Any
data bytes of 0x7D and 0x7E are escaped by preceding them with an additional 0x7D. (Actually, they are
transmitted as 0x7D followed by (original_byte xor 0x20).) This strategy is fundamentally the same as that
used by C-programming-language character strings: the string delimiter is " and the escape character is \.
Any occurrences of " or \ within the string are escaped by preceding them with \.
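The byte-stuffing rule can be sketched as follows; the frame here is just flag + stuffed data + flag, with the rest of the HDLC header omitted:

```python
FLAG, ESC = 0x7E, 0x7D

def byte_stuff(data: bytes) -> bytes:
    """Escape any 0x7D or 0x7E as 0x7D followed by (byte XOR 0x20)."""
    out = bytearray()
    for b in data:
        if b in (FLAG, ESC):
            out += bytes([ESC, b ^ 0x20])
        else:
            out.append(b)
    return bytes([FLAG]) + bytes(out) + bytes([FLAG])

def byte_unstuff(frame: bytes) -> bytes:
    """Strip the flag bytes and undo the escaping."""
    body = frame[1:-1]
    out, i = bytearray(), 0
    while i < len(body):
        if body[i] == ESC:
            out.append(body[i + 1] ^ 0x20)
            i += 2
        else:
            out.append(body[i])
            i += 1
    return bytes(out)

data = bytes([0x01, 0x7E, 0x7D, 0xFF])
assert byte_unstuff(byte_stuff(data)) == data
```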
Over synchronous serial lines (typically faster than asynchronous), HDLC generally uses bit stuffing. The
underlying bit encoding involves, say, the reverse of NRZI, in which transitions denote 0-bits and lack of
transitions denote 1-bits. This means that long runs of 1s are now the problem and runs of 0s are safe.
Whenever five consecutive 1-bits appear in the data, eg 011111, a 0-bit is then inserted, or stuffed, by the
transmitting hardware (regardless of whether or not the next data bit is also a 1). The HDLC frame byte of
0x7E = 0111 1110 thus can never appear as encoded data, because it contains six 1-bits in a row. If we had
0x7E in the data, it would be transmitted as 0111 11010.
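Bit stuffing can be sketched the same way, with the bit stream represented as a list of 0s and 1s (detection of the 0x7E flag itself is omitted in this sketch):

```python
def bit_stuff(bits):
    """After five consecutive 1-bits, stuff a 0-bit (regardless of what follows)."""
    out, run = [], 0
    for b in bits:
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == 5:
            out.append(0)       # the stuffed bit
            run = 0
    return out

def bit_unstuff(bits):
    """Remove the 0-bit that follows any five consecutive 1-bits."""
    out, run, skip = [], 0, False
    for b in bits:
        if skip:                # this is a stuffed 0; drop it
            skip = False
            run = 0
            continue
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == 5:
            skip = True
            run = 0
    return out

seven_e = [0, 1, 1, 1, 1, 1, 1, 0]                     # 0x7E as data bits
assert bit_stuff(seven_e) == [0, 1, 1, 1, 1, 1, 0, 1, 0]   # 0111 11010, as in the text
assert bit_unstuff(bit_stuff(seven_e)) == seven_e
```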
The HDLC receiver knows that any 0-bit following five consecutive 1-bits is stuffed, and removes it; six 1-bits in a row can only be the 0x7E frame delimiter.
This double-violation is the clue to the receiver that the special pattern is to be removed and replaced with
the original eight 0-bits.
Once upon a time it was not uncommon to link computers with serial lines, rather than packet networks.
This was most often done for file transfers, but telnet logins were also done this way. The problem with this
approach is that the line had to be dedicated to one application (or one user) at a time.
Packet switching naturally implements multiplexing (sharing) on links; the demultiplexer is the destination
address. Port numbers allow demultiplexing of multiple streams to the same destination host.
There are other ways for multiple channels to share a single wire. One approach is frequency-division
multiplexing, or putting each channel on a different carrier frequency. Analog cable television did this.
Some fiber-optic protocols also do this, calling it wavelength-division multiplexing.
But perhaps the most pervasive alternative to packets is the voice telephone system's time division multiplexing, or TDM, sometimes prefixed with the adjective synchronous. The idea is that we decide on a
number of channels, N, and the length of a timeslice, T, and allow each sender to send over the channel for
time T, with the senders taking turns in round-robin style. Each sender gets to send for time T at regular
intervals of NT, thus receiving 1/N of the total bandwidth. The timeslices consume no bandwidth on headers
or addresses, although sometimes there is a small amount of space dedicated to maintaining synchronization
between the two endpoints. Here is a diagram of sending with N=8:
Note, however, that if a sender has nothing to send, its timeslice cannot be used by another sender. Because
so much data traffic is bursty, involving considerable idle periods, TDM has traditionally been rejected for
data networks.
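A sketch of one round of the round-robin timeslice mechanism, illustrating why idle senders waste bandwidth:

```python
def tdm_frame(senders, timeslice=1):
    """One round of synchronous TDM: each of the N senders gets its slot
    whether or not it has anything to send; idle slots are simply wasted."""
    frame = []
    for queue in senders:
        if queue:
            frame.append(queue.pop(0))   # this sender's turn
        else:
            frame.append(None)           # idle slot: no one else may use it
    return frame

# N=8 senders; only two have data queued, so six slots per round are wasted
senders = [['a1'], [], [], ['d1'], [], [], [], []]
assert tdm_frame(senders) == ['a1', None, None, 'd1', None, None, None, None]
```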
The next most common T-carrier / Digital Signal line is perhaps T3/DS3; this represents the TDM multiplexing of 28 DS1 signals. The problem is that some individual DS1s may run a little slow, so an elaborate pulse
stuffing protocol has been developed. This allows extra bits to be inserted at specific points, if necessary, in
such a way that the original component T1s can be exactly recovered even if there are clock irregularities.
The pulse-stuffing solution did not scale well, and so T-carrier levels past T3 were very rarely used.
While T-carrier was originally intended as a way of bundling together multiple DS0 channels on a single
high-speed line, it also allows providers to offer leased digital point-to-point links with data rates in almost
any multiple of the DS0 rate.
4.2.2 SONET
SONET stands for Synchronous Optical NETwork; it is the telecommunications industry's standard mechanism for very-high-speed TDM over optical fiber. While there is now flexibility regarding the optical
part, the synchronous part is taken quite seriously indeed, and SONET senders and receivers all use very
precisely synchronized clocks (often atomic). The actual bit encoding is NRZI.
Due to the frame structure, below, the longest possible run of 0-bits is ~250 bits (~30 bytes), but is usually
much less. Accurate reception of 250 0-bits requires a clock accurate to within (at a minimum) one part in
500, which is generally within reach. This mechanism solves most of the clock-synchronization problem,
though SONET also has a resynchronization protocol in case the receiver gets lost.
The primary reason for SONETs accurate clocking, however, is not the clock-synchronization problem as
we have been using the term, but rather the problem of demultiplexing and remultiplexing multiple component bitstreams in a setting in which some of the streams may run slow. One of the primary design goals for
SONET was to allow such multiplexing without the need for pulse stuffing, as is used in the Digital Signal
hierarchy. SONET tributary streams are in effect not allowed to run slow (although SONET does provide
for occasional very small byte slips, below). Furthermore, as multiple SONET streams are demultiplexed
at a switching center and then remultiplexed into new SONET streams, synchronization means that none of
the streams falls behind or gets ahead.
The basic SONET format is known as STS-1. Data is organized as a 9x90 byte grid. The first 3 bytes of
each row (that is, the first three columns) form the frame header. Frames are not addressed; SONET is a
point-to-point protocol and a node sends a continuous sequence of frames to each of its neighbors. When the
frames reach their destination, in principle they need to be fully demultiplexed for the data to be forwarded
on. In practice, there are some shortcuts to full demultiplexing.
The actual bytes sent are scrambled: the data is XORed with a standard, fixed pseudorandom pattern before
transmission. This introduces many 1-bits, on which clock resynchronization can occur, with a high degree
of probability.
There are two other special columns in a frame, each guaranteed to contain at least one 1-bit, so the maximum run of 0-bytes in the data is limited to ~30; this is thus the longest possible run of 0s.
The first two bytes of each frame are 0xF628. SONET's frame-synchronization check is based on verifying
these byte values at the start of each frame. If the receiver is ever desynchronized, it begins a frame resynchronization procedure: the receiver searches for those 0xF628 bytes at regular 810-byte (6480-bit)
spacing. After a few frames with 0xF628 in the right place, the receiver is very sure it is looking at the
synchronization bytes and not at a data-byte position. Note that there is no evident byte boundary to a
SONET frame, so the receiver must check for 0xF628 beginning at every bit position.
SONET frames are transmitted at a rate of 8,000 frames/second. This is the canonical byte sampling rate
for standard voice-grade (DS0, or 64 Kbps) lines. Indeed, the classic application of SONET is to transmit
multiple DS0 voice calls using TDM: within a frame, each data byte position is given over to one voice
channel. The same byte position in consecutive frames constitutes one byte every 1/8000 seconds. The
basic STS-1 data rate of 51.84 Mbps is exactly 810 bytes/frame × 8 bits/byte × 8000 frames/sec.
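This arithmetic is worth spelling out:

```python
# STS-1: 9 rows x 90 columns = 810 bytes per frame, 8000 frames per second.
bytes_per_frame = 9 * 90
assert bytes_per_frame == 810
bits_per_sec = bytes_per_frame * 8 * 8000
assert bits_per_sec == 51_840_000       # 51.84 Mbps, exactly

# One byte position, recurring once per frame, is one DS0 voice channel:
assert 8 * 8000 == 64_000               # 64 Kbps
```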
To a customer who has leased a SONET-based channel to transmit data, a SONET link looks like a very fast
bitstream. There are several standard ways of encoding data packets over SONET. One is to encapsulate the
data as ATM cells, and then embed the cells contiguously in the bitstream. Another is to send IP packets
encoded in the bitstream using HDLC-like bit stuffing, which means that the SONET bytes and the IP bytes
may no longer correspond. The advantage of HDLC encoding is that it makes SONET re-synchronization
vanishingly infrequent. Most IP backbone traffic today travels over SONET links.
Within the 9×90-byte STS-1 frame, the payload envelope is the 9×87 region nominally following the three
header columns; this payload region has its own three reserved columns, meaning that there are 84 columns
(9×84 bytes) available for data. This 9×87-byte payload envelope can float within the physical 9×90-byte
frame; that is, if the input frames are running slow then the output physical frames can be transmitted
at the correct rate by letting the payload frames slip backwards, one byte at a time. Similarly, if the input
frames are arriving slightly too fast, they can slip forwards by up to one byte at a time; the extra byte is
stored in a reserved location in the three header columns of the 9×90 physical frame.
Faster SONET streams are made by multiplexing slower ones. The next step up is STS-3; an STS-3 frame
is three STS-1 frames, for 9×270 bytes. STS-3 (or, more properly, the physical layer for STS-3) is also
called OC-3, for Optical Carrier. Beyond STS-3, faster lines are multiplexed combinations of four of the
next-slowest lines. Here are some of the higher levels:
STS       STM       bandwidth
STS-1     STM-0     51.84 Mbps
STS-3     STM-1     155.52 Mbps
STS-12    STM-4     622.08 Mbps (= 12 × 51.84, exactly)
STS-48    STM-16    2.488 Gbps
STS-192   STM-64    9.953 Gbps
STS-768   STM-256   39.8 Gbps
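Every rate in the table is an exact multiple of the STS-1 rate; the Gbps figures are simply rounded:

```python
STS1_MBPS = 51.84

def sts_rate_mbps(n: int) -> float:
    """STS-n runs at exactly n times the STS-1 rate."""
    return n * STS1_MBPS

assert round(sts_rate_mbps(3), 2) == 155.52
assert round(sts_rate_mbps(12), 2) == 622.08
assert round(sts_rate_mbps(48) / 1000, 3) == 2.488     # Gbps
assert round(sts_rate_mbps(192) / 1000, 3) == 9.953    # Gbps
assert round(sts_rate_mbps(768) / 1000, 1) == 39.8     # Gbps
```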
Faster SONET lines have been defined, but a simpler way to achieve very high data rates over optical fiber is
to use wavelength-division multiplexing (that is, frequency-division multiplexing at optical frequencies);
this means we have separate SONET channels at different wavelengths of light.
SONET provides a wide variety of leasing options at various bandwidths. High-volume customers can lease
an entire STS-1 or larger unit. Alternatively, the 84 columns of an STS-1 frame can be divided into seven
virtual tributary groups, each of twelve columns; these groups can be leased individually or in multiples,
or be further divided into as few as three columns (which works out to be just over the T1 data rate).
[Link] Other Optical Fiber
4.3 Epilog
This completes our discussion of common physical links. Perhaps the main takeaway point is that transmitting bits over any distance is not quite as simple as it may appear; simple NRZ transmission is not effective.
4.4 Exercises
1. What is encoded by the following NRZI signal? The first two bits are shown.
2. Argue that sending 4 0-bits via NRZI requires a clock accurate to within 1 part in 8. Assume that the
receiver resynchronizes its clock whenever a 1-bit transition is received, but that otherwise it attempts to
sample a bit in the middle of the bit's timeslot.
3.(a) What bits are encoded by the following Manchester-encoded sequence?
(b). Why is there no ambiguity as to whether the first transition is a clock transition or a data (1-bit)
transition?
(c). Give an example of a signal pattern consisting of an NRZI encoding of 0-bits and 1-bits that does not
contain two consecutive 0-bits and which is not a valid Manchester encoding of data. Such a pattern could
thus be used as a special non-data marker.
4. What three ASCII letters (bytes) are encoded by the following 4B/5B pattern? (Be careful about uppercase
vs lowercase.)
010110101001110101010111111110
5.(a) Suppose a device is forwarding SONET STS-1 frames. How much clock drift, as a percentage, on
the incoming line would mean that the output payload envelopes must slip backwards by one byte per three
physical frames?

(b). In 4.2.2 SONET it was claimed that sending 250 0-bits required a clock accurate to
within 1 part in 500. Describe how a SONET clock might meet the requirement of part (a) above, and yet
fail at this second requirement. (Hint: in part (a) the requirement is a long-term average.)
5 PACKETS
In this chapter we address a few abstract questions about packets, and take a close look at transmission
times. We also consider how big packets should be, and how to detect transmission errors. These issues are
independent of any particular set of protocols.
Finally, a switch may or may not also introduce queuing delay; this will often depend on competing traffic.
We will look at this in more detail in 14 Dynamics of TCP Reno, but for now note that a steady queuing delay
(eg due to a more-or-less constant average queue utilization) looks to each sender more like propagation
delay than bandwidth delay, in that if two packets are sent back-to-back and arrive that way at the queue,
then the pair will experience only a single queuing delay.
Case 2: Like the previous example, except that the propagation delay is increased to 4 ms.
The total transmit time is now 4200 µsec = 200 µsec + 4000 µsec.
Case 3: A──R──B
We now have two links, each with propagation delay 40 µsec; bandwidth and packet size as in Case 1.
The total transmit time for one 200-byte packet is now 480 µsec = 240 + 240. There are two propagation
delays of 40 µsec each; A introduces a bandwidth delay of 200 µsec and R introduces a store-and-forward
delay (or second bandwidth delay) of 200 µsec.
Case 4: A──R──B, as in Case 3, but with the 200-byte packet sent as two 100-byte packets.
These ladder diagrams represent the full transmission; a snapshot state of the transmission at any one
instant can be obtained by drawing a horizontal line. In the middle (case 3) diagram, for example, at no
instant are both links active. Note that sending two smaller packets is faster than one large packet. We
expand on this important point below.
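The case-by-case arithmetic above can be captured in a small store-and-forward calculator. The 8 Mbps bit rate below is not stated explicitly in the text; it is inferred from the 200 µsec bandwidth delay for a 200-byte (1600-bit) packet, and is an assumption of this sketch:

```python
def transfer_time(packet_bits: float, links) -> float:
    """Store-and-forward total: each link contributes a bandwidth delay
    (packet_bits / bit_rate) plus its propagation delay. Times in seconds."""
    return sum(packet_bits / bit_rate + prop for bit_rate, prop in links)

# Case 3: a 200-byte packet over two links, each with 40 usec propagation delay
links = [(8e6, 40e-6), (8e6, 40e-6)]
t = transfer_time(200 * 8, links)
assert abs(t - 480e-6) < 1e-9           # 480 usec = 240 + 240
```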
Now let us consider the situation when the propagation delay is the most significant component. The cross-continental US roundtrip delay is typically around 50-100 ms (propagation speed 200 km/ms in cable, 5,000-10,000 km cable route, or about 3,000-6,000 miles); we will use 100 ms in the examples here. At 1.0 Mbit, 100 ms
is about 12KB, or eight full-sized Ethernet packets. At this bandwidth, we would have four packets and four
returning ACKs strung out along the path. At 1.0 Gbit, in 100ms we can send 12,000 KB, or 800 Ethernet
packets, before the first ACK returns.
At most non-LAN scales, the delay is typically simplified to the round-trip time, or RTT: the time between
sending a packet and receiving a response.
Different delay scenarios have implications for protocols: if a network is bandwidth-limited then protocols
are easier to design. Extra RTTs do not cost much, so we can build in a considerable amount of back-and-forth exchange. However, if a network is delay-limited, the protocol designer must focus on minimizing
extra RTTs. As an extreme case, consider wireless transmission to the moon (roughly 2.6 sec RTT), or to Jupiter (roughly 1
hour RTT).
At my home I formerly had satellite Internet service, which had a roundtrip propagation delay of ~600 ms.
This is remarkably high when compared to purely terrestrial links.
When dealing with reasonably high-bandwidth large-scale networks (eg the Internet), to good approximation most of the non-queuing delay is propagation, and so bandwidth and total delay are effectively
independent. Only when propagation delay is small are the two interrelated. Because propagation delay
dominates at this scale, we can often make simplifications when diagramming. In the illustration below, A
sends a data packet to B and receives a small ACK in return. In (a), we show the data packet traversing
several switches; in (b) we show the data packet as if it were sent along one long unswitched link, and in (c)
we introduce the idealization that bandwidth delay (and thus the width of the packet line) no longer matters.
(Most later ladder diagrams in this book are of this type.)
The bandwidth delay product (usually involving round-trip delay, or RTT), represents how much we can
send before we hear anything back, or how much is pending in the network at any one time if we send
continuously. Note that, if we use RTT instead of one-way time, then half the pending packets will be
returning ACKs. Here are a few values:

RTT      bandwidth   bandwidth × delay
1 ms     10 Mbps     1.2 KB
100 ms   1.5 Mbps    20 KB
100 ms   600 Mbps    8,000 KB
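A sketch of the computation behind the table. The table's entries are rough; with KB taken as 1000 bytes, the exact values are 1.25 KB, 18.75 KB and 7,500 KB:

```python
def bdp_bits(rtt_sec: float, bits_per_sec: float) -> float:
    """Bandwidth x (round-trip) delay: bits in flight when sending continuously."""
    return rtt_sec * bits_per_sec

assert round(bdp_bits(1e-3, 10e6) / 8000, 2) == 1.25       # ~1.2 KB
assert round(bdp_bits(100e-3, 1.5e6) / 8000, 2) == 18.75   # ~20 KB
assert round(bdp_bits(100e-3, 600e6) / 8000, 1) == 7500.0  # ~8,000 KB
```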
Alternatively, perhaps routers are allowed to reserve a varying amount of bandwidth for high-priority traffic,
depending on demand, and so the bandwidth allocated to the best-effort traffic can vary. Perceived link
bandwidth can also vary over time if packets are compressed at the link layer, and some packets are able to
be compressed more than others.
Finally, if mobile nodes are involved, then the distance and thus the propagation delay can change. This can
be quite significant if one is communicating with a wireless device that is being taken on a cross-continental
road trip.
Despite these sources of fluctuation, we will usually assume that RTTnoLoad is fixed and well-defined, especially when we wish to focus on the queuing component of delay.
A──R1──R2──R3──R4──B
Suppose we send either one big packet or five smaller packets. The relative times from A to B are illustrated
in the following figure:
99
The point is that we can take advantage of parallelism: while the R4──B link above is handling packet 1,
the R3──R4 link is handling packet 2 and the R2──R3 link is handling packet 3, and so on. The five smaller
packets would have five times the header capacity, but as long as headers are small relative to the data, this
is not a significant issue.
The sliding-windows algorithm, used by TCP, uses this idea as a continuous process: the sender sends a
continual stream of packets which travel link-by-link so that, in the full-capacity case, all links may be in
use at all times.
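The speedup can be quantified with the store-and-forward total (M+N-1)×BD for M equal packets crossing N links, where BD is one packet's bandwidth delay on one link (exercise 7 at the end of this chapter asks for a proof). A sketch, not from the book, with made-up numbers (5000 data bytes, 1 byte/µsec links, the five-link path above):

```python
# Store-and-forward timing (ignoring propagation delay and headers): M
# equal packets over N links take (M+N-1) link-transmission times.  A
# sketch, not from the book; the numbers below are made up.
def transit_time(total_bytes, num_packets, links, bytes_per_us):
    per_link = total_bytes / num_packets / bytes_per_us   # one packet, one link
    return (num_packets + links - 1) * per_link           # in microseconds

# 5000 data bytes over the five links A--R1--R2--R3--R4--B at 1 byte/usec:
print(transit_time(5000, 1, 5, 1.0))   # one big packet:    25000 usec
print(transit_time(5000, 5, 5, 1.0))   # five small packets: 9000 usec
```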
average seven times; lossless transmission would require 50 packets but we in fact need 7×50 = 350 packets,
or 7,000,000 bits.
Moral: choose the packet size small enough that most packets do not encounter errors.
To be fair, very large packets can be sent reliably on most cable links (eg TDM and SONET). Wireless,
however, is more of a problem.
in binary; if one adds two positive integers and the sum does not overflow the hardware word size, then
ones-complement and the now-universal twos-complement are identical. To form the ones-complement sum
of 16-bit words A and B, first take the ordinary twos-complement sum A+B. Then, if there is an overflow
bit, add it back in as the low-order bit. Thus, if the word size is 4 bits, the ones-complement sum of 0101 and
0011 is 1000 (no overflow). Now suppose we want the ones-complement sum of 0101 and 1100. First we
take the exact sum and get 1|0001, where the leftmost 1 is an overflow bit past the 4-bit wordsize. Because
of this overflow, we add this bit back in, and get 0010.
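The fold-the-overflow-back-in rule is easy to express in code. This Python sketch is not from the book, and the function names are made up:

```python
# Ones-complement addition of 16-bit (or 4-bit) words: take the ordinary
# twos-complement sum, then fold any overflow back in as the low-order bit.
# A sketch, not from the book; the function names are made up.
def ones_complement_add(a, b, bits=16):
    s = a + b
    mask = (1 << bits) - 1
    while s > mask:                    # fold the overflow bit(s) back in
        s = (s & mask) + (s >> bits)
    return s

def internet_checksum(words):
    """Complement of the ones-complement sum of a list of 16-bit words."""
    total = 0
    for w in words:
        total = ones_complement_add(total, w)
    return total ^ 0xFFFF

# The 4-bit examples from the text:
assert ones_complement_add(0b0101, 0b0011, bits=4) == 0b1000
assert ones_complement_add(0b0101, 0b1100, bits=4) == 0b0010
```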
The ones-complement numeric representation has two forms for zero: 0000 and 1111 (it is straightforward
to verify that any 4-bit quantity plus 1111 yields the original quantity; in twos-complement notation 1111
represents -1, and an overflow is guaranteed, so adding back the overflow bit cancels the -1 and leaves us
with the original number). It is a fact that the ones-complement sum is never 0000 unless all bits of all the
summands are 0; if the summands add up to zero by coincidence, then the actual binary representation will
be 1111. This means that we can use 0000 in the checksum to represent "checksum not calculated", which
the UDP protocol used to permit.
Ones-complement
Long ago, before Loyola had any Internet connectivity, I wrote a primitive UDP/IP stack to allow me
to use the Ethernet to back up one machine that did not have TCP/IP to another machine that did. We
used private IP addresses of the form 10.0.0.x. I set as many header fields to zero as I could. I paid
no attention to how to implement ones-complement addition; I simply used twos-complement, for the
IP header only, and did not use a UDP checksum at all. Hey, it worked.
Then we got a real Class B address block [Link]/16, and changed IP addresses. My software no
longer worked. It turned out that, in the original version, the IP header bytes were all small enough
that when I added up the 16-bit words there were no carries, and so ones-complement was the same as
twos-complement. With the new addresses, this was no longer true. As soon as I figured out how to
implement ones-complement addition properly, my backups worked again.
There is another way to look at the (16-bit) ones-complement sum: it is in fact the remainder upon dividing
the message (seen as a very long binary number) by 2^16 - 1. This is similar to the decimal "casting out nines"
rule: if we add up the digits of a base-10 number, and repeat the process until we get a single digit, then that
digit is the remainder upon dividing the original number by 10 - 1 = 9. The analogy here is that the message
is looked at as a very large number written in base 2^16, where the digits are the 16-bit words. The process
of repeatedly adding up the digits until we get a single digit amounts to taking the ones-complement
sum of the words.
A weakness of any error-detecting code based on sums is that transposing words leads to the same sum, and
the error is not detected. In particular, if a message is fragmented and the fragments are reassembled in the
wrong order, the ones-complement sum will likely not detect it.
While some error-detecting codes are better than others at detecting certain kinds of systematic errors (for
example, CRC, below, is usually better than the Internet checksum at detecting transposition errors), ultimately
the effectiveness of an error-detecting code depends on its length. Suppose a packet P1 is corrupted
randomly into P2, but still has its original N-bit error code EC(P1). This N-bit code will fail to detect the
error that has occurred if EC(P2) is, by chance, equal to EC(P1). The probability that two random N-bit
codes will match is 1/2^N (though a small random change in P1 might not lead to a uniformly distributed
random change in EC(P1); see the tail end of the CRC section below).
This does not mean, however, that one packet in 2^N will be received incorrectly, as most packets are
error-free. If we use a 16-bit error code, and only 1 packet in 100,000 is actually corrupted, then the rate at which
corrupted packets will sneak by is only 1 in 100,000 × 65,536, or about one in 6×10^9. If packets are 1500
bytes, you have a good chance (90+%) of accurately transferring a terabyte, and a 37% chance (1/e) at ten
terabytes.
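The terabyte and ten-terabyte figures follow from a Poisson approximation. A sketch, not from the book; the function name is made up:

```python
import math

# Probability that a transfer completes with no corrupted packet evading a
# 16-bit error code, assuming (as in the text) that 1 packet in 100,000 is
# corrupted and a corrupted packet slips through with probability 1/65536.
# A sketch, not from the book; the function name is made up.
def p_clean(total_bytes, packet_bytes=1500):
    packets = total_bytes / packet_bytes
    p_sneak = (1 / 100_000) * (1 / 65_536)   # per-packet undetected error
    return math.exp(-packets * p_sneak)      # Poisson approximation

print(p_clean(1e12))    # one terabyte: about 0.90
print(p_clean(1e13))    # ten terabytes: about 0.36, roughly 1/e
```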
P2 guarantees that CRC(P1) ≠ CRC(P2). For the Internet checksum, this is not guaranteed even if we know
only two bits were changed.
Finally, there are also secure hashes, such as MD-5 and SHA-1 and their successors. Nobody knows
(or admits to knowing) how to produce two messages with the same hash here. However, these secure-hash
codes are generally not used in network error-correction as they take considerable time to compute; they are
generally used only for secure authentication and other higher-level functions.
Now suppose one bit is corrupted; for simplicity, assume it is one of the data bits. Then exactly one
column-parity bit will be incorrect, and exactly one row-parity bit will be incorrect. These two incorrect bits mark
the column and row of the incorrect data bit, which we can then flip to the correct state.
We can make N large, but an essential requirement here is that there be only a single corrupted bit per square.
We are thus likely either to keep N small, or to choose a different code entirely that allows correction of
multiple bits. Either way, the addition of error-correcting codes can easily increase the size of a packet
significantly; some codes double or even triple the total number of bits sent.
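The locate-and-flip procedure for 2-D parity can be sketched directly. This code is not from the book, and the function names are made up:

```python
# 2-D even parity: append a parity bit to each row, then a parity row over
# the columns.  A single flipped data bit makes exactly one row and one
# column have odd parity, which locates it.  A sketch, not from the book;
# the function names are made up.
def add_parity(block):
    with_row = [row + [sum(row) % 2] for row in block]
    parity_row = [sum(col) % 2 for col in zip(*with_row)]
    return with_row + [parity_row]

def correct_single_error(coded):
    """Find and flip a single corrupted bit; returns its (row, col)."""
    bad_rows = [r for r, row in enumerate(coded) if sum(row) % 2 != 0]
    bad_cols = [c for c, col in enumerate(zip(*coded)) if sum(col) % 2 != 0]
    if bad_rows and bad_cols:
        r, c = bad_rows[0], bad_cols[0]
        coded[r][c] ^= 1
        return (r, c)
    return None                        # no single-bit error found

block = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
coded = add_parity(block)
coded[1][2] ^= 1                       # corrupt one data bit in transit
assert correct_single_error(coded) == (1, 2)
assert coded == add_parity(block)      # fully repaired
```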
The Hamming code is another popular error-correction code; it adds O(log N) additional bits, though if N is
large enough for this to be a material improvement over the O(N^1/2) performance of 2-D parity then errors
must be very infrequent. If we have 8 data bits, let us number the bit positions 0 through 7. We then write
each bit's position as a binary value between 000 and 111; we will call these the position bits of the given
data bit. We now add four code bits as follows:
1. a parity bit over all 8 data bits
2. a parity bit over those data bits for which the first digit of the position bits is 1 (these are positions 4,
5, 6 and 7)
3. a parity bit over those data bits for which the second digit of the position bits is 1 (these are positions
010, 011, 110 and 111, or 2, 3, 6 and 7)
4. a parity bit over those data bits for which the third digit of the position bits is 1 (these are positions
001, 011, 101, 111, or 1, 3, 5 and 7)
We can tell whether or not an error has occurred by the first code bit; the remaining three code bits then tell
us the respective three position bits of the incorrect bit. For example, if the #2 code bit above is correct, then
the first digit of the position bits is 0; otherwise it is one. With all three position bits, we have identified the
incorrect data bit.
As a concrete example, suppose the data word is 10110010. The four code bits are thus
1. 0, the (even) parity bit over all eight bits
2. 1, the parity bit over the second half, positions 4-7 (the bits 0010)
3. 1, the parity bit over positions 2, 3, 6 and 7 (the bits 1,1,1,0)
4. 1, the parity bit over positions 1, 3, 5 and 7 (the bits 0,1,0,0)
If the received data+code is now 10111010 0111, with the data bit in position 4 flipped, then the fact that the first code
bit is wrong tells the receiver there was an error. The second code bit is also wrong, so the first bit of the
position bits must be 1. The third code bit is right, so the second bit of the position bits must be 0. The
fourth code bit is also right, so the third bit of the position bits is 0. The position bits are thus binary 100, or
4, and so the receiver knows that the incorrect bit is in position 4 (counting from 0) and can be flipped to the
correct state.
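The encode-and-locate procedure just described can be written out in a few lines. This Python sketch is not from the book, and the function names are made up:

```python
# The Hamming scheme described above: 8 data bits (positions 0-7) plus four
# code bits -- overall parity, then parity over the positions whose binary
# representation has a 1 in each of its three digits.  A sketch, not from
# the book; the function names are made up.
def hamming_code(data):                # data: list of 8 bits, position 0 first
    code = [sum(data) % 2]
    for digit in (4, 2, 1):
        code.append(sum(b for pos, b in enumerate(data) if pos & digit) % 2)
    return code

def find_error(data, code):
    """Locate a single flipped data bit; None if the data checks out."""
    syndrome = [a ^ b for a, b in zip(hamming_code(data), code)]
    if syndrome[0] == 0:
        return None                    # overall parity OK: no data-bit error
    return syndrome[1] * 4 + syndrome[2] * 2 + syndrome[3]

data = [1, 0, 1, 1, 0, 0, 1, 0]        # the text's example, 10110010
assert hamming_code(data) == [0, 1, 1, 1]
received = [1, 0, 1, 1, 1, 0, 1, 0]    # 10111010: position-4 bit flipped
assert find_error(received, [0, 1, 1, 1]) == 4
```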
5.5 Epilog
The issues presented here are perhaps not very glamorous, and often play a supporting, behind-the-scenes
role in protocol design. Nonetheless, their influence is pervasive; we may even think of them as part of the
underlying physics of the Internet.
As the early Internet became faster, for example, and propagation delay became the dominant limiting
factor, protocols were often revised to limit the number of back-and-forth exchanges. A classic example is
the Simple Mail Transfer Protocol (SMTP), amended by RFC 1854 to allow multiple commands to be sent
together (pipelining) instead of individually.
While there have been periodic calls for large-packet support in IPv4, and IPv6 protocols exist for jumbograms in excess of a megabyte, these are very seldom used, due to the store-and-forward costs of large
packets as described in 5.3 Packet Size.
Almost every LAN-level protocol, from Ethernet to Wi-Fi to point-to-point links, incorporates an
error-detecting code chosen to reflect the underlying transportation reliability. Ethernet includes a 32-bit CRC
code, for example, while Wi-Fi includes extensive error-correcting codes due to the noisier wireless
environment. The Wi-Fi fragmentation option ([Link] Wi-Fi Fragmentation) is directly tied to 5.3.1 Error
Rates and Packet Size.
5.6 Exercises
1. Suppose a link has a propagation delay of 20 µsec and a bandwidth of 2 bytes/µsec.
(a). How long would it take to transmit a 600-byte packet over such a link?
(b). How long would it take to transmit the 600-byte packet over two such links, with a store-and-forward
switch in between?
2. Suppose the path from A to B has a single switch S in between: A───S───B, with a per-link propagation
delay of 60 µsec.

(a). How long would it take to send a single 600-byte packet from A to B?
(b). How long would it take to send two back-to-back 300-byte packets from A to B?
(c). How long would it take to send three back-to-back 200-byte packets from A to B?
3. Repeat parts (a) and (b) of the previous exercise, except change the per-link propagation delay from 60
µsec to 600 µsec.
4. Again suppose the path from A to B has a single switch S in between: A───S───B. The per-link bandwidth
and propagation delays are as follows:

link | bandwidth    | propagation delay
A──S | 5 bytes/µsec | 24 µsec
S──B | 3 bytes/µsec | 13 µsec
(a). How long would it take to send a single 600-byte packet from A to B?
(b). How long would it take to send two back-to-back 300-byte packets from A to B? Note that, because
the S──B link is slower, packet 2 arrives at S from A well before S has finished transmitting packet 1 to B.
5. Suppose in the previous exercise, the A──S link has the smaller bandwidth of 3 bytes/µsec and the S──B
link has the larger bandwidth of 5 bytes/µsec. Now how long does it take to send two back-to-back 300-byte
packets from A to B?
6. Suppose we have five links, A───R1───R2───R3───R4───B, each with the same bandwidth (in
bytes/ms). Assume we model the per-link propagation delay as 0.
7. Suppose there are N equal-bandwidth links on the path between A and B, as in the diagram below, and
we wish to send M consecutive packets.
A───S1─── ... ───SN-1───B
Let BD be the bandwidth delay of a single packet on a single link, and let PD be the propagation delay on a
single link. Show that the total bandwidth delay is (M+N-1)×BD, and the total propagation delay is N×PD.
Hint: When does the Mth packet leave A? What is its total transit time to B? Why do no packets have to wait
at any Si for the completion of the transmission of an earlier packet?
8. Repeat the analysis in 5.3.1 Error Rates and Packet Size to compare the probable total number of bytes
that need to be sent to transmit 10^7 bytes using
Assume the bit error rate is 1 in 16×10^5, making the error rate per byte about 1 in 2×10^5.
9. In the text it is claimed there is no N-bit error code that catches all N-bit errors for N≥2 (for N=1, a
parity bit works). Prove this claim for N=2. Hint: pick a length M, and consider all M-bit messages with a
single 1-bit. Any such message can be converted to any other with a 2-bit error. Show, using the Pigeonhole
Principle, that for large enough M two messages m1 and m2 must have the same error code, that is, e(m1 ) =
e(m2 ). If this occurs, then the error code fails to detect the error that converted m1 into m2 .
10. In the description in the text of the Internet checksum, the overflow bit was added back in after each
ones-complement addition. Show that the same final result will be obtained if we add up the 16-bit words
using 32-bit twos-complement arithmetic (the normal arithmetic on all modern hardware), and then add the
upper 16 bits of the sum to the lower 16 bits. (If there is an overflow at this last step, we have to add that
back in as well.)
11. Suppose a message is 110010101. Calculate the CRC-3 checksum using the polynomial X^3 + 1, that is,
find the 3-bit remainder using divisor 1001.
12. The CRC algorithm presented above requires that we process one bit at a time. It is possible to do the
algorithm N bits at a time (eg N=8), with a precomputed lookup table of size 2^N. Complete the steps in the
following description of this strategy for N=3 and polynomial X^3 + X + 1, or 1011.
13. Consider the following set of bits sent with 2-D even parity; the data bits are in the 4×4 upper-left block
and the parity bits are in the rightmost column and bottom row. Which bit is corrupted?
14. (a) Show that 2-D parity can detect any three errors.
(b). Find four errors that cannot be detected by 2-D parity.
(c). Show that 2-D parity cannot correct all two-bit errors. Hint: put both bits in the same row or
column.
15. Each of the following 8-bit messages with 4-bit Hamming code contains a single error. Correct the
message.
16. (a) What happens in 2-D parity if the corrupted bit is in the parity column or parity row?
(b). In the following 8-bit message with 4-bit Hamming code, there is an error in the code portion. How
can this be determined?
1001 1110 0100
In this chapter we take a general look at how to build reliable data-transport layers on top of unreliable
lower layers. This is achieved through a retransmit-on-timeout policy; that is, if a packet is transmitted
and there is no acknowledgment received during the timeout interval then the packet is resent. As a class,
protocols where one side implements retransmit-on-timeout are known as ARQ protocols, for Automatic
Repeat reQuest.
In addition to reliability, we also want to keep as many packets in transit as the network can support. The
strategy used to achieve this is known as sliding windows. It turns out that the sliding-windows algorithm
is also the key to managing congestion; we return to this in 13 TCP Reno and Congestion Management.
The End-to-End principle, 12.1 The End-to-End Principle, suggests that trying to achieve a reliable transport layer by building reliability into a lower layer is a misguided approach; that is, implementing reliability
at the endpoints of a connection as is described here is in fact the correct mechanism.
The right half of the diagram, by comparison, illustrates the case of a lost ACK. The receiver has received
a duplicate Data[N]. We have assumed here that the receiver has implemented a retransmit-on-duplicate
strategy, and so its response upon receipt of the duplicate Data[N] is to retransmit ACK[N].
As a final example, note that it is possible for ACK[N] to have been delayed (or, similarly, for the first
Data[N] to have been delayed) longer than the timeout interval. Not every packet that times out is actually
lost!
In this case we see that, after sending Data[N], receiving a delayed ACK[N] (rather than the expected
ACK[N+1]) must be considered a normal event.
In principle, either side can implement retransmit-on-timeout if nothing is received. Either side can also
implement retransmit-on-duplicate; this was done by the receiver in the second example above but not by
the sender in the third example (the sender received a second ACK[N] but did not retransmit Data[N+1]).
At least one side must implement retransmit-on-timeout; otherwise a lost packet leads to deadlock as the
sender and the receiver both wait forever. The other side must implement at least one of
retransmit-on-duplicate or retransmit-on-timeout; usually the former alone. If both sides implement retransmit-on-timeout
with different timeout values, generally the protocol will still work.
Sorcerer's Apprentice
The Sorcerer's Apprentice bug is named for the legend in which the apprentice casts a spell
on a broom to carry water, one bucket at a time. When the basin is full, the apprentice
chops the broom in half, only to find both halves carrying water. See Disney's Fantasia,
[Link] at around T = 5:35.
A strange thing happens if one side implements retransmit-on-timeout but both sides implement
retransmit-on-duplicate, as can happen if the implementer takes the naive view that retransmitting on duplicates is
"safer"; the moral here is that too much redundancy can be the Wrong Thing. Let us imagine that an
implementation uses this strategy, and that the initial ACK[3] is delayed until after Data[3] is retransmitted
on timeout. In the following diagram, the only packet retransmitted due to timeout is the second Data[3]; all
the other duplications are due to the retransmit-on-duplicate strategy.
All packets are sent twice from Data[3] on. The transfer completes normally, but takes double the normal
bandwidth.
Window Size
In this chapter we will assume winsize does not change. TCP, however, varies winsize up and down
with the goal of making it as large as possible without introducing congestion; we will return to this in
13 TCP Reno and Congestion Management.
At any instant, the sender may send packets numbered last_ACKed + 1 through last_ACKed + winsize; this
packet range is known as the window. Generally, if the first link in the path is not the slowest one, the sender
will most of the time have sent all these.
If ACK[N] arrives with N > last_ACKed (typically N = last_ACKed+1), then the window slides forward; we
set last_ACKed = N. This also increments the upper edge of the window, and frees the sender to send more
packets. For example, with winsize = 4 and last_ACKed = 10, the window is [11,12,13,14]. If ACK[11]
arrives, the window slides forward to [12,13,14,15], freeing the sender to send Data[15]. If instead ACK[13]
arrives, then the window slides forward to [14,15,16,17] (recall that ACKs are cumulative), and three more
packets become eligible to be sent. If there is no packet reordering and no packet losses (and every packet
is ACKed individually) then the window will slide forward in units of one packet at a time; the next arriving
ACK will always be ACK[last_ACKed+1].
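The window arithmetic in this paragraph can be written down directly. A sketch, not from the book; the function name is made up:

```python
# Sliding the window on a cumulative ACK, per the description above.
# The window is [last_ACKed+1, last_ACKed+winsize]; ACK[N] with
# N > last_ACKed moves the bottom edge to N.  A sketch, not from the
# book; the function name is made up.
def slide(last_acked, winsize, ack):
    if ack <= last_acked or ack > last_acked + winsize:
        return last_acked, []          # old or out-of-range ACK: no change
    newly_eligible = list(range(last_acked + winsize + 1, ack + winsize + 1))
    return ack, newly_eligible

# The example from the text: winsize 4, last_ACKed 10, window [11..14]
assert slide(10, 4, 11) == (11, [15])
assert slide(10, 4, 13) == (13, [15, 16, 17])
```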
Note that the rate at which ACKs are returned will always be exactly equal to the rate at which the slowest
link is delivering packets. That is, if the slowest link (the bottleneck link) is delivering a packet every
50 ms, then the receiver will receive those packets every 50 ms and the ACKs will return at a rate of
one every 50 ms. Thus, new packets will be sent at an average rate exactly matching the delivery rate;
this is the sliding-windows self-clocking property. Self-clocking has the effect of reducing congestion by
automatically reducing the sender's rate whenever the available fraction of the bottleneck bandwidth is
reduced.
Here is a video of sliding windows in action, with winsize = 5. The second link, R──B, has a capacity of five
packets in transit either way; the A──R link has a capacity of one packet in transit either way.
The packets in the very first RTT represent connection setup. This particular video also demonstrates TCP
Slow Start: in the first data-packet RTT, two packets are sent, and in the second data RTT four packets
are sent. The full window size of five is reached in the third data RTT. For the rest of the connection, at
any moment (except those instants where packets have just been received) there are five packets in flight,
either being transmitted on a link as data, or being transmitted as an ACK, or sitting in a queue (this last
does not happen in the video).
As will become clearer below, a winsize smaller than bandwidth × RTT means underutilization of the network, while a larger
winsize means each packet spends time waiting in a queue somewhere.
Below are simplified diagrams for sliding windows with window sizes of 1, 4 and 6, each with a path
bandwidth of 6 packets/RTT (so bandwidth × RTT = 6 packets). The diagram shows the initial packets sent
as a burst; these then would be spread out as they pass through the bottleneck link so that, after the first
burst, packet spacing is uniform. (Real sliding-windows protocols such as TCP generally attempt to avoid
such initial bursts.)
With winsize=1 we send 1 packet per RTT; with winsize=4 we always average 4 packets per RTT. To put
this another way, the three window sizes lead to bottleneck-link utilizations of 1/6, 4/6 and 6/6 = 100%,
respectively.
While it is tempting to envision setting winsize to bandwidth × RTT, in practice this can be complicated;
neither bandwidth nor RTT is constant. Available bandwidth can fluctuate in the presence of competing
traffic. As for RTT, if a sender sets winsize too large then the RTT is simply inflated to the point that
bandwidth × RTT matches winsize; that is, a connection's own traffic can inflate RTTactual to well above
RTTnoLoad. This happens even in the absence of competing traffic.
The slow links are R2──R3 and R3──R4. We will refer to the slowest link as the bottleneck link; if there are
(as here) ties for the slowest link, then the first such link is the bottleneck. The bottleneck link is where the
queue will form. If traffic is sent at a rate of 4 packets/ms from A to B, it will pile up in an ever-increasing
queue at R2. Traffic will not pile up at R3; it arrives at R3 at the same rate by which it departs.
Furthermore, if sliding windows is used (rather than a fixed-rate sender), traffic will eventually not queue up
at any router other than R2: data cannot reach B faster than the 3 packets/ms rate, and so B will not return
ACKs faster than this rate, and so A will eventually not send data faster than this rate. At this 3 packets/ms
rate, traffic will not pile up at R1 (or R3 or R4).
There is a significant advantage in speaking in terms of winsize rather than transmission rate. If A sends to
B at any rate greater than 3 packets/ms, then the situation is unstable as the bottleneck queue grows without
bound and there is no convergence to a steady state. There is no analogous instability, however, if A uses
sliding windows, even if the winsize chosen is quite large (although a large-enough winsize will overflow
the bottleneck queue). If a sender specifies a sending window size rather than a rate, then the network will
converge to a steady state in relatively short order; if a queue develops it will be steadily replenished at the
same rate that packets depart, and so will be of fixed size.
We will assume that in the backward B→A direction, all connections are infinitely fast, meaning zero
delay; this is often a good approximation because ACK packets are what travel in that direction and they
are negligibly small. In the A→B direction, we will assume that the A→R1 link is infinitely fast, but
the other four each have a bandwidth of 1 packet/second (and no propagation-delay component). This
makes the R1→R2 link the bottleneck link; any queue will now form at R1. The path bandwidth is 1
packet/second, and the RTT is 4 seconds.
As a roughly equivalent alternative example, we might use the following:
C───S1───S2───D
with the following assumptions: the C──S1 link is infinitely fast (zero delay), S1──S2 and S2──D each
take 1.0 sec bandwidth delay (so two packets take 2.0 sec, per link, etc), and ACKs also have a 1.0 sec
bandwidth delay in the reverse direction.
In both scenarios, if we send one packet, it takes 4.0 seconds for the ACK to return, on an idle network. This
means that the no-load delay, RTTnoLoad , is 4.0 seconds.
(These models will change significantly if we replace the 1 packet/sec bandwidth delay with a 1-second
propagation delay; in the former case, 2 packets take 2 seconds, while in the latter, 2 packets take 1 second.
See exercise 4.)
We assume a single connection is made; ie there is no competition. Bandwidth × delay here is 4 packets (1
packet/sec × 4 sec RTT).
[Link] Case 1: winsize = 2
In this case winsize < bandwidth × delay (where delay = RTT). The table below shows what is sent by A and
each of R1-R4 for each second. Every packet is acknowledged 4 seconds after it is sent; that is, RTTactual
= 4 sec, equal to RTTnoLoad; this will remain true as the winsize changes by small amounts (eg to 1 or 3).
Throughput is proportional to winsize: when winsize = 2, throughput is 2 packets in 4 seconds, or 2/4 =
1/2 packet/sec. During each second, two of the routers R1-R4 are idle. The overall path will have less than
100% utilization.
T | A sends | R1 queues | R1 sends | R2 sends | R3 sends | R4 sends | B ACKs
0 | 1,2     | 2         | 1        |          |          |          |
1 |         |           | 2        | 1        |          |          |
2 |         |           |          | 2        | 1        |          |
3 |         |           |          |          | 2        | 1        |
4 | 3       |           | 3        |          |          | 2        | 1
5 | 4       |           | 4        | 3        |          |          | 2
6 |         |           |          | 4        | 3        |          |
7 |         |           |          |          | 4        | 3        |
8 | 5       |           | 5        |          |          | 4        | 3
9 | 6       |           | 6        | 5        |          |          | 4
Note the brief pile-up at R1 (the bottleneck link!) on startup. However, in the steady state, there is no
queuing. Real sliding-windows protocols generally have some way of minimizing this initial pileup.
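The startup transient and the steady state can be reproduced with a small time-stepped simulation. This Python sketch is not from the book; the function name is made up, and it simply makes the section's assumptions explicit (instant A──R1 hop, instant ACKs, one packet forwarded per router per second):

```python
from collections import deque

# A time-stepped sketch (not from the book) of the A--R1--R2--R3--R4--B
# example: each of the four slow routers forwards one packet per second,
# the A--R1 hop is instantaneous, and ACKs return instantly.
def simulate(winsize, seconds, nrouters=4):
    queues = [deque() for _ in range(nrouters)]   # FIFO at R1..R4
    queues[0].extend(range(1, winsize + 1))       # A's initial windowful
    next_pkt = winsize + 1
    acks = []                                     # (time, packet) seen by B
    in_flight = [None] * nrouters                 # packet on each link
    for t in range(seconds):
        # arrivals from the transmissions of second t-1
        for i, pkt in enumerate(in_flight):
            if pkt is None:
                continue
            if i + 1 < nrouters:
                queues[i + 1].append(pkt)
            else:                  # reached B: instant ACK, A sends one more
                acks.append((t, pkt))
                queues[0].append(next_pkt)
                next_pkt += 1
        # each router transmits the head of its queue during second t
        in_flight = [q.popleft() if q else None for q in queues]
    return acks

# winsize=2 reproduces the table above: B ACKs 1,2,3,4 at T=4,5,8,9
print(simulate(2, 10))
```

Running it with winsize=4 gives one ACK per second starting at T=4, matching Case 2 below.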
[Link] Case 2: winsize = 4
When winsize=4, at each second all four slow links are busy. There is again an initial burst leading to a brief
surge in the queue; RTTactual for Data[4] is 7 seconds. However, RTTactual for every subsequent packet is 4
seconds, and there are no queuing delays (and nothing in the queue) after T=2. The steady-state connection
throughput is 4 packets in 4 seconds, ie 1 packet/second. Note that overall path throughput now equals the
bottleneck-link bandwidth, so this is the best possible throughput.
T | A sends | R1 queues | R1 sends | R2 sends | R3 sends | R4 sends | B ACKs
0 | 1,2,3,4 | 2,3,4     | 1        |          |          |          |
1 |         | 3,4       | 2        | 1        |          |          |
2 |         | 4         | 3        | 2        | 1        |          |
3 |         |           | 4        | 3        | 2        | 1        |
4 | 5       |           | 5        | 4        | 3        | 2        | 1
5 | 6       |           | 6        | 5        | 4        | 3        | 2
6 | 7       |           | 7        | 6        | 5        | 4        | 3
7 | 8       |           | 8        | 7        | 6        | 5        | 4
8 | 9       |           | 9        | 8        | 7        | 6        | 5
At T=4, R1 has just finished sending Data[4] as Data[5] arrives from A; R1 can begin sending packet 5
immediately. No queue will develop.
Case 2 is the congestion knee of Chiu and Jain [CJ89], defined here in 1.7 Congestion.
[Link] Case 3: winsize = 6
T  | A sends     | R1 queues | R1 sends | R2 sends | R3 sends | R4 sends | B ACKs
0  | 1,2,3,4,5,6 | 2,3,4,5,6 | 1        |          |          |          |
1  |             | 3,4,5,6   | 2        | 1        |          |          |
2  |             | 4,5,6     | 3        | 2        | 1        |          |
3  |             | 5,6       | 4        | 3        | 2        | 1        |
4  | 7           | 6,7       | 5        | 4        | 3        | 2        | 1
5  | 8           | 7,8       | 6        | 5        | 4        | 3        | 2
6  | 9           | 8,9       | 7        | 6        | 5        | 4        | 3
7  | 10          | 9,10      | 8        | 7        | 6        | 5        | 4
8  | 11          | 10,11     | 9        | 8        | 7        | 6        | 5
9  | 12          | 11,12     | 10       | 9        | 8        | 7        | 6
10 | 13          | 12,13     | 11       | 10       | 9        | 8        | 7
Note that packet 7 is sent at T=4 and the acknowledgment is received at T=10, for an RTT of 6.0 seconds. All
later packets have the same RTTactual . That is, the RTT has risen from RTTnoLoad = 4 seconds to 6 seconds.
Note that we continue to send one windowful each RTT; that is, the throughput is still winsize/RTT, but RTT
is now 6 seconds.
One might initially conjecture that if winsize is greater than the bandwidth × RTTnoLoad product, then the
entire window cannot be in transit at one time. In fact this is not the case; the sender does usually have the
entire window sent and in transit, but RTT has been inflated so it appears to the sender that winsize equals
the bandwidth × RTT product.
In general, whenever winsize > bandwidth × RTTnoLoad, what happens is that the extra packets pile up at
a router somewhere along the path (specifically, at the router in front of the bottleneck link). RTTactual is
inflated by queuing delay to winsize/bandwidth, where bandwidth is that of the bottleneck link; this means
winsize = bandwidth × RTTactual. Total throughput is equal to that bandwidth. Of the 6 seconds of RTTactual
in the example here, a packet spends 4 of those seconds being transmitted on one link or another, because
RTTnoLoad = 4. The other two seconds, therefore, must be spent in a queue; there is no other place for packets
to wait. Looking at the table, we see that each second there are indeed two packets in the queue at R1.
If the bottleneck link is the very first link, packets may begin returning before the sender has sent the entire
windowful. In this case we may argue that the full windowful has at least been queued by the sender, and
thus has in this sense been sent. Suppose the network, for example, is
where, as before, each link transports 1 packet/sec from A to B and is infinitely fast in the reverse direction.
Then, if A sets winsize = 6, a queue of 2 packets will form at A.
no competing traffic and winsize is below the congestion knee (winsize < bandwidth × RTTnoLoad), then
winsize is the limiting factor in throughput. Finally, if there is no competition and winsize ≥ bandwidth ×
RTTnoLoad, then the connection is using 100% of the capacity of the bottleneck link and throughput is equal
to the bottleneck-link physical bandwidth. To put this another way,
4. RTTactual = winsize/bottleneck_bandwidth
queue_usage = winsize - bandwidth × RTTnoLoad

Dividing the first equation by RTTnoLoad, and noting that bandwidth × RTTnoLoad = winsize - queue_usage
= transit_capacity, we get

5. RTTactual/RTTnoLoad = winsize/transit_capacity = (transit_capacity + queue_usage) / transit_capacity
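Applied to the winsize=6 example (bottleneck bandwidth 1 packet/sec, RTTnoLoad = 4 sec), equations 4 and 5 give the numbers observed in the table. A quick check, not from the book, with made-up variable names:

```python
# Equations 4 and 5 for the winsize=6 example above:
bandwidth, rtt_noload, winsize = 1, 4, 6    # packets/sec, sec, packets

transit_capacity = bandwidth * rtt_noload   # 4 packets "on the wire"
rtt_actual = winsize / bandwidth            # equation 4: 6 seconds
queue_usage = winsize - transit_capacity    # the 2 packets queued at R1

# equation 5: RTT inflation equals the window/transit-capacity ratio
assert rtt_actual / rtt_noload == winsize / transit_capacity == 1.5
```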
Regardless of the value of winsize, in the steady state the sender never sends faster than the bottleneck
bandwidth. This is because the bottleneck bandwidth determines the rate of packets arriving at the far
end, which in turn determines the rate of ACKs arriving back at the sender, which in turn determines the
continued sending rate. This illustrates the self-clocking nature of sliding windows.
We will return in 14 Dynamics of TCP Reno to the issue of bandwidth in the presence of competing traffic.
For now, suppose a sliding-windows sender has winsize > bandwidth × RTTnoLoad, leading as above to a
fixed amount of queue usage, and no competition. Then another connection starts up and competes for the
bottleneck link. The first connection's effective bandwidth will thus decrease. This means that bandwidth ×
RTTnoLoad will decrease, and hence the connection's queue usage will increase.
The critical winsize value is equal to bandwidth × RTTnoLoad; this is known as the congestion knee. For
winsize below this, we have:

• throughput is proportional to winsize
• delay is constant
• queue utilization in the steady state is zero
For winsize larger than the knee, we have:

• throughput is constant (equal to the bottleneck bandwidth)
• delay increases linearly with winsize
• queue utilization increases linearly with winsize
Ideally, winsize will be at the critical knee. However, the exact value varies with time: available bandwidth
changes due to the starting and stopping of competing traffic, and RTT changes due to queuing. Standard
TCP makes an effort to stay well above the knee much of the time, presumably on the theory that maximizing
throughput is more important than minimizing queue use.
The power of a connection is defined to be throughput/RTT. For sliding windows below the knee, RTT is
constant and power is proportional to the window size. For sliding windows above the knee, throughput is
constant and delay is proportional to winsize; power is thus proportional to 1/winsize. Here is a graph, akin
to those above, of winsize versus power:
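The shape of that graph can be computed directly: power rises linearly up to the knee and then falls off as 1/winsize. A sketch with the running example's numbers (1 packet/sec bottleneck, RTTnoLoad = 4 sec); this code is not from the book and the function name is made up:

```python
# Power = throughput/RTT as a function of winsize, for a 1 packet/sec
# bottleneck and RTT_noLoad = 4 sec.  Below the knee RTT is constant and
# throughput grows with winsize; above it throughput is constant and RTT
# grows with winsize.
def power(winsize, bandwidth=1.0, rtt_noload=4.0):
    knee = bandwidth * rtt_noload
    if winsize <= knee:
        throughput, rtt = winsize / rtt_noload, rtt_noload
    else:
        throughput, rtt = bandwidth, winsize / bandwidth
    return throughput / rtt

print([power(w) for w in (1, 2, 4, 8)])   # peaks at the knee, winsize=4
```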
}
send ACK[LA]
There are a couple details omitted.
A possible implementation of EA is as an array of packet objects, of size W. We always put packet Data[K]
into position K % W.
At any point between packet arrivals, Data[LA+1] is not in EA, but some later packets may be present.
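This receiver-side bookkeeping can be sketched as follows (a non-authoritative sketch; the names EA, LA and W follow the text, while the class and method names are invented for illustration):

```python
class SWReceiver:
    """Sliding-windows receiver with EA as a circular array of size W."""
    def __init__(self, W):
        self.W = W
        self.LA = 0                  # last packet delivered and acknowledged in order
        self.EA = [None] * W         # EA[K % W] holds Data[K], if present
        self.delivered = []
    def arrive(self, K, data):
        """Process arriving Data[K]; return the ACK number to send."""
        if self.LA < K <= self.LA + self.W:
            self.EA[K % self.W] = (K, data)
        # slide LA forward over any now-contiguous packets
        while True:
            slot = self.EA[(self.LA + 1) % self.W]
            if slot is None or slot[0] != self.LA + 1:
                break
            self.delivered.append(slot[1])
            self.EA[(self.LA + 1) % self.W] = None
            self.LA += 1
        return self.LA               # send ACK[LA]

r = SWReceiver(4)
print(r.arrive(2, "b"))   # 0: out of order, held in EA, re-ACK the old LA
print(r.arrive(1, "a"))   # 2: the gap fills, both packets are delivered
```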
For the sender side, we begin by sending a full windowful of packets Data[1] through Data[W], and setting
LA=0. When ACK[M] arrives:
    if M≤LA or M>LA+W, ignore the packet
    otherwise:
        set K = LA+1
        set LA = M, the new bottom edge of the window
        for (i=K; i≤LA; i++) send Data[i+W]
Note that new ACKs may arrive while we are in the loop at that last line. We assume here that the sender
stolidly sends what it may send and only after that does it start to process arriving ACKs. Some implementations may take a more asynchronous approach, perhaps with one thread processing arriving ACKs and
incrementing LA and another thread sending everything it is allowed to send.
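The sender-side step above can be rendered in Python (a non-authoritative sketch; process_ack is an invented name, and it returns the packet numbers newly permitted rather than sending them):

```python
def process_ack(M, LA, W):
    """Apply ACK[M] to a sender whose current window is [LA+1, LA+W].

    Returns (new_LA, list of newly sendable Data[] numbers)."""
    if M <= LA or M > LA + W:
        return LA, []                                   # old or out-of-range ACK: ignore
    new_sends = [i + W for i in range(LA + 1, M + 1)]   # Data[i+W] for i = K..M
    return M, new_sends

# Initially Data[1]..Data[4] are sent, with W=4 and LA=0.
print(process_ack(3, 0, 4))   # (3, [5, 6, 7]): window slides to [4, 7]
print(process_ack(2, 3, 4))   # (3, []): an old ACK is ignored
```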
6.4 Epilog
This completes our discussion of the sliding-windows algorithm in the abstract setting. We will return to
concrete implementations of this in 11.3.2 TFTP Stop-and-Wait (stop-and-wait) and in 12.13 TCP Sliding
Windows; the latter is one of the most important mechanisms on the Internet.
6.5 Exercises
1. Sketch a ladder diagram for stop-and-wait if Data[3] is lost the first time it is sent. Continue the diagram
to the point where Data[4] is successfully transmitted. Assume an RTT of 1 second, no sender timeout (but
the sender retransmits on duplicate), and a receiver timeout of 2 seconds.
2. Suppose a stop-and-wait receiver has an implementation flaw. When Data[1] arrives, ACK[1] and ACK[2]
are sent, separated by a brief interval; after that, the receiver transmits ACK[N+1] when Data[N] arrives,
rather than the correct ACK[N].
(a). Assume the sender responds to each ACK as it arrives. What is the first ACK[N] that it will be able to
determine is incorrect? Assume no packets are lost.
(b). Is there anything the transmitter can do to detect this receiver problem earlier?
3. Create a table as in 6.3.1 Simple fixed-window-size analysis for a network A──R1──R2──R3──R4──B,
with a 1 packet/sec bandwidth delay for the R1–R2, R2–R3, R3–R4 and R4–B links. The A–R1 link and
all reverse links (from B to A) are infinitely fast. Carry out the table for 10 seconds.
4. Create a table as in 6.3.1 Simple fixed-window-size analysis for a network A──R1──R2──B. The
A–R1 link is infinitely fast; the R1–R2 and R2–B links each have a 1-second propagation delay, in each direction,
and zero bandwidth delay (that is, one packet takes 1.0 sec to travel from R1 to R2; two packets also take
1.0 sec to travel from R1 to R2). Assume winsize=6. Carry out the table for 8 seconds. Note that with zero
bandwidth delay, multiple packets sent together will remain together until the destination.
5. Suppose RTTnoLoad = 4 seconds and the bottleneck bandwidth is 1 packet every 2 seconds.
(a). What window size is needed to remain just at the knee of congestion?
(b). Suppose winsize=6. How many packets are in the queue, at the steady state, and what is RTTactual ?
6. (The statement of this exercise did not survive in this copy; only its table remains, showing, second by
second, what R2 sends: packets 1, 1, 2, 2, 3, 3, ….)
7. Argue that, if A sends to B using sliding windows, and in the path from A to B the slowest link is not the
first link out of A, then eventually A will have the entire window outstanding (except at the instant just after
each new ACK comes in).
8. Suppose RTTnoLoad is 50 ms and the available bandwidth is 2,000 packets/sec. Sliding windows is used
for transmission.
(a). What window size is needed to remain just at the knee of congestion?
(b). If RTTactual rises to 60 ms (due to use of a larger winsize), how many packets are in a queue at any one
time?
(c). What value of winsize would lead to RTTactual = 60 ms?
(d). What value of winsize would make RTTactual rise to 100 ms?
9. Suppose winsize=4 in a sliding-windows connection, and assume that while packets may be lost, they
are never reordered (that is, if two packets P1 and P2 are sent in that order, and both arrive, then they arrive
in that order). Show that if Data[8] is in the receiver's window (meaning that everything up through Data[4]
has been received and acknowledged), then it is no longer possible for even a late Data[0] to arrive at the
receiver. (A consequence of the general principle here is that in the absence of reordering we can replace
the packet sequence number with (sequence_number) mod (2×winsize+1) without ambiguity.)
10. Suppose winsize=4 in a sliding-windows connection, and assume as in the previous exercise that while
packets may be lost, they are never reordered. Give an example in which Data[8] is in the receiver's window
(so the receiver has presumably sent ACK[4]), and yet Data[1] legitimately arrives. (Thus, the late-packet
bound in the previous exercise is the best possible.)
11. Suppose the network is A──R1──R2──B, where the A–R1 link is infinitely fast and the R1–R2
link has a bandwidth of 1 packet/second each way, for an RTTnoLoad of 2 seconds. Suppose also that A
begins sending with winsize = 6. By the analysis in [Link] Case 3: winsize = 6, RTT should rise to
winsize/bandwidth = 6 seconds. Give the RTTs of the first eight packets. How long does it take for RTT to
rise to 6 seconds?
7 IP VERSION 4
There are multiple LAN protocols below the IP layer and multiple transport protocols above, but IP itself
stands alone. The Internet is essentially the IP Internet. If you want to run your own LAN somewhere or if
you want to run your own transport protocol, the Internet backbone will still work for you. But if you want
to change the IP layer, you may encounter difficulty. (Just talk to the IPv6 people, or the IP-multicasting or
IP-reservations groups.)
IP is, in effect, a universal routing and addressing protocol. The two are developed together; every node
has an IP address and every router knows how to handle IP addresses. IP was originally seen as a way to
interconnect multiple LANs, but it may make more sense now to view IP as a virtual LAN overlaying all
the physical LANs.
A crucial aspect of IP is its scalability. As the Internet has grown to ~10^9 hosts, the forwarding tables are
not much larger than 10^5 entries (perhaps now 10^5.5). Ethernet, in comparison, scales poorly.
Furthermore, IP, unlike Ethernet, offers excellent support for multiple redundant links. If the network
below were an IP network, each node would communicate with each immediate neighbor via their shared
direct link. If, on the other hand, this were an Ethernet network with the spanning-tree algorithm, then one
of the four links would simply be disabled completely.
The IP network service model is to act like a LAN. That is, there are no acknowledgments; delivery is
generally described as best-effort. This design choice is perhaps surprising, but it has also been quite fruitful.
Currently the Internet uses (almost exclusively) IP version 4, with its 32-bit address size. As the Internet has
run out of new large blocks of IPv4 addresses (1.10 IP - Internet Protocol), there is increasing pressure to
convert to IPv6, with its 128-bit address size. Progress has been slow, however, and delaying tactics such
as IPv4-address markets and NAT have allowed IPv4 to continue. Aside from the major change in address
structure, there are relatively few differences in the routing models of IPv4 and IPv6. We will study IPv4 in
this chapter and IPv6 in the following.
If you want to provide a universal service for delivering any packet anywhere, what else do you need besides
routing and addressing? Every network (LAN) needs to be able to carry any packet. The protocols spell
out the use of octets (bytes), so the only possible compatibility issue is that a packet is too large for a given
network. IP handles this by supporting fragmentation: a network may break a too-large packet up into
units it can transport successfully. While IP fragmentation is inefficient and clumsy, it does guarantee that
any packet can potentially be delivered to any node.
The IP header, and basics of IP protocol operation, were defined in RFC 791; some minor changes have
since occurred. Most of these changes were documented in RFC 1122, though the DS field was defined in
RFC 2474 and the ECN bits were first proposed in RFC 2481.
The Version field is, for IPv4, the number 4: 0100. The IHL field represents the total IP Header Length, in
32-bit words; an IP header can thus be at most 15 words long. The base header takes up five words, so the
IP Options can consist of at most ten words. If one looks at IPv4 packets using a packet-capture tool that
displays the packets in hex, the first byte will most often be 0x45.
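The 0x45 value decodes directly: the high nibble is the Version and the low nibble the IHL. A quick sketch:

```python
first_byte = 0x45            # typical first byte of an IPv4 header
version = first_byte >> 4    # high 4 bits: the Version field, 4
ihl = first_byte & 0x0F      # low 4 bits: header length in 32-bit words, 5
header_bytes = ihl * 4       # 5 words = 20 bytes: no IP options present
print(version, ihl, header_bytes)   # 4 5 20
```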
The Differentiated Services (DS) field is used by the Differentiated Services suite to specify preferential handling for designated packets, eg those involved in VoIP or other real-time protocols. The Explicit
Congestion Notification bits are there to allow routers experiencing congestion to mark packets, thus indicating to the sender that the transmission rate should be reduced. We will address these in 14.8.2 Explicit
Congestion Notification (ECN). These two fields together replace the old 8-bit Type of Service field.
The Total Length field is present because an IP packet may be smaller than the minimum LAN packet
size (see Exercise 1) or larger than the maximum (if the IP packet has been fragmented over several LAN
packets). The IP packet length, in other words, cannot be inferred from the LAN-level packet size. Because
the Total Length field is 16 bits, the maximum IP packet size is 2^16 − 1 bytes. This is probably much too large,
even if fragmentation were not something to be avoided.
The second word of the header is devoted to fragmentation, discussed below.
The Time-to-Live (TTL) field is decremented by 1 at each router; if it reaches 0, the packet is discarded. A
typical initial value is 64; it must be larger than the total number of hops in the path. In most cases, a value of
32 would work. The TTL field is there to prevent routing loops (always a serious problem should they occur)
from consuming resources indefinitely. Later we will look at various IP routing-table update protocols and
how they minimize the risk of routing loops; they do not, however, eliminate it. By comparison, Ethernet
headers have no TTL field, but Ethernet also disallows cycles in the underlying topology.
The Protocol field contains a value to indicate if the body of the IP packet represents a TCP packet or a
UDP packet, or, in unusual cases, something else altogether.
The Header Checksum field is the Internet checksum applied to the header only, not the body. Its only
purpose is to allow the discarding of packets with corrupted headers. When the TTL value is decremented
the router must update the header checksum. This can be done algebraically by adding a 1 in the correct
place to compensate, but it is not hard simply to re-sum the 10 halfwords of the average header.
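The Internet checksum itself is the ones'-complement sum of the header's 16-bit halfwords, complemented, with the checksum field treated as zero during the computation. A minimal sketch, using an arbitrary made-up header for illustration:

```python
def internet_checksum(data: bytes) -> int:
    """Ones'-complement sum of 16-bit halfwords, complemented."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                        # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# A made-up 20-byte header, with the checksum field (bytes 10-11) zeroed:
header = bytearray.fromhex("4500003c1c4640004006" "0000" "c0a80001c0a800c7")
cksum = internet_checksum(bytes(header))
header[10:12] = cksum.to_bytes(2, "big")      # install the computed checksum
# Re-summing the completed header yields 0; this is how a router verifies it:
print(hex(internet_checksum(bytes(header))))  # 0x0
```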
The Source and Destination Address fields contain, of course, the IP addresses. These would be updated
only by NAT firewalls.
One option is the Record Route option, in which routers are to insert their own IP address into the IP
header option area. Unfortunately, with only ten words available, there is not enough space to record most
longer routes (but see 7.9.1 Traceroute and Time Exceeded, below). Another option, now deprecated as a
security risk, is to support source routing. The sender would insert into the IP header option area a list of IP
addresses; the packet would be routed to pass through each of those IP addresses in turn. With strict source
routing, the IP addresses had to represent adjacent neighbors; no router could be used if its IP address were
not on the list. With loose source routing, the listed addresses did not have to represent adjacent neighbors
and ordinary IP routing was used to get from one listed IP address to the next. Both forms are essentially
never used, again for security reasons: if a packet has been source-routed, it may have been routed outside
of the at-least-somewhat trusted zone of the Internet backbone.
7.2 Interfaces
IP addresses (IPv4 and IPv6) are, strictly speaking, assigned not to hosts or nodes, but to interfaces. In the
most common case, where each node has a single LAN interface, this is a distinction without a difference. In
a room full of workstations each with a single Ethernet interface eth0 (or perhaps Ethernet adapter
Local Area Connection), we might as well view the IP address assigned to the interface as assigned
to the workstation itself.
Each of those workstations, however, likely also has a loopback interface (at least conceptually), providing
a way to deliver IP packets to other processes on the same machine. On many systems, the name localhost
resolves to the IPv4 address 127.0.0.1 (although the IPv6 address ::1 is also used). Delivering packets to
the localhost address is simply a form of interprocess communication; a functionally similar alternative is
named pipes. Loopback delivery avoids the need to use the LAN at all, or even the need to have a LAN.
For simple client/server testing, it is often convenient to have both client and server on the same machine,
in which case the loopback interface is convenient and fast. On unix-based machines the loopback interface
represents a genuine logical interface, commonly named lo. On Windows systems the interface may not
represent an actual entity, but this is of practical concern only to those interested in sniffing all loopback
traffic; packets sent to the loopback address are still delivered as expected.
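A quick loopback round-trip can be demonstrated with UDP sockets. A minimal Python sketch (binding to port 0 lets the OS pick a free port; no LAN is involved at any point):

```python
import socket

# "Server" side: bind to the loopback address.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
port = server.getsockname()[1]

# "Client" side, on the same machine.
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"hello", ("127.0.0.1", port))

data, addr = server.recvfrom(1024)
print(data)          # b'hello'
client.close()
server.close()
```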
Workstations often have special other interfaces as well. Most recent versions of Microsoft Windows have
a Teredo Tunneling pseudo-interface and an Automatic Tunneling pseudo-interface; these are both intended
(when activated) to support IPv6 connectivity when the local ISP supports only IPv4. The Teredo protocol
is documented in RFC 4380.
When VPN connections are created, as in 3.1 Virtual Private Network, each end of the logical connection
typically terminates at a virtual interface (one of these is labeled tun0 in the diagram of 3.1 Virtual Private
Network). These virtual interfaces appear, to the systems involved, to be attached to a point-to-point link
that leads to the other end.
When a computer hosts a virtual machine, there is almost always a virtual network to connect the host and
virtual systems. The host will have a virtual interface to connect to the virtual network. The host may act as
a NAT router for the virtual machine, hiding that virtual machine behind its own IP address, or it may act
as an Ethernet switch, in which case the virtual machine will need an additional public IP address.
What's My IP Address?
This simple-seeming question is in fact not very easy to answer, if by "my IP address" one means the IP
address assigned to the interface that connects directly to the Internet. One strategy is to find the address
of the default router, and then iterate through all interfaces (eg with the Java NetworkInterface
class) to find an IP address with a matching network prefix. Unfortunately, finding the default router
is hard to do in an OS-independent way, and even then this approach can fail if the Wi-Fi and Ethernet
interfaces both are assigned IP addresses on the same network, but only one is actually connected.
Many workstations have both an Ethernet interface and a Wi-Fi interface. Both of these can be used simultaneously (with different IP addresses assigned to each), either on the same IP network or on different IP
networks.
Routers always have at least two interfaces on two separate LANs. Generally this means a separate IP
address for each interface, though some point-to-point interfaces can be used without being assigned an IP
address (7.10 Unnumbered Interfaces).
Finally, it is usually possible to assign multiple IP addresses to a single interface. Sometimes this is done
to allow two IP networks to share a single LAN; the interface might be assigned one IP address for each
IP network. Other times a single interface is assigned multiple IP addresses that are on the same LAN; this
is often done so that one physical machine can act as a server (eg a web server) for multiple distinct IP
addresses corresponding to multiple distinct domain names.
While it is important to be at least vaguely aware of all these special cases, we emphasize again that in most
ordinary contexts each end-user workstation has one IP address that corresponds to a LAN connection.
172.16.0.0/12
192.168.0.0/16
The last block is the one from which addresses are most commonly allocated by DHCP servers
(7.8.1 DHCP and the Small Office) built into NAT routers.
Broadcast addresses are a special form of IP address intended to be used in conjunction with LAN-layer
broadcast. The most common forms are "broadcast to this network", consisting of all 1-bits, and "broadcast
to network D", consisting of D's network-address bits followed by all 1-bits for the host bits. If you try to
send a packet to the broadcast address of a remote network D, the odds are that some router involved will
refuse to forward it, and the odds are even higher that, once the packet arrives at a router actually on network
D, that router will refuse to broadcast it. Even addressing a broadcast to one's own network will fail if the
underlying LAN does not support LAN-level broadcast (eg ATM).
The highly influential early Unix implementation Berkeley 4.2 BSD used 0-bits for the broadcast bits, instead of 1s. As a result, to this day host bits cannot be all 1-bits or all 0-bits in order to avoid confusion with
the IP broadcast address. One consequence of this is that a Class C network has 254 usable host addresses,
not 256.
7.4 Fragmentation
If you are trying to interconnect two LANs (as IP does), what else might be needed besides Routing and
Addressing? IP explicitly assumes all packets are composed of 8-bit bytes (something not universally true
in the early days of IP; to this day the RFCs refer to octets to emphasize this requirement). IP also defines
bit-order within a byte, and it is left to the networking hardware to translate properly. Neither byte size nor
bit order, therefore, can interfere with packet forwarding.
There is one more feature IP must provide, however, if the goal is universal connectivity: it must accommodate networks for which the maximum packet size, or Maximum Transfer Unit, MTU, is smaller than
the packet that needs forwarding. Otherwise, if we were using IP to join Token Ring (MTU = 4KB, at least
originally) to Ethernet (MTU = 1500B), the token-ring packets might be too large to deliver to the Ethernet side, or to traverse an Ethernet backbone en route to another Token Ring. (Token Ring, in its day, did
commonly offer a configuration option to allow Ethernet interoperability.)
So, IP must support fragmentation, and thus also reassembly. There are two major strategies here: per-link
fragmentation and reassembly, where the reassembly is done at the opposite end of the link (as in ATM), and
path fragmentation and reassembly, where reassembly is done at the far end of the path. The latter approach
is what is taken by IP, partly because intermediate routers are too busy to do reassembly (this is as true today
as it was thirty years ago), and partly because IP fragmentation is seen as the strategy of last resort.
An IP sender is supposed to use a different value for the IDENT field for different packets, at least up until
the field wraps around. When an IP datagram is fragmented, the fragments keep the same IDENT field, so
this field in effect indicates which fragments belong to the same packet.
After fragmentation, the Fragment Offset field marks the start position of the data portion of this fragment
within the data portion of the original IP packet. Note that the start position can be a number up to 2^16, the
maximum IP packet length, but the FragOffset field has only 13 bits. This is handled by requiring the data
portions of fragments to have sizes a multiple of 8 (three bits), and left-shifting the FragOffset value by 3
bits before using it.
As an example, consider the following network, where the MTUs exclude the LAN header (the A–R1 link has an MTU of 1500 bytes, the R1–R2 link 1000 bytes, and the R2–R3 link 400 bytes):
Suppose A addresses a packet of 1500 bytes to B, and sends it via the LAN to the first router R1. The packet
contains 20 bytes of IP header and 1480 of data.
R1 fragments the original packet into two packets of sizes 20+976 = 996 and 20+504 = 524. Having 980
bytes of payload in the first fragment would fit, but violates the rule that the sizes of the data portions be
divisible by 8. The first fragment packet has FragOffset = 0; the second has FragOffset = 976.
R2 refragments the first fragment into three packets as follows:
first: size = 20+376=396, FragOffset = 0
second: size = 20+376=396, FragOffset = 376
third: size = 20+224 = 244 (note 376+376+224=976), FragOffset = 752.
R2 refragments the second fragment into two:
first: size = 20+376 = 396, FragOffset = 976+0 = 976
second: size = 20+128 = 148, FragOffset = 976+376=1352
R3 then sends the fragments on to B, without reassembly.
Note that it would have been slightly more efficient to have fragmented into four fragments of sizes 376,
376, 376, and 352 in the beginning. Note also that the packet format is designed to handle fragments of
different sizes easily. The algorithm is based on multiple fragmentation with reassembly only at the final
destination.
Each fragment has its IP-header Total Length field set to the length of that fragment.
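The arithmetic in this example follows a simple rule: each non-final fragment carries the largest multiple of 8 data bytes that fits under the link MTU. A non-authoritative Python sketch (the function name is invented) reproducing the example's numbers:

```python
def fragment(offset, data_len, mtu, header=20):
    """Split a data region [offset, offset+data_len) into (offset, size) fragments.

    Non-final fragments carry a multiple of 8 data bytes; the on-the-wire
    13-bit FragOffset field would store offset >> 3."""
    max_data = ((mtu - header) // 8) * 8
    frags = []
    while data_len > max_data:
        frags.append((offset, max_data))
        offset += max_data
        data_len -= max_data
    frags.append((offset, data_len))
    return frags

print(fragment(0, 1480, mtu=1000))   # R1: [(0, 976), (976, 504)]
print(fragment(0, 976, mtu=400))     # R2, first:  [(0, 376), (376, 376), (752, 224)]
print(fragment(976, 504, mtu=400))   # R2, second: [(976, 376), (1352, 128)]
```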
We have not yet discussed the three flag bits. The first bit is reserved, and must be 0. The second bit is the
Don't Fragment bit. If it is set to 1 by the sender then a router must not fragment the packet and must
drop it instead; see 12.12 Path MTU Discovery for an application of this. The third bit is set to 1 for all
fragments except the final one (this bit is thus set to 0 if no fragmentation has occurred). The third bit tells
the receiver where the fragments stop.
The receiver must take the arriving fragments and reassemble them into a whole packet. The fragments
may not arrive in order (unlike in ATM networks) and may have unrelated packets interspersed. The
reassembler must identify when different arriving packets are fragments of the same original, and must
figure out how to reassemble the fragments in the correct order; both these problems were essentially trivial
for ATM.
Fragments are considered to belong to the same packet if they have the same IDENT field and also the same
source and destination addresses and same protocol.
As all fragment sizes are a multiple of 8 bytes, the receiver can keep track of whether all fragments have
been received with a bitmap in which each bit represents one 8-byte fragment chunk. A 1 KB packet could
have up to 128 such chunks; the bitmap would thus be 16 bytes.
If a fragment arrives that is part of a new (and fragmented) packet, a buffer is allocated. While the receiver
cannot know the final size of the buffer, it can usually make a reasonable guess. Because of the FragOffset
field, the fragment can then be stored in the buffer in the appropriate position. A new bitmap is also allocated,
and a reassembly timer is started.
As subsequent fragments arrive, not necessarily in order, they too can be placed in the proper buffer in the
proper position, and the appropriate bits in the bitmap are set to 1.
If the bitmap shows that all fragments have arrived, the packet is sent on up as a completed IP packet. If, on
the other hand, the reassembly timer expires, then all the pieces received so far are discarded.
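This reassembly bookkeeping can be sketched as follows (non-authoritative; a Python set of 8-byte chunk indices stands in for the bitmap, and the class and method names are invented):

```python
class Reassembler:
    """Track one fragmented packet's arrival, one bit (chunk index) per 8 data bytes."""
    def __init__(self):
        self.chunks = set()
        self.total = None                    # learned from the final (MF=0) fragment
    def add(self, offset, length, more_fragments):
        """Record a fragment; return True once the packet is complete."""
        for c in range(offset // 8, (offset + length + 7) // 8):
            self.chunks.add(c)
        if not more_fragments:
            self.total = offset + length     # final fragment fixes the packet length
        return self.complete()
    def complete(self):
        if self.total is None:
            return False
        return self.chunks >= set(range((self.total + 7) // 8))

# The five fragments of the example above, arriving out of order:
r = Reassembler()
for frag in [(752, 224, True), (0, 376, True), (1352, 128, False), (376, 376, True)]:
    print(r.add(*frag))                      # False each time: still incomplete
print(r.add(976, 376, True))                 # True: all 1480 data bytes accounted for
```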
TCP connections usually engage in Path MTU Discovery, and figure out the largest packet size they can
send that will not entail fragmentation (12.12 Path MTU Discovery). But it is not unusual, for example,
for UDP protocols to use fragmentation, especially over the short haul. In the Network File System (NFS)
protocol, for example, UDP is used to carry 8KB disk blocks. These are often sent as a single 8+ KB IP
packet, fragmented over Ethernet to five full packets and a fraction. Fragmentation works reasonably well
here because most of the time the packets do not leave the Ethernet they started on. Note that this is an
example of fragmentation done by the sender, not by an intermediate router.
Finally, any given IP link may provide its own link-layer fragmentation and reassembly; we saw in
3.8.1 ATM Segmentation and Reassembly that ATM does just this. Such link-layer mechanisms are, however, generally invisible to the IP layer.
lookup and assume there is a lookup() method available that, when given a destination address, returns
the next_hop neighbor.
Instead of class-based divisions, we will assume that each of the IP addresses assigned to a node's interfaces
is configured with an associated length of the network prefix; following the slash notation of 1.10 IP -
Internet Protocol, if B is an address and the prefix length is k = kB then the prefix itself is B/k. As usual, an
ordinary host may have only one IP interface, while a router will always have multiple interfaces.
Let D be the given IP destination address; we want to decide if D is local or nonlocal. The host or router
involved may have multiple IP interfaces, but for each interface the length of the network portion of the
address will be known. For each network address B/k assigned to one of the host's interfaces, we compare
the first k bits of B and D; that is, we ask if D matches B/k.
• If one of these comparisons yields a match, delivery is local; the host delivers the packet to its final
destination via the LAN connected to the corresponding interface. This means looking up the LAN
address of the destination, if applicable, and sending the packet to that destination via the interface.
• If there is no match, delivery is nonlocal, and the host passes D to the lookup() routine of the
forwarding table and sends to the associated next_hop (which must represent a physically connected
neighbor). It is now up to the lookup() routine to make any necessary determinations as to how D
might be split into Dnet and Dhost.
The forwarding table is, abstractly, a set of network addresses (now also with lengths), each of the form
B/k, with an associated next_hop destination for each. The lookup() routine will, in principle, compare
D with each table entry B/k, looking for a match (that is, equality of the first k bits). As with the interfaces
check above, the net/host division point (that is, k) will come from the table entry; it will not be inferred
from D or from any other information borne by the packet. There is, in fact, no place in the IP header to
store a net/host division point, and furthermore different routers along the path may use different values of k
with the same destination address D. In 10 Large-Scale IP Routing we will see that in some cases multiple
matches in the forwarding table may exist; the longest-match rule will be introduced to pick the best match.
Here is a simple example for a router with immediate neighbors A-E:
destination      next_hop
[Link]/16        A
[Link]/24        B
[Link]/24        C
[Link]/24        D
[Link]/24        E
The IP addresses [Link] and [Link] both route to A. The addresses [Link], [Link] and
[Link] route to B, C and D respectively. Finally, [Link] matches both A and E, but the E match is
longer so the packet is routed that way.
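The longest-match rule can be sketched directly in Python (non-authoritative; because the example's actual addresses are not reproduced in this copy, the prefixes below are hypothetical stand-ins):

```python
import ipaddress

def lookup(dest, table):
    """Longest-prefix match of dest against (prefix, next_hop) entries."""
    d = ipaddress.ip_address(dest)
    best = None
    for prefix, next_hop in table:
        net = ipaddress.ip_network(prefix)
        if d in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return best[1] if best else None

table = [("10.38.0.0/16", "A"), ("10.38.43.0/24", "E")]   # hypothetical entries
print(lookup("10.38.2.9", table))    # A: only the /16 matches
print(lookup("10.38.43.7", table))   # E: both match, but /24 is the longer prefix
```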
The forwarding table may also contain a default entry for the next_hop, which it may return in cases when
the destination D does not match any known network. We take the view here that returning such a default
entry is a valid result of the routing-table lookup() operation, rather than a third option to the algorithm
above; one approach is for the default entry to be the next_hop corresponding to the destination [Link]/0,
which does indeed match everything (use of this would definitely require the above longest-match rule,
though).
Default routes are hugely important in keeping leaf forwarding tables small. Even backbone routers sometimes
expend considerable effort to keep the network address prefixes in their forwarding tables as short as
possible, through consolidation.
Routers may also be configured to allow passing quality-of-service information to the lookup() method,
as mentioned in Chapter 1, to support different routing paths for different kinds of traffic (eg bulk file-transfer
versus real-time).
For a modest exception to the local-delivery rule described here, see below in 7.10 Unnumbered Interfaces.
7.6 IP Subnets
Subnets were the first step away from Class A/B/C routing: a large network (eg a class A or B) could be
divided into smaller IP networks called subnets. Consider, for example, a typical Class B network such
as Loyola University's (originally [Link]/16); the underlying assumption is that any packet can be
delivered via the underlying LAN to any internal host. This would require a rather large LAN, and would
delivered via the underlying LAN to any internal host. This would require a rather large LAN, and would
require that a single physical LAN be used throughout the site. What if our site has more than one physical
LAN? Or is really too big for one physical LAN? It did not take long for the IP world to run into this
problem.
Subnets were first proposed in RFC 917, and became official with RFC 950.
Getting a separate IP network prefix for each subnet is bad for routers: the backbone forwarding tables now
must have an entry for every subnet instead of just for every site. What is needed is a way for a site to appear
to the outside world as a single IP network, but for further IP-layer routing to be supported inside the site.
This is what subnets accomplish.
Subnets introduce hierarchical routing: first we route to the primary network, then inside that site we route
to the subnet, and finally the last hop delivers to the host.
Routing with subnets involves in effect moving the net/host division line rightward. (Later, when we consider
CIDR, we will see the complementary case of moving the division line to the left.) For now, observe that
moving the line rightward within a site does not affect the outside world at all; outside routers are not even
aware of site-internal subnetting.
In the following diagram, the outside world directs traffic addressed to [Link]/16 to the router R. Internally, however, the site is divided into subnets. The idea is that traffic from [Link]/24 to [Link]/24
is routed, not switched; the two LANs involved may not even be compatible. Most of the subnets shown
are of size /24, meaning that the third byte of the IP address has become part of the network portion of the
subnet's address; one /20 subnet is also shown. RFC 950 would have disallowed the subnet with third byte
0, but having 0 for the subnet bits generally does work.
What we want is for the internal routing to be based on the extended network prefixes shown, while externally continuing to use only the single routing entry for [Link]/16.
To implement subnets, we divide the site's IP network into some combination of physical LANs (the subnets),
and assign each a subnet address: an IP network address which has the site's IP network address as prefix.
To put this more concretely, suppose the site's IP network address is A, and consists of n network bits (so
the site address may be written with the slash notation as A/n); in the diagram above, A/n = [Link]/16.
A subnet address is an IP network address B/k such that:
• The address B/k is within the site: the first n bits of B are the same as those of A/n
• B/k extends A/n: k ≥ n
An example B/k in the diagram above is [Link]/24. (There is a slight simplification here in that subnet
addresses do not absolutely have to be prefixes; see below.)
We now have to figure out how packets will be routed to the correct subnet. For incoming packets we could
set up some proprietary protocol at the entry router to handle this. However, the more complicated situation
is all those existing internal hosts that, under the class A/B/C strategy, would still believe they can deliver
via the LAN to any site host, when in fact they can now only do that for hosts on their own subnet. We need
a more general solution.
We proceed as follows. For each subnet address B/k, we create a subnet mask for B consisting of k 1-bits
followed by enough 0-bits to make a total of 32. We then make sure that every host and router in the site
knows the subnet mask for every one of its interfaces. Hosts usually find their subnet mask the same way
they find their IP address (by static configuration if necessary, but more likely via DHCP, below).
Hosts and routers now apply the IP delivery algorithm of the previous section, with the proviso that, if a
subnet mask for an interface is present, then the subnet mask is used to determine the number of address bits
rather than the Class A/B/C mechanism. That is, we determine whether a packet addressed to destination
D is deliverable locally via an interface with subnet address B/k and corresponding mask M by comparing
D&M with B&M, where & represents bitwise AND; if the two match, the packet is local. This will generally
involve a match of more bits than if we used the Class A/B/C strategy to determine the network portion of
addresses D and B.
As stated previously, given an address D with no other context, we will not be able to determine the network/host division point in general (eg for outbound packets). However, that division point is not in fact
what we need. All that is needed is a way to tell if a given destination host address D belongs to the current
subnet, say B; that is, we need to compare the first k bits of D and B where k is the (known) length of B.
In the diagram above, the subnet mask for the /24 subnets would be 255.255.255.0; bitwise ANDing any IP
address with the mask is the same as extracting the first 24 bits of the IP address, that is, the subnet portion.
The mask for the /20 subnet would be 255.255.240.0 (240 in binary is 1111 0000).
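The mask construction and the D&M = B&M comparison can be sketched in a few lines of Python; the helper names and the addresses in the usage example below are illustrative, not from the diagram:

```python
# Sketch of the subnet-aware local-delivery test: destination D is local
# to an interface with subnet address B and mask M exactly when
# D&M == B&M. Addresses are handled as 32-bit integers for clarity.

def ip_to_int(dotted):
    """Convert a dotted-quad IPv4 string to a 32-bit integer."""
    a, b, c, d = (int(x) for x in dotted.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

def mask_of(k):
    """Mask for a /k prefix: k 1-bits followed by (32-k) 0-bits."""
    return ((1 << k) - 1) << (32 - k)

def is_local(D, B, k):
    """True if destination D lies on the subnet B/k."""
    M = mask_of(k)
    return (ip_to_int(D) & M) == (ip_to_int(B) & M)

# The /24 and /20 masks come out as expected:
assert mask_of(24) == ip_to_int("255.255.255.0")
assert mask_of(20) == ip_to_int("255.255.240.0")
```

With illustrative addresses, `is_local("10.1.65.48", "10.1.65.0", 24)` is True, while `is_local("10.1.66.48", "10.1.65.0", 24)` is False: the two third bytes differ, so the masked values differ.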
In the diagram above none of the subnets overlaps or conflicts: the subnets [Link]/24 and
[Link]/24 are disjoint. It takes a little more effort to realize that [Link]/20 does not overlap
with the others, but note that an IP address matches this network prefix only if the first four bits of the third
byte are 0001, so the third byte itself ranges from decimal 16 to decimal 31 = binary 0001 1111.
Note also that if host A = [Link] wishes to send to destination D = [Link], and A is not subnet-aware,
then delivery will fail: A will infer that the interface is a Class B, and therefore compare the first
two bytes of A and D, and, finding a match, will attempt direct LAN delivery. But direct delivery is now
likely impossible, as the subnets are not joined by a switch. Only with the subnet mask will A realize that
its network is [Link]/24 while D's is [Link]/24 and that these are not the same. A would still be
able to send packets to its own subnet. In fact A would still be able to send packets to the outside world:
it would realize that the destination in that case does not match [Link]/16 and will thus forward to its
router. Hosts on other subnets would be the only unreachable ones.
Properly, the subnet address is the entire prefix, eg [Link]/24. However, it is often convenient to
identify the subnet address with just those bits that represent the extension of the site IP-network address;
we might thus say casually that the subnet address here is 65.
The class-based IP-address strategy allowed any host anywhere on the Internet to properly separate any
address into its net and host portions. With subnets, this division point is now allowed to vary; for example,
the address 147.126.65.48 divides into 147.126 | 65.48 outside of Loyola, but into 147.126.65 | 48 inside.
This means that the net-host division is no longer an absolute property of addresses, but rather something
that depends on where the packet is on its journey.
Technically, we also need the requirement that given any two subnet addresses of different, disjoint subnets,
neither is a proper prefix of the other. This guarantees that if A is an IP address and B is a subnet address
with mask M (so B = B&M), then A&M = B implies A does not match any other subnet. Regardless of
the net/host division rules, we cannot possibly allow subnet [Link]/20 to represent one LAN while
[Link]/24 represents another; the second subnet address block is a subset of the first. (We can,
and sometimes do, allow the first LAN to correspond to everything in [Link]/20 that is not also in
[Link]/24; this is the longest-match rule.)
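The longest-match rule can be sketched as a short Python function; the forwarding table below is hypothetical, with a /24 carved out of an enclosing /20 so that the more specific entry wins for addresses it covers:

```python
# Sketch of longest-prefix matching: among all table entries whose
# prefix matches the destination, the one with the longest prefix
# length (the most specific) is chosen.

def ip_to_int(dotted):
    a, b, c, d = (int(x) for x in dotted.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

def longest_match(dest, table):
    """table: list of (prefix, length, nexthop) entries. Returns the
    nexthop of the longest matching prefix, or None if none match."""
    best = None
    for prefix, k, nexthop in table:
        mask = ((1 << k) - 1) << (32 - k)
        if (ip_to_int(dest) & mask) == (ip_to_int(prefix) & mask):
            if best is None or k > best[0]:
                best = (k, nexthop)
    return best[1] if best else None

# A /24 inside an enclosing /20 (hypothetical addresses):
table = [("10.1.16.0", 20, "LAN-A"), ("10.1.17.0", 24, "LAN-B")]
```

Here `longest_match("10.1.17.5", table)` yields "LAN-B" because both entries match but the /24 is longer, while "10.1.18.5" falls only under the /20 and yields "LAN-A".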
The strategy above is actually a slight simplification of what the subnet mechanism actually allows: subnet
address bits do not in fact have to be contiguous, and masks do not have to be a series of 1-bits followed by
0-bits. The mask can be any bit-mask; the subnet address bits are by definition those where there is a 1 in
the mask bits. For example, we could at a Class-B site use the fourth byte as the subnet address, and the
third byte as the host address. The subnet mask would then be 255.255.0.255. While this generality was
once sometimes useful in dealing with legacy IP addresses that could not easily be changed, life is simpler
when the subnet bits precede the host bits.
size    decimal range    subnet bits
128     128-255          1
64      0-63             00
32      64-95            010
32      96-127           011
As desired, none of the subnet addresses in the third column is a prefix of any other subnet address.
The end result of all of this is that routing is now hierarchical: we route on the site IP address to get to a
site, and then route on the subnet address within the site.
containing DLAN. Because the original request contained ALAN, D's response can be sent directly to A, that
is, unicast.
Additionally, all hosts maintain an ARP cache, consisting of ⟨IP,LAN⟩ address pairs for other hosts on the
network. After the exchange above, A has ⟨DIP,DLAN⟩ in its table; anticipating that A will soon send it a
packet to which it needs to respond, D also puts ⟨AIP,ALAN⟩ into its cache.
ARP-cache entries eventually expire. The timeout interval used to be on the order of 10 minutes, but Linux
systems now use a much smaller timeout (~30 seconds observed in 2012). Somewhere along the line, and
probably related to this shortened timeout interval, repeat ARP queries about a timed-out entry are first sent
unicast, not broadcast, to the previous Ethernet address on record. This cuts down on the total amount of
broadcast traffic; LAN broadcasts are, of course, still needed for new hosts. The ARP cache on a Linux
system can be examined with the command ip -s neigh; the corresponding Windows command is arp
-a.
The above protocol is sufficient, but there is one further point. When A sends its broadcast "who-has D?"
ARP query, all other hosts C check their own cache for an entry for A. If there is such an entry (that is, if
AIP is found there), then the value for ALAN is updated with the value taken from the ARP message; if there
is no pre-existing entry then no action is taken. This update process serves to avoid stale ARP-cache entries,
which can arise if a host has had its Ethernet card replaced.
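This update-only-if-present rule might be sketched as follows; the cache representation and addresses are hypothetical:

```python
# Sketch of the ARP-cache update rule: when a broadcast query from
# (sender_ip, sender_lan) arrives, a host refreshes an existing cache
# entry for sender_ip, but does NOT create a new one.

def on_arp_query(cache, sender_ip, sender_lan):
    if sender_ip in cache:           # pre-existing entry: refresh it
        cache[sender_ip] = sender_lan
    # no pre-existing entry: take no action

cache = {"10.0.0.5": "aa:bb:cc:dd:ee:01"}
on_arp_query(cache, "10.0.0.5", "aa:bb:cc:dd:ee:99")  # 10.0.0.5's card replaced
on_arp_query(cache, "10.0.0.9", "aa:bb:cc:dd:ee:02")  # unknown sender: ignored
```

After these two queries, the stale entry for 10.0.0.5 has been corrected, and no entry has been created for the previously unknown 10.0.0.9.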
make any necessary cache updates. Finally, ACD requires that hosts that do detect a duplicate address must
discontinue using it.
It is also possible for other stations to answer an ARP query on behalf of the actual destination D; this is
called proxy ARP. An early common scenario for this was when host C on a LAN had a modem connected
to a serial port. In theory a host D dialing in to this modem should be on a different subnet, but that requires
allocation of a new subnet. Instead, many sites chose a simpler arrangement. A host that dialed in to C's
serial port might be assigned IP address DIP, from the same subnet as C. C would be configured to route
packets to D; that is, packets arriving from the serial line would be forwarded to the LAN interface, and
packets sent to CLAN addressed to DIP would be forwarded to D. But we also have to handle ARP, and as
D is not actually on the LAN it will not receive broadcast ARP queries. Instead, C would be configured to
answer on behalf of D, replying with ⟨DIP,CLAN⟩. This generally worked quite well.
Proxy ARP is also used in Mobile IP, for the so-called home agent to intercept traffic addressed to the
home address of a mobile device and then forward it (eg via tunneling) to that device. See 7.11 Mobile
IP.
One delicate aspect of the ARP protocol is that stations are required to respond to a broadcast query. In the
absence of proxies this theoretically should work quite well: there should be only one respondent. However,
there were anecdotes from the Elder Days of networking when a broadcast ARP query would trigger an
avalanche of responses. The protocol-design moral here is that determining who is to respond to a broadcast
message should be done with great care. (RFC 1122 section 3.2.2 addresses this same point in the context
of responding to broadcast ICMP messages.)
ARP-query implementations also need to include a timeout and some queues, so that queries can be resent
if lost and so that a burst of packets does not lead to a burst of queries. A naive ARP algorithm without these
might be:
To send a packet to destination DIP, see if DIP is in the ARP cache. If it is, address the packet
to DLAN; if not, send an ARP query for D.
To see the problem with this approach, imagine that a 32KB packet arrives at the IP layer, to be sent over
Ethernet. It will be fragmented into 22 fragments (assuming an Ethernet MTU of 1500 bytes), all sent at
once. The naive algorithm above will likely send an ARP query for each of these. What we need instead is
something like the following:
To send a packet to destination DIP :
If DIP is in the ARP cache, send to DLAN and return
If not, see if an ARP query for DIP is pending.
If it is, put the current packet in a queue for D.
If there is no pending ARP query for DIP , start one,
again putting the current packet in the (new) queue for D
We also need:
If an ARP query for some CIP times out, resend it (up to a point)
If an ARP query for CIP is answered, send off any packets in C's queue
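The strategy above might be sketched in Python as follows (a hypothetical ArpSender class; real implementations live in the kernel, and timeout/retransmission of queries is omitted here for brevity):

```python
# Sketch of the queued-ARP-query strategy: at most one outstanding
# query per destination, with packets held until the reply arrives.

class ArpSender:
    def __init__(self, send_query, send_frame):
        self.cache = {}          # D_IP -> D_LAN
        self.pending = {}        # D_IP -> list of queued packets
        self.send_query = send_query   # callback: broadcast an ARP query
        self.send_frame = send_frame   # callback: transmit to a LAN address

    def send(self, dest_ip, packet):
        if dest_ip in self.cache:              # in the ARP cache: send now
            self.send_frame(self.cache[dest_ip], packet)
        elif dest_ip in self.pending:          # query already outstanding
            self.pending[dest_ip].append(packet)
        else:                                  # start a new query
            self.pending[dest_ip] = [packet]
            self.send_query(dest_ip)

    def on_reply(self, dest_ip, dest_lan):
        """ARP reply arrived: cache it and drain the waiting queue."""
        self.cache[dest_ip] = dest_lan
        for pkt in self.pending.pop(dest_ip, []):
            self.send_frame(dest_lan, pkt)
```

With this in place, a burst of fragments to a new destination generates a single ARP query; the fragments wait in the per-destination queue until the reply arrives, and later sends are cache hits.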
Here is an ARP-based strategy, sometimes known as ARP Spoofing. First, B makes sure the real S is down,
either by waiting until scheduled downtime or by launching a denial-of-service attack against S.
When A tries to connect, it will begin with an ARP "who-has S?". All B has to do is answer, "S is-at B".
There is a trivial way to do this: B simply needs to set its own IP address to that of S.
A will connect, and may be convinced to give its password to B. B now simply responds with something
plausible like "backup in progress; try later", and meanwhile uses A's credentials against the real S.
This works even if the communications channel A uses is encrypted! If A is using the SSH protocol
(21.9.1 SSH), then A will get a message that the other side's key has changed (B will present its own
SSH key, not S's). Unfortunately, many users (and even some IT departments) do not recognize this as a
serious problem. Some organizations (especially schools and universities) use personal workstations with
frozen configuration, so that the filesystem is reset to its original state on every reboot. Such systems may
be resistant to viruses, but in these environments the user at A will always get a message to the effect that
S's credentials are not known.
Recall that ARP is based on the idea of someone broadcasting an ARP query for a host, containing the host's
IP address, and the host answering it with its LAN address. DHCP involves a host, at startup, broadcasting a
query containing its own LAN address, and having a server reply telling the host what IP address is assigned
to it. The DHCP response is likely to contain several other essential startup options as well, including:
IP address
subnet mask
default router
DNS Server
These four items are a pretty standard minimal network configuration.
Default Routers and DHCP
If you lose your default router, you cannot communicate. Here is something that used to happen to me,
courtesy of DHCP:
1. I am connected to the Internet via Ethernet, and my default router is via my Ethernet interface
2. I connect to my institution's wireless network.
3. Their DHCP server sends me a new default router on the wireless network. However, this default
router will only allow access to a tiny private network, because I have neglected to complete the
Wi-Fi network registration process.
4. I therefore disconnect from the wireless network, and my wireless-interface default router goes
away. However, my system does not automatically revert to my Ethernet default-router entry;
DHCP does not work that way. As a result, I will have no router at all until the next scheduled
DHCP lease renegotiation, and must fix things manually.
The DHCP server has a range of IP addresses to hand out, and maintains a database of which IP address has
been assigned to which LAN address. Reservations can either be permanent or dynamic; if the latter, hosts
typically renew their DHCP reservation periodically (typically one to several times a day).
Sometimes, whether through error or accident, there end up being two DHCP servers on the same subnet.
This often results in chaos, as two different hosts may be assigned the same IP address, or a host's IP
address may suddenly change if it gets a new IP address from the other server.
Disabling one of the DHCP servers fixes this.
While omnipresent DHCP servers have made IP autoconfiguration work out of the box in many cases, in
the era in which IP was designed the need for such servers would have been seen as a significant drawback in
terms of expense and reliability. IPv6 has an autoconfiguration strategy (8.7.2 Stateless Autoconfiguration
(SLAAC)) that does not require DHCP, though DHCPv6 may well end up displacing it.
Type                          Description
Echo Request                  ping queries
Echo Reply                    ping responses
Destination Unreachable       Destination network unreachable
                              Destination host unreachable
                              Destination port unreachable
                              Fragmentation required but DF flag set
                              Network administratively prohibited
Source Quench                 Congestion control
Redirect Message              Redirect datagram for the network
                              Redirect datagram for the host
                              Redirect for TOS and network
                              Redirect for TOS and host
Router Solicitation           Router discovery/selection/solicitation
Time Exceeded                 TTL expired in transit
                              Fragment reassembly time exceeded
Bad IP Header or Parameter    Pointer indicates the error
                              Missing a required option
                              Bad length
ICMP is perhaps best known for Echo Request/Reply, on which the ping tool is based. Ping remains very
useful for network troubleshooting: if you can ping a host, then the network is reachable, and any problems
are higher up the protocol chain. Unfortunately, ping replies are blocked by default by many firewalls, on
the theory that revealing even the existence of computers is a security risk. While this may be an appropriate
decision, it does significantly impair the utility of ping. Most routers do still pass ping requests, but some
site routers block them.
Source Quench was used to signal that congestion had been encountered. A router that dropped a packet due
to congestion was encouraged to send ICMP Source Quench to the originating host. Generally the
TCP layer would handle these appropriately (by reducing the overall sending rate), but UDP applications
never receive them. ICMP Source Quench did not quite work out as intended, and was formally deprecated
by RFC 6633. (Routers can inform TCP connections of impending congestion by using the ECN bits.)
The Destination Unreachable type has a large number of subtypes:
Network unreachable: some router had no entry for forwarding the packet, and no default route
Host unreachable: the packet reached a router that was on the same LAN as the host, but the host
failed to respond to ARP queries
Port unreachable: the packet was sent to a UDP port on a given host, but that port was not open.
TCP, on the other hand, deals with this situation by replying to the connecting endpoint with a reset
packet. Unfortunately, the UDP Port Unreachable message is sent to the host, not to the application on
that host that sent the undeliverable packet, and so is close to useless as a practical way for applications
to be informed when packets cannot be delivered.
Fragmentation required but DF flag set: a packet arrived at a router and was too big to be forwarded
without fragmentation. However, the Don't Fragment bit in the IP header was set, forbidding fragmentation.
Later we will see how TCP uses this option as part of Path MTU Discovery, the process
of finding the largest packet we can send to a specific destination without fragmentation. The basic
idea is that we set the DF bit on some of the packets we send; if we get back this message, that packet
was too big.
Administratively Prohibited: this is sent by a router that knows it can reach the network in question,
but has been configured to drop the packet and send back an Administratively Prohibited message. A router
can also be configured to blackhole messages: to drop the packet and send back nothing.
7.9.2 Redirects
Most non-router hosts start up with an IP forwarding table consisting of a single (default) router, discovered
along with their IP address through DHCP. ICMP Redirect messages help hosts learn of other useful routers.
Here is a classic example:
A is configured so that its default router is R1. It addresses a packet to B, and sends it to R1. R1 receives
the packet, and forwards it to R2. However, R1 also notices that R2 and A are on the same network, and
so A could have sent the packet to R2 directly. So R1 sends an appropriate ICMP redirect message to A
(Redirect Datagram for the Network), and A adds a route to B via R2 to its own forwarding table.
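A's side of this exchange can be sketched as a forwarding table that begins with only a default entry and gains a more specific entry from the redirect (a hypothetical representation; real tables are keyed by network prefix):

```python
# Sketch of a host reacting to an ICMP redirect: lookup prefers a
# specific per-destination entry, falling back to the default router.

def next_hop(table, dest):
    return table.get(dest, table["default"])

def on_redirect(table, dest, better_router):
    """ICMP 'Redirect Datagram for the Network' received: future
    traffic to dest should go via better_router."""
    table[dest] = better_router

table = {"default": "R1"}
before = next_hop(table, "B")     # everything goes via the default, R1
on_redirect(table, "B", "R2")     # R1 tells A: reach B via R2 directly
after = next_hop(table, "B")      # now "R2"
```

Traffic to other destinations is unaffected: `next_hop(table, "C")` still returns "R1".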
7.9 Internet Control Message Protocol
The endpoints of L could always be assigned private IP addresses (7.3 Special Addresses), such as [Link]
and [Link]. To do this we would need to create a subnet; because the host bits cannot be all 0s or all 1s,
the minimum subnet size is four (eg [Link]/30). Furthermore, the routing protocols to be introduced in
9 Routing-Update Algorithms will distribute information about the subnet throughout the organization or
routing domain, meaning care must be taken to ensure that each link's subnet is unique. Use of unnumbered links avoids this.
If R1 were to originate a packet to be sent to (or forwarded via) R2, the standard strategy is for it to treat
its link0 interface as if it shared the IP address of its Ethernet interface eth0, that is, [Link]; R2
would do likewise. This still leaves R1 and R2 violating the IP local-delivery rule of 7.5 The Classless
IP Delivery Algorithm; R1 is expected to deliver packets via local delivery to [Link] but has no interface
that is assigned an IP address on the destination subnet [Link]/24. The necessary dispensation, however,
is granted by RFC 1812. All that is necessary by way of configuration is that R1 be told R2 is a directly
connected neighbor reachable via its link0 interface. On Linux systems this might be done with the ip
route command on R1 as follows:
ip route
The Linux ip route command illustrated here was tested on a virtual point-to-point link created with
ssh and pppd; the link interface name was in fact ppp0. While the command appeared to work as
advertised, it was only possible to create the link if endpoint IP addresses were assigned at the time
of creation; these were then removed with ip route del and then re-assigned with the command
shown here.
7.11 Mobile IP
In the original IPv4 model, there was a strong if implicit assumption that each IP host would stay put. One
role of an IP address is simply as a unique endpoint identifier, but another role is as a locator: some prefix
of the address (eg the network part, in the class-A/B/C strategy, or the provider prefix) represents something
about where the host is physically located. Thus, if a host moves far enough, it may need a new address.
When laptops are moved from site to site, it is common for them to receive a new IP address at each
location, eg via DHCP as the laptop connects to the local Wi-Fi. But what if we wish to support devices like
smartphones that may remain active and communicating while moving for thousands of miles? Changing
IP addresses requires changing TCP connections; life (and application development) might be simpler if a
device had a single, unchanging IP address.
One option, commonly used with smartphones connected to so-called 3G networks, is to treat the
phone's data network as a giant wireless LAN. The phone's IP address need not change as it moves within
this LAN, and it is up to the phone provider to figure out how to manage LAN-level routing, much as is
done in 3.3.5 Wi-Fi Roaming.
But Mobile IP is another option, documented in RFC 5944. In this scheme, a mobile host has a permanent
home address and, while roaming about, will also have a temporary care-of address, which changes from
place to place. The care-of address might be, for example, an IP address assigned by a local Wi-Fi network,
and which in the absence of Mobile IP would be the IP address for the mobile host. (This kind of care-of
address is known as co-located; the care-of address can also be associated with some other device known
as a foreign agent in the vicinity of the mobile host.) The goal of Mobile IP is to make sure that the mobile
host is always reachable via its home address.
To maintain connectivity to the home address, a Mobile IP host needs to have a home agent back on the
home network; the job of the home agent is to maintain an IP tunnel that always connects to the device's
current care-of address. Packets arriving at the home network addressed to the home address will be forwarded
to the mobile device over this tunnel by the home agent. Similarly, if the mobile device wishes to
send packets from its home address (that is, with the home address as IP source address) it can use the
tunnel to forward the packet to the home agent.
The home agent may use proxy ARP (7.7.1 ARP Finer Points) to declare itself to be the appropriate
destination on the home LAN for packets addressed to the home (IP) address; it is then straightforward for
the home agent to forward the packets.
An agent discovery process is used for the mobile host to decide whether it is mobile or not; if it is, it then
needs to notify its home agent of its current care-of address.
There are several forms of packet encapsulation that can be used for Mobile IP tunneling, but the default one
is IP-in-IP encapsulation, defined in RFC 2003. In this process, the entire original IP packet (with header
addressed to the home address) is used as data for a new IP packet, with a new IP header (the outer header)
addressed to the care-of address.
A special value in the IP-header Protocol field indicates that IP-in-IP tunneling was used, so the receiver
knows to forward the packet on using the information in the inner header. The MTU of the tunnel will be
the original MTU of the path to the care-of address, minus the size of the outer header.
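The encapsulation can be sketched directly with byte manipulation: the outer header is a fresh 20-byte IPv4 header with Protocol = 4 (the IP-in-IP value from RFC 2003) and the entire original packet as its payload. This is a simplified sketch (checksums omitted, hypothetical addresses), not a complete implementation:

```python
import struct

IPPROTO_IPIP = 4   # 'IP in IP' protocol number, per RFC 2003

def ip_header(src, dst, payload_len, proto):
    """Build a minimal 20-byte IPv4 header (checksum left as zero)."""
    ver_ihl = (4 << 4) | 5              # version 4, header length 5 words
    total = 20 + payload_len
    return struct.pack("!BBHHHBBH4s4s",
                       ver_ihl, 0, total,
                       0, 0,            # ident, flags/fragment offset
                       64, proto, 0,    # TTL, protocol, checksum (0 here)
                       bytes(int(x) for x in src.split('.')),
                       bytes(int(x) for x in dst.split('.')))

def encapsulate(inner_packet, home_agent_ip, care_of_ip):
    """Use the entire original packet as the data of a new packet
    addressed to the care-of address."""
    outer = ip_header(home_agent_ip, care_of_ip, len(inner_packet),
                      IPPROTO_IPIP)
    return outer + inner_packet

def tunnel_mtu(path_mtu):
    """Tunnel MTU = path MTU to the care-of address minus outer header."""
    return path_mtu - 20
```

For a 40-byte inner packet the encapsulated result is 60 bytes, with the Protocol field (byte 9 of the outer header) equal to 4 so the receiver knows to forward using the inner header.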
7.12 Epilog
At this point we have concluded the basic mechanics of IPv4. Still to come is a discussion of how IP
routers build their forwarding tables. This turns out to be a complex topic, divided into routing within single
organizations and ISPs (9 Routing-Update Algorithms) and routing between organizations (10 Large-Scale IP Routing).
But before that, in the next chapter, we compare IPv4 with IPv6, now twenty years old but still seeing very
limited adoption. The biggest issue fixed by IPv6 is IPv4's lack of address space, but there are also several
other less dramatic improvements.
7.13 Exercises
1. Suppose an Ethernet packet represents a TCP acknowledgment; that is, the packet contains an IP header
and a 20-byte TCP header but nothing else. Is the IP packet here smaller than the Ethernet minimum-packet
size, and, if so, by how much?
2. How can a receiving host tell if an arriving IP packet is unfragmented?
3. How long will it take the IDENT field of the IP header to wrap around, if the sender host A sends a stream
of packets to host B as fast as possible? Assume the packet size is 1500 bytes and the bandwidth is 600
Mbps.
4. The following diagram has routers A, B, C, D and E; E is the border router connecting the site to
the Internet. All router-to-router connections are via Ethernet-LAN /24 subnets with addresses of the form
200.0.x. Give forwarding tables for each of A, B, C and D. Each table should include each of the listed
subnets and also a default entry that routes traffic toward router E.
[Diagram: routers A, B, C, D and E joined by the /24 subnets 200.0.5 through 200.0.10, with border router E connecting to the Internet.]
5. (This exercise is an attempt at modeling Internet-2 routing.) Suppose sites S1 ... Sn each have a single
connection to the standard Internet, and each site Si has a single IP address block Ai. Each site's connection
to the Internet is through a single router Ri; each Ri's default route points towards the standard Internet. The
sites also maintain a separate, higher-speed network among themselves; each site has a single link to this
separate network, also through Ri . Describe what the forwarding tables on each Ri will have to look like so
that traffic from one Si to another will always use the separate higher-speed network.
6. For each IP network prefix given (with length), identify which of the subsequent IP addresses are part of
the same subnet.
7. Suppose that the subnet bits below for the following five subnets A-E all come from the beginning of the
fourth byte of the IP address; that is, these are subnets of a /24 block.
A: 00
B: 01
C: 110
D: 111
E: 1010
(a). What are the sizes of each subnet, and the corresponding decimal ranges? Count the addresses with
host bits all 0s or with host bits all 1s as part of the subnet.
(b). How many IP addresses in the class-C block do not belong to any of the subnets A, B, C, D and E?
8. In 7.7 Address Resolution Protocol: ARP it was stated that, in newer implementations, repeat ARP
queries about a timed-out entry are first sent unicast, in order to reduce broadcast traffic. Why is this unicast
approach likely to succeed most of the time? Can you give an example of a situation in which the unicast
query would fail, but a followup broadcast query would succeed?
9. Suppose A broadcasts an ARP query "who-has B?", receives B's response, and proceeds to send B a
regular IP packet. If B now wishes to reply, why is it likely that A will already be present in B's ARP cache?
Identify a circumstance under which this can fail.
10. Suppose A broadcasts an ARP request "who-has B", but inadvertently lists the physical address of
another machine C instead of its own (that is, A's ARP query has IPsrc = A, but LANsrc = C). What will
happen? Will A receive a reply? Will any other hosts on the LAN be able to send to A? What entries will
be made in the ARP caches on A, B and C?
8 IP VERSION 6
What has been learned from experience with IPv4? First and foremost, more than 32 bits are needed for
addresses; the primary motive in developing IPv6 was the specter of running out of IPv4 addresses (something which, at the highest level, has already happened; see the discussion at the end of 1.10 IP - Internet
Protocol). Another important issue is that IPv4 requires a modest amount of effort at configuration; IPv6
was supposed to improve this.
By 1990 the IETF was actively interested in proposals to replace IPv4. A working group for the so-called
IP next generation, or IPng, was created in 1993 to select the new version; RFC 1550 was this group's
formal solicitation of proposals. In July 1994 the IPng directors voted to accept a modified version of the
Simple Internet Protocol, or SIP (unrelated to the Session Initiation Protocol), as the basis for IPv6.
SIP addresses were originally to be 64 bits in length, but in the month leading up to adoption this was
increased to 128. 64 bits would probably have been enough, but the problem is less the actual number than
the simplicity with which addresses can be allocated; the more bits, the easier this becomes, as sites can be
given relatively large address blocks without fear of waste. A secondary consideration in the 64-to-128 leap
was the potential to accommodate now-obsolete CLNP addresses, which were up to 160 bits in length (but
compressible).
IPv6 has to some extent returned to the idea of a fixed division between network and host portions: in most
ordinary-host cases, the first 64 bits of the address is the network portion (including any subnet portion)
and the remaining 64 bits represent the host portion. While there are some configuration alternatives here,
and while the IETF occasionally revisits the issue, at the present time the 64/64 split seems here to stay.
Routing, however, can, as in IPv4, be done on different prefixes of the address at different points of the
network. Thus, it is misleading to think of IPv6 as a return to Class A/B/C address allocation.
IPv6 is now twenty years old, and yet usage remains quite modest. However, the shortage in IPv4 addresses
has begun to loom ominously; IPv6 adoption rates may rise quickly if IPv4 addresses begin to climb in
price.
The above is an example of the standard IPv6 format for representing IPv4 addresses. A separate representation of IPv4 addresses, with the FFFF block replaced by 0-bits, is used for tunneling IPv6 traffic over
IPv4. The IPv6 loopback address is ::1 (that is, 127 0-bits followed by a 1-bit).
Network address prefixes may be written with the / notation, as in IPv4:
[Link]/60
RFC 3513 suggested that IPv6 unicast-address allocation initially be limited to addresses beginning
with the bits 001, that is, the 2000::/3 block (20 in binary is 0010 0000).
Generally speaking, IPv6 addresses consist of a 64-bit network prefix (perhaps including subnet bits) and a
64-bit host identifier. See 8.5 Network Prefixes.
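Python's standard ipaddress module can be used to experiment with these representations; the specific addresses below are illustrative:

```python
import ipaddress

# The IPv6 loopback address ::1 is 127 0-bits followed by a 1-bit,
# so as an integer it is simply 1:
loopback = ipaddress.IPv6Address("::1")

# IPv4-mapped IPv6 addresses place the IPv4 address after an FFFF
# block; the ipv4_mapped attribute recovers the embedded IPv4 address:
mapped = ipaddress.IPv6Address("::ffff:203.0.113.5")

# Membership in the 2000::/3 unicast block means the first 3 bits
# of the address are 001:
unicast = ipaddress.ip_network("2000::/3")
in_block = ipaddress.IPv6Address("2001:db8::1") in unicast
```

Here `int(loopback)` is 1, `mapped.ipv4_mapped` is the IPv4 address 203.0.113.5, and `in_block` is True, since 0x2001 begins with the bits 001.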
Generally speaking, fragmentation should be avoided at the application layer when possible. UDP-based
applications that attempt to transmit filesystem-sized (usually 8 KB) blocks of data remain persistent users
of fragmentation.
example, if I use IPv4 block 10.0.0.0/8 at home, and connect using VPN to a site also using 10.0.0.0/8, it is
possible that my printer will have the same IPv4 address as their application server.
another via a router on the LAN, even though they should in principle be able to communicate directly. IPv6
drops this restriction.
The Router Advertisement packets sent by the router should contain a complete list of valid network-address
prefixes, as the Prefix Information option. In simple cases this list may contain a single externally routable
64-bit prefix. If a particular LAN is part of multiple (overlapping) physical subnets, the prefix list will
contain an entry for each subnet; these 64-bit prefixes will themselves likely have a common prefix of length
N<64. For multihomed sites the prefix list may contain multiple unrelated prefixes corresponding to the
different address blocks. Finally, private and local IPv6 address prefixes may also be included.
Each prefix will have an associated lifetime; nodes receiving a prefix from an RA packet are to use it only
for the duration of this lifetime. On expiration (and likely much sooner) a node must obtain a newer RA
packet with a newer prefix list. The rationale for inclusion of the prefix lifetime is ultimately to allow sites
to easily renumber; that is, to change providers and switch to a new network-address prefix provided by a
new router. Each prefix is also tagged with a bit indicating whether it can be used for autoconfiguration, as
in 8.7.2 Stateless Autoconfiguration (SLAAC) below.
to support complete plug-and-play network setup: hosts on a completely isolated LAN could talk to one
another out of the box, and if a router was introduced connecting the LAN to the Internet, then hosts would
be able to determine unique, routable addresses from information available from the router.
In the early days of IPv6 development, in fact, DHCPv6 may have been intended only for address assignments to routers and servers, with SLAAC meant for ordinary hosts. In that era, it was still common for
IPv4 addresses to be assigned statically, via per-host configuration files. RFC 4862 states that SLAAC is
to be used when a site is not particularly concerned with the exact addresses hosts use, so long as they are
unique and properly routable.
SLAAC and DHCPv6 evolved to some degree in parallel. While SLAAC solves the autoconfiguration problem quite neatly, at this point DHCPv6 solves it just as effectively, and provides for greater administrative
control. For this reason, SLAAC may end up less widely deployed. On the other hand, SLAAC gives hosts
greater control over their IPv6 addresses, and so may end up offering hosts a greater degree of privacy by
allowing endpoint management of the use of private and temporary addresses (below).
When a host first begins the Neighbor Discovery process, it receives a Router Advertisement packet. In this
packet are two special bits: the M (managed) bit and the O (other configuration) bit. The M bit is set to
indicate that DHCPv6 is available on the network for address assignment. The O bit is set to indicate that
DHCPv6 is able to provide additional configuration information (eg the name of the DNS server) to hosts
that are using SLAAC to obtain their addresses.
The next step is to see if there is a router available. The host sends a Router Solicitation (RS) message
to the all-routers multicast address. A router if present should answer with a Router Advertisement
(RA) message that also contains a Prefix Information option; that is, a list of IPv6 network-address prefixes
(8.6.2 Prefix Discovery). The RA message will mark with a flag those prefixes eligible for use with
SLAAC; if no prefixes are so marked, then SLAAC should not be used. All prefixes will also be marked
with a lifetime, indicating how long the host may continue to use the prefix; once the prefix expires, the host
must obtain a new one via a new RA message.
The host chooses an appropriate prefix, stores the prefix-lifetime information, and, in the original version
of SLAAC, appends the prefix to the front of its host identifier to create what should now be a routable
address. The prefix length plus the host-identifier length must equal 128 bits; in the most common case each
is 64 bits. The address so formed must now be verified through the duplicate-address-detection mechanism
above.
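The original prefix-plus-host-identifier construction (the EUI-64 scheme) can be sketched as follows. This is an illustration only: the function name is mine, Python's ipaddress module is used for canonical formatting, and the sample MAC address is inferred from the example addresses in 8.10.1 below.

```python
import ipaddress

def slaac_address(prefix: str, mac: str) -> str:
    """Combine a 64-bit network prefix (e.g. 'fe80::/64') with the EUI-64
    interface identifier derived from a 48-bit Ethernet MAC address."""
    octets = bytearray(int(b, 16) for b in mac.split(":"))
    octets[0] ^= 0x02                                # flip the universal/local bit
    eui64 = octets[:3] + b"\xff\xfe" + octets[3:]    # insert ff:fe in the middle
    net = ipaddress.IPv6Network(prefix)
    iid = int.from_bytes(bytes(eui64), "big")        # 64-bit host identifier
    return str(ipaddress.IPv6Address(int(net.network_address) | iid))

# MAC inferred from the second responder in 8.10.1, fe80::2a0:ccff:fe24:b0e4
print(slaac_address("fe80::/64", "00:a0:cc:24:b0:e4"))
# fe80::2a0:ccff:fe24:b0e4
```

Note the two signature features visible in the result: the ff:fe in the middle of the host identifier, and the flipped seventh bit (00 becomes 02) of the first MAC byte.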
An address generated in this way will, because of the embedded host identifier, uniquely identify the host
for all time. This includes identifying the host even when it is connected to a new network and is given
a different network prefix. Therefore, RFC 4941 defines a set of privacy extensions to SLAAC: optional
mechanisms for the generation of alternative host identifiers, based on pseudorandom generation using the
original LAN-address-based host identifier as a seed value. The probability of two hosts accidentally
choosing the same host identifier in this manner is very small; the Neighbor Solicitation mechanism with
DAD must, however, still be used to verify that the address is in fact unique. DHCPv6 also provides an
option for temporary address assignments, also to improve privacy, but one of the potential advantages of
SLAAC is that this process is entirely under the control of the end system.
Regularly (eg every few hours, or less) changing the host portion of an IPv6 address will make external
tracking of a host slightly more difficult. However, for a residential site with only a handful of hosts, a
considerable degree of tracking may be obtained simply by using the common 64-bit prefix.
In theory, if another host B on the LAN wishes to contact host A with a SLAAC-configured address containing the original host identifier, and B knows A's IPv6 address A_IPv6, then B might extract A's LAN address
from the low-order bits of A_IPv6. This was never actually allowed, however, even before the RFC 4941
privacy options, as there is no way for B to know that A's address was generated via SLAAC at all. B would
always find A's LAN address through the usual process of IPv6 Neighbor Solicitation.
A host using SLAAC may receive multiple network prefixes, and thus generate for itself multiple addresses.
RFC 6724 defines a process for a host to determine, when it wishes to connect to destination address D,
which of its own multiple addresses to use. For example, if D is a site-local address, not globally visible,
then the host will likely want to use an address that is also site-local. RFC 6724 also includes mechanisms
to allow a host with a permanent public address (eg corresponding to a DNS entry) to prefer alternative
temporary or privacy addresses for outbound connections.
At the end of the SLAAC process, the host knows its IPv6 address (or set of addresses) and its default router.
In IPv4, these would have been learned through DHCP along with the identity of the host's DNS server; one
concern with SLAAC is that there is no obvious way for a host to find its DNS server. One strategy is to fall
back on DHCPv6 for this. However, RFC 6106 now defines a process by which IPv6 routers can include
DNS-server information in the RA packets they send to hosts as part of the SLAAC process; this completes
the final step of the autoconfiguration process.
How to get DNS names for SLAAC-configured IPv6 hosts into the DNS servers is an entirely separate
issue. One approach is simply not to give DNS names to such hosts. In the NAT-router model for IPv4
autoconfiguration, hosts on the inward side of the NAT router similarly do not have DNS names (although
they are also not reachable directly, while SLAAC IPv6 hosts would be reachable). If DNS names are needed
for hosts, then a site might choose DHCPv6 for address assignment instead of SLAAC. It is also possible to
figure out the addresses SLAAC would use (by identifying the host-identifier bits) and then creating DNS
entries for these hosts. Hosts can also use Dynamic DNS (RFC 2136) to update their own DNS records.
8.7.3 DHCPv6
The job of the DHCPv6 server is to tell an inquiring host its network prefix(es) and also supply a 64-bit host-identifier.
Hosts begin the process by sending a DHCPv6 request to the
All_DHCP_Relay_Agents_and_Servers multicast IPv6 address FF02::1:2 (versus the broadcast address for
IPv4). As with DHCPv4, the job of a relay agent is to tag a DHCP request with the correct current subnet, and then to forward it to the actual DCHPv6 server. This allows the DHCP server to be on a different subnet from the requester. Note that the use of multicast does nothing to diminish the need for
relay agents; use of the multicast group does not necessarily identify a requesters subnet. In fact, the
All_DHCP_Relay_Agents_and_Servers multicast address scope is limited to the current link; relay agents
then forward to the actual DHCP server using the site-scoped address All_DHCP_Servers.
Hosts using SLAAC to obtain their address can still use a special Information-Request form of DHCPv6 to
obtain their DNS server and any other static DHCPv6 information.
Clients may ask for temporary addresses. These are identified as such in the DHCPv6 request, and are
handled much like permanent address requests, except that the client may ask for a new temporary address
only a short time later. When the client does so, a different temporary address will be returned; a repeated
request for a permanent address, on the other hand, would usually return the same address as before.
When the DHCPv6 server returns a temporary address, it may of course keep a log of this address. The
absence of such a log is one reason SLAAC may provide a greater degree of privacy. Another concern
is that the DHCPv6 temporary-address sequence might have a flaw that would allow a remote observer to
infer a relationship between different temporary addresses; with SLAAC, a host is responsible itself for the
security of its temporary-address sequence and is not called upon to trust an external entity.
A DHCPv6 response contains a list (perhaps of length 1) of IPv6 addresses. Each separate address has an
expiration date. The client must send a new request before the expiration of any address it is actually using;
unlike for DHCPv4, there is no separate address lease lifetime.
In DHCPv4, the host portion of addresses typically comes from address pools representing small ranges
of integers such as 64-254; these values are generally allocated consecutively. A DHCPv6 server, on the
other hand, should take advantage of the enormous range (2^64) of possible host portions by allocating values
more sparsely, through the use of pseudorandomness. This makes it very difficult for an outsider who knows
one of a site's host addresses to guess the addresses of other hosts. Some DHCPv6 servers, however, do not
yet support this; such servers make the SLAAC approach more attractive.
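A sparse allocator along these lines might look like the following sketch (the function name is mine; a real server would also record leases, lifetimes, and client identifiers):

```python
import secrets

def allocate_host_id(allocated: set) -> int:
    """Draw a 64-bit host identifier uniformly at random, retrying on the
    (astronomically unlikely) event of a collision with an existing lease."""
    while True:
        hid = secrets.randbits(64)       # cryptographically strong randomness
        if hid not in allocated:
            allocated.add(hid)
            return hid
```

With 2^64 possible values, even a site with millions of leases leaves an outsider essentially no chance of guessing another host's address from one known address.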
While it would be convenient to distribute only the /64 prefix via manual configuration, and have SLAAC supply the low-order 64 bits, this option is not described in the SLAAC RFCs and seems not to be available in common
implementations.
8.9 ICMPv6
RFC 4443 defines an updated version of the ICMP protocol. It includes an IPv6 version of ICMP Echo
Request / Echo Reply, upon which the ping command is based. It also handles the error conditions below;
this list is somewhat cleaner than the corresponding ICMPv4 list:
Destination Unreachable
In this case, one of the following numeric codes is returned:
0. No route to destination, returned when a router has no next_hop entry.
1. Communication with destination administratively prohibited, returned when a router has a
next_hop entry, but declines to use it for policy reasons. Codes 5 and 6 are special cases; these
more-specific codes are returned when appropriate.
2. Beyond scope of source address, returned when a router is, for example, asked to route a packet
to a global address, but the return address is site-local. In IPv4, when a host with a private address
attempts to connect to a global address, NAT is almost always involved.
3. Address unreachable, a catchall category for routing failure not covered by any other message. An
example is if the packet was successfully routed to the last_hop router, but Neighbor Discovery failed
to find a LAN address corresponding to the IPv6 address.
4. Port unreachable, returned when, as in ICMPv4, the destination host does not have the requested
UDP port open.
5. Source address failed ingress/egress policy, see code 1.
6. Reject route to destination, see code 1.
Packet Too Big
This is like ICMPv4's Fragmentation Required but Don't-Fragment flag set; IPv6, however, has no router-based fragmentation.
Time Exceeded
This is used for cases where the Hop Limit was exceeded, and also where source-based fragmentation was
used and the fragment-reassembly timer expired.
Parameter Problem
This is used when there is a malformed entry in the IPv6 header, eg an unrecognized Next Header value.
8.10.1 ping6
We will start with the linux version of ping6, the IPv6 analogue of the familiar ping command. It is
used to send ICMPv6 Echo Requests. The ping6 command supports an option to specify the interface (-I
eth0); as noted above, this is mandatory when sending to link-local addresses.
ping6 ::1: This allows me to ping my own loopback address.
ping6 -I eth0 ff02::1: This pings the all-nodes multicast group on interface eth0. I get these answers:
64 bytes from fe80::3e97:eff:fe2c:2beb (this is the host I am pinging from)
64 bytes from fe80::2a0:ccff:fe24:b0e4 (another linux host)
My VoIP phone on the same subnet but apparently supporting IPv4 only remains mute.
ping6 -I eth0 fe80::6267:20ff:fe72:8960: This pings the link-local address of the other linux host answering
the previous query. Note the ff:fe in the host identifier. Also note the flipped seventh bit of the two bytes
02a0; the other linux host has Ethernet address [Link].
8.12 Epilog
IPv4 has run out of large address blocks, as of 2011. IPv6 has reached a mature level of development. Most
common operating systems provide excellent IPv6 support.
Yet conversion has been slow. Many ISPs still provide limited (to nonexistent) support, and inexpensive
IPv6 firewalls to replace the ubiquitous consumer-grade NAT routers do not really exist. Time will tell how
all this evolves. However, while IPv6 has now been around for twenty years, top-level IPv4 address blocks
disappeared just three years ago. It is quite possible that this will prove to be just the catalyst IPv6 needs.
8.13 Exercises
1. Each IPv6 address is associated with a specific solicited-node multicast address. Explain why, on a
typical Ethernet, if the original IPv6 host address was obtained via SLAAC then the LAN multicast group
corresponding to the host's solicited-node multicast address is likely to be small, in many cases consisting
of one host only. (Packet delivery to small LAN multicast groups can be much more efficient than delivery
to large multicast groups.)
(b). What steps might a DHCPv6 server take to ensure that, for the IPv6 addresses it hands out, the LAN
multicast groups corresponding to the host addresses' solicited-node multicast addresses will be small?
2. If an attacker sends a large number of probe packets via IPv4, you can block them by blocking the
attacker's IP address. Now suppose the attacker uses IPv6 to launch the probes; for each probe, the attacker
changes the low-order 64 bits of the address. Can these probes be blocked efficiently? If so, what do you
have to block? Might you also be blocking other users?
3. Suppose someone tried to implement ping6 so that, if the address was a link-local address and no interface
was specified, the ICMPv6 Echo Request was sent out all non-loopback interfaces. Could the end result be
different than conventional ping6 with the correct interface supplied? If so, how likely is this?
4. Create an IPv6 ssh connection as in 8.10 Routerless Connection Examples. Examine the connection's
packets using Wireshark or the equivalent. Does the TCP handshake (12.3 TCP Connection Establishment)
look any different over IPv6?
5. Create an IPv6 ssh connection using manually configured addresses as in 8.10.3 Manual address
configuration. Again use Wireshark or the equivalent to monitor the connection. Is DAD (8.7.1 Duplicate
Address Detection) used?
9 ROUTING-UPDATE ALGORITHMS
Routers identify their router neighbors (through some sort of neighbor-discovery mechanism), and
add a third column to their forwarding tables for cost; table entries are thus of the form
⟨destination,next_hop,cost⟩. The simplest case is to assign a cost of 1 to each link (the hopcount metric); it is also possible to assign more complex numbers.
Each router then reports the ⟨destination,cost⟩ portion of its table to its neighboring routers at regular
intervals (these table portions are the "vectors" of the algorithm's name). It does not matter if neighbors
exchange reports at the same time, or even at the same rate.
Each router also monitors its continued connectivity to each neighbor; if neighbor N becomes unreachable
then its reachability cost is set to infinity.
Actual destinations in IP would be networks attached to routers; one router might be directly connected
to several such destinations. In the following, however, we will identify all a router's directly connected
networks with the router itself. That is, we will build forwarding tables to reach every router. While
it is possible that one destination network might be reachable by two or more routers, thus breaking our
identification of a router with its set of attached networks, in practice this is of little concern. See exercise 4
for an example in which networks are not identified with adjacent routers.
9.1.2 Example 1
For our first example, no links will break and thus only the first two rules above will be used. We will start
out with the network below with empty forwarding tables; all link costs are 1.
After initial neighbor discovery, here are the forwarding tables. Each node has entries only for its directly
connected neighbors:
A: ⟨B,B,1⟩ ⟨C,C,1⟩ ⟨D,D,1⟩
B: ⟨A,A,1⟩ ⟨C,C,1⟩
C: ⟨A,A,1⟩ ⟨B,B,1⟩ ⟨E,E,1⟩
D: ⟨A,A,1⟩ ⟨E,E,1⟩
E: ⟨C,C,1⟩ ⟨D,D,1⟩
Now let D report to A; it sends records ⟨A,1⟩ and ⟨E,1⟩. A ignores D's ⟨A,1⟩ record, but ⟨E,1⟩ represents a
new destination; A therefore adds ⟨E,D,2⟩ to its table. Similarly, let A now report to D, sending ⟨B,1⟩ ⟨C,1⟩
⟨D,1⟩ ⟨E,2⟩ (the last is the record we just added). D ignores A's records ⟨D,1⟩ and ⟨E,2⟩, but A's records
⟨B,1⟩ and ⟨C,1⟩ cause D to create entries ⟨B,A,2⟩ and ⟨C,A,2⟩. A's and D's tables are now, in fact, complete.
Now suppose C reports to B; this gives B an entry ⟨E,C,2⟩. If C also reports to E, then E's table will have
⟨A,C,2⟩ and ⟨B,C,2⟩. The tables are now:
A: ⟨B,B,1⟩ ⟨C,C,1⟩ ⟨D,D,1⟩ ⟨E,D,2⟩
B: ⟨A,A,1⟩ ⟨C,C,1⟩ ⟨E,C,2⟩
C: ⟨A,A,1⟩ ⟨B,B,1⟩ ⟨E,E,1⟩
D: ⟨A,A,1⟩ ⟨E,E,1⟩ ⟨B,A,2⟩ ⟨C,A,2⟩
E: ⟨C,C,1⟩ ⟨D,D,1⟩ ⟨A,C,2⟩ ⟨B,C,2⟩
We have two missing entries: B and C do not know how to reach D. If A reports to B and C, the tables will
be complete; B and C will each reach D via A at cost 2. However, the following sequence of reports might
also have occurred:
E reports to C, causing C to add ⟨D,E,2⟩
C reports to B, causing B to add ⟨D,C,3⟩
In this case we have 100% reachability but B routes to D via the longer-than-necessary path B→C→E→D.
However, one more report will fix this: suppose A reports to B. B will receive ⟨D,1⟩ from A, and will
update its entry ⟨D,C,3⟩ to ⟨D,A,2⟩.
Note that A routes to E via D while E routes to A via C; this asymmetry was due to indeterminateness in the
order of initial table exchanges.
If all link weights are 1, and if each pair of neighbors exchange tables once before any pair starts a second
exchange, then the above process will discover the routes in order of length, ie the shortest paths will be the
first to be discovered. This is not, however, a particularly important consideration.
9.1.3 Example 2
The next example illustrates link weights other than 1. The first route discovered between A and B is the
direct route with cost 8; eventually we discover the longer A→C→D→B route with cost 2+1+3 = 6.
9.1.4 Example 3
Our third example will illustrate how the algorithm proceeds when a link breaks. We return to the first
diagram, with all tables completed, and then suppose the D–E link breaks. This is the "bad news" case: a
link has broken, and is no longer available; this will bring the third rule into play.
We shall assume, as above, that A reaches E via D, but we will here assume, contrary to Example 1, that
C reaches D via A (see exercise 3.5 for the original case).
Initially, upon discovering the break, D and E update their tables to ⟨E,-,∞⟩ and ⟨D,-,∞⟩ respectively
(whether or not they actually enter ∞ into their tables is implementation-dependent; we may consider this
as equivalent to removing their entries for one another; the "-" as next_hop indicates there is no next_hop).
Eventually D and E will report the break to their respective neighbors A and C. A will apply the bad-news
rule above and update its entry for E to ⟨E,-,∞⟩. We have assumed that C, however, routes to D via A, and
so it will ignore E's report.
We will suppose that the next steps are for C to report to E and to A. When C reports its route ⟨D,2⟩ to E,
E will add the entry ⟨D,C,3⟩, and will again be able to reach D. When C reports to A, A will add the route
⟨E,C,2⟩. The final step will be when A next reports to D, and D will have ⟨E,A,3⟩. Connectivity is restored.
9.1.5 Example 4
The previous examples have had a global perspective in that we looked at the entire network. In the next
example, we look at how one specific router, R, responds when it receives a distance-vector report from its
neighbor S. Neither R nor S nor we have any idea of what the entire network looks like. Suppose R's table
is initially as follows, and the S–R link has cost 1:
destination   next_hop   cost
A             S          3
B             T          4
C             S          5
D             U          6
S now sends R the following report, containing only destinations and its costs:
destination   cost
A             2
B             3
C             5
D             4
E             2
R then updates its table as follows:

destination   next_hop   cost   reason
A             S          3      No change; S probably sent this report before
B             T          4      No change; R's cost via S is tied with R's cost via T
C             S          6      Next_hop increase
D             S          5      Lower-cost route via S
E             S          3      New destination
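The update logic applied in this example can be sketched in code (a minimal illustration with my own names; real implementations must also handle report timeouts and unreachable destinations):

```python
def dv_update(table, neighbor, link_cost, report):
    """Apply the distance-vector update rules to table (dest -> (next_hop, cost)),
    given a (dest -> cost) report from a directly connected neighbor."""
    for dest, rcost in report.items():
        new_cost = rcost + link_cost
        if dest not in table:                          # new destination
            table[dest] = (neighbor, new_cost)
            continue
        nh, cost = table[dest]
        if new_cost < cost:                            # lower-cost route found
            table[dest] = (neighbor, new_cost)
        elif nh == neighbor and new_cost != cost:      # next_hop increase: report
            table[dest] = (neighbor, new_cost)         # came from current next_hop
    return table

# R's initial table and S's report, as in Example 4; the S-R link has cost 1
R = {"A": ("S", 3), "B": ("T", 4), "C": ("S", 5), "D": ("U", 6)}
dv_update(R, "S", 1, {"A": 2, "B": 3, "C": 5, "D": 4, "E": 2})
print(R)
# {'A': ('S', 3), 'B': ('T', 4), 'C': ('S', 6), 'D': ('S', 5), 'E': ('S', 3)}
```

Note that the entry for B does not change: the cost via S (3+1 = 4) ties the existing cost via T, and ties go to the incumbent.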
If A immediately reports to B that D is no longer reachable (distance = ∞), then all is well. However, it is
possible that B reports to A first, telling A that it has a route to D, with cost 2, which B still believes it has.
This means A now installs the entry ⟨D,B,3⟩. At this point we have what we called in 1.6 Routing Loops a
linear routing loop: if a packet is addressed to D, A will forward it to B and B will forward it back to A.
Worse, this loop will be with us a while. At some point A will report ⟨D,3⟩ to B, at which point B will
update its entry to ⟨D,A,4⟩. Then B will report ⟨D,4⟩ to A, and A's entry will be ⟨D,B,5⟩, etc. This process
is known as slow convergence to infinity. If A and B each report to the other once a minute, it will take
2,000 years for the costs to overflow an ordinary 32-bit integer.
Suppose the A–D link breaks, and A updates to ⟨D,-,∞⟩. A then reports ⟨D,∞⟩ to B, which updates its
table to ⟨D,-,∞⟩. But then, before A can also report ⟨D,∞⟩ to C, C reports ⟨D,2⟩ to B. B then updates to
⟨D,C,3⟩, and reports ⟨D,3⟩ back to A; neither this nor the previous report violates split-horizon. Now A's
entry is ⟨D,B,4⟩. Eventually A will report to C, at which point C's entry becomes ⟨D,A,5⟩, and the numbers
keep increasing as the reports circulate counterclockwise. The actual routing proceeds in the other direction,
clockwise.
Split horizon often also includes poison reverse: if A uses N as its next_hop to D, then A in fact reports
⟨D,∞⟩ to N, which is a more definitive statement that A cannot reach D by itself. However, coming up with
a scenario where poison reverse actually affects the outcome is not trivial.
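Report generation with split horizon and poison reverse can be sketched as follows (an illustration with my own names; 16 is used as "infinity", the value RIP adopts):

```python
def make_report(table, neighbor, INF=16):
    """Build the (dest -> cost) report to send to a given neighbor, applying
    split horizon with poison reverse: any route whose next_hop is that
    neighbor is reported back to it with infinite cost."""
    report = {}
    for dest, (next_hop, cost) in table.items():
        report[dest] = INF if next_hop == neighbor else cost
    return report

# A reaches D via B and E via C; the report sent to B poisons the D route
A = {"D": ("B", 3), "E": ("C", 2)}
print(make_report(A, "B"))
# {'D': 16, 'E': 2}
```

Plain split horizon would simply omit D from the report to B; poison reverse instead states the unreachability explicitly.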
[Link] Triggered Updates
In the original example, if A was first to report to B then the loop resolved immediately; the loop occurred
if B was first to report to A. Nominally each outcome has probability 50%. Triggered updates means that
any router should report immediately to its neighbors whenever it detects any change for the worse. If A
reports first to B in the first example, the problem goes away. Similarly, in the second example, if A reports
to both B and C before B or C report to one another, the problem goes away. There remains, however, a
small window where B could send its report to A just as A has discovered the problem, before A can report
to B.
[Link] Hold Down
Hold down is sort of a receiver-side version of triggered updates: the receiver does not use new alternative
routes for a period of time (perhaps two router-update cycles) following discovery of unreachability. This
gives time for bad news to arrive. In the first example, it would mean that when A received B's report ⟨D,2⟩,
it would set this aside. It would then report ⟨D,∞⟩ to B as usual, at which point B would now report ⟨D,∞⟩
back to A, at which point B's earlier report ⟨D,2⟩ would be discarded. A significant drawback of hold down
is that legitimate new routes are also delayed by the hold-down period.
These mechanisms for preventing slow convergence are, in the real world, quite effective. The Routing
Information Protocol (RIP, RFC 2453) implements all but hold-down, and has been widely adopted at
smaller installations.
However, the potential for routing loops and the limited value for infinity led to the development of alternatives. One alternative is the link-state strategy, 9.5 Link-State Routing-Update Algorithm. Another
alternative is Ciscos Enhanced Interior Gateway Routing Protocol, or EIGRP, 9.4.2 EIGRP. While part of
the distance-vector family, EIGRP is provably loop-free, though to achieve this it must sometimes suspend
forwarding to some destinations while tables are in flux.
Now suppose that A and B use distance-vector but are allowed to choose the shortest route to within 10%. A
would get a report from C that D could be reached with cost 1, for a total cost of 21. The forwarding entry
via C would be ⟨D,C,21⟩. Similarly, A would get a report from B that D could be reached with cost 21, for
a total cost of 22: ⟨D,B,22⟩. Similarly, B has choices ⟨D,C,21⟩ and ⟨D,A,22⟩.
If A and B both choose the minimal route, no loop forms. But if A and B both use the 10%-overage rule,
they would be allowed to choose the other route: A could choose ⟨D,B,22⟩ and B could choose ⟨D,A,22⟩.
If this happened, we would have a routing loop: A would forward packets for D to B, and B would forward
them right back to A.
As we apply distance-vector routing, each router independently builds its tables. A router might have some
notion of the path its packets would take to their destination; for example, in the case above A might believe
that with forwarding entry ⟨D,B,22⟩ its packets would take the path A→B→C→D (though in distance-vector
routing, routers do not particularly worry about the big picture). Consider again the accurate-cost question
above. This fails in the 10%-overage example, because the actual path is now infinite.
We now prove that, in distance-vector routing, the network will have accurate costs, provided
• each router selects what it believes to be the shortest path to the final destination, and
• the network is stable, meaning that further dissemination of any reports would not result in changes.
To see this, suppose the actual route taken by some packet from source to destination, as determined by
application of the distributed algorithm, is longer than the cost calculated by the source. Choose an example
of such a path with the fewest number of links, among all such paths in the network. Let S be the source,
D the destination, and k the number of links in the actual path P. Let S's forwarding entry for D be ⟨D,N,c⟩,
where N is S's next_hop neighbor. To have obtained this route through the distance-vector algorithm, S must
have received report ⟨D,c_D⟩ from N, where we also have the cost of the S–N link as c_N and c = c_D + c_N. If
we follow a packet from N to D, it must take the same path P with the first link deleted; this sub-path has
length k-1 and so, by our hypothesis that k was the length of the shortest path with non-accurate costs, the
cost from N to D is c_D. But this means that the cost along path P, from S to D via N, must be c_D + c_N = c,
contradicting our selection of P as a path longer than its advertised cost.
There is one final observation to make about route costs: any cost-minimization can occur only within
a single routing domain, where full information about all links is available. If a path traverses multiple
routing domains, each separate routing domain may calculate the optimum path traversing that domain. But
these local minimums do not necessarily add up to a globally minimal path length, particularly when
one domain calculates the minimum cost from one of its routers only to the other domain rather than to a
router within that other domain. Here is a simple example. Routers BR1 and BR2 are the border routers
connecting the domain LD to the left of the vertical dotted line with domain RD to the right. From A to B,
LD will choose the shortest path to RD (not to B, because LD is not likely to have information about links
within RD). This is the path of length 3 through BR2. But this leads to a total path length of 3+8=11 from
A to B; the global minimum path length, however, is 4+1=5, through BR1.
In this example, domains LD and RD join at two points. For a route across two domains joined at only a
single point, the domain-local shortest paths do add up to the globally shortest path.
9.4.1 DSDV
DSDV, or Destination-Sequenced Distance Vector, was proposed in [PB94]. It avoids routing loops by
the introduction of sequence numbers: each router will always prefer routes with the most recent sequence
number, and bad-news information will always have a lower sequence number than the next cycle of corrected information.
DSDV was originally proposed for MANETs (3.3.8 MANETs) and has some additional features for traffic
minimization that, for simplicity, we ignore here. It is perhaps best suited for wired networks and for small,
relatively stable MANETs.
DSDV forwarding tables contain entries for every other reachable node in the system. One successor of
DSDV, Ad Hoc On-Demand Distance Vector routing or AODV, allows forwarding tables to contain only
those destinations in active use; a mechanism is provided for discovery of routes to newly active destinations.
See [PR99] and RFC 3561.
Under DSDV, each forwarding table entry contains, in addition to the destination, cost and next_hop, the current sequence number for that destination. When neighboring nodes exchange their distance-vector reachability reports, the reports include these per-destination sequence numbers.
When a router R receives a report from neighbor N for destination D, and the report contains a sequence
number larger than the sequence number for D currently in R's forwarding table, then R always updates to
use the new information. The three cost-minimization rules of 9.1.1 Distance-Vector Update Rules above
are used only when the incoming and existing sequence numbers are equal.
Each time a router R sends a report to its neighbors, it includes a new value for its own sequence number,
which it always increments by 2. This number is then entered into each neighbor's forwarding-table entry for
R, and is then propagated throughout the network via continuing report exchanges. Any sequence number
originating this way will be even, and whenever another node's forwarding-table sequence number for R is
even, then its cost for R will be finite.
Infinite-cost reports are generated in the usual way when former neighbors discover they can no longer reach
one another; however, in this case each node increments the sequence number for its former neighbor by 1,
thus generating an odd value. Any forwarding-table entry with infinite cost will thus always have an odd
sequence number. If A and B are neighbors, and A's current sequence number is s, and the A–B link breaks,
then B will start reporting A at distance ∞ with sequence number s+1 while A will start reporting its own
new sequence number s+2. Any other node now receiving a report originating with B (with sequence number
s+1) will mark A as having cost , but will obtain a valid route to A upon receiving a report originating
from A with new (and larger) sequence number s+2.
The triggered-update mechanism is used: if a node receives a report with some destinations newly marked
with infinite cost, it will in turn forward this information immediately to its other neighbors, and so on. This
is, however, not essential; bad and good reports are distinguished by sequence number, not by relative
arrival time.
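The sequence-number rule can be sketched in code (an illustration only; the names are mine, and real DSDV additionally tracks install times and the settling delays described below):

```python
def dsdv_update(table, neighbor, link_cost, report):
    """DSDV update sketch. table maps dest -> (next_hop, cost, seq);
    report maps dest -> (cost, seq). Newer sequence numbers always win;
    costs are compared only on a sequence-number tie. An odd seq marks
    an unreachable destination (cost infinity)."""
    for dest, (rcost, rseq) in report.items():
        new_cost = rcost + link_cost          # inf + 1 stays inf
        if dest not in table:
            table[dest] = (neighbor, new_cost, rseq)
            continue
        nh, cost, seq = table[dest]
        if rseq > seq:                        # newer information always wins
            table[dest] = (neighbor, new_cost, rseq)
        elif rseq == seq and new_cost < cost: # tie: ordinary cost comparison
            table[dest] = (neighbor, new_cost, rseq)
    return table

# A link break is reported with an odd seq; A's own later report carries
# a larger even seq and restores a valid route.
T = {"A": ("B", 2, 100)}                              # even seq: reachable
dsdv_update(T, "B", 1, {"A": (float("inf"), 101)})    # break: odd seq wins
dsdv_update(T, "C", 1, {"A": (3, 102)})               # newer even seq wins
print(T)
# {'A': ('C', 4, 102)}
```

Because the unreachability report (seq 101) is strictly older than A's next self-report (seq 102), the bad news cannot displace the good news, which is the heart of DSDV's loop avoidance.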
It is now straightforward to verify that the slow-convergence problem is solved. After a link break, if there is
some alternative path from router R to destination D, then R will eventually receive D's latest even sequence
number, which will be greater than any sequence number associated with any report listing D as unreachable.
If, on the other hand, the break partitioned the network and there is no longer any path to D from R, then the
highest sequence number circulating in R's half of the original network will be odd and the associated table
entries will all list D at cost ∞. One way or another, the network will quickly settle down to a state where
every destination's reachability is accurately described.
In fact, a stronger statement is true: not even transient routing loops are created. We outline a proof. First,
whenever router R has next_hop N for a destination D, then N's sequence number for D must be greater
than or equal to R's, as R must have obtained its current route to D from one of N's reports. A consequence
is that all routers participating in a loop for destination D must have the same (even) sequence number s for
D throughout. This means that the loop would have been created if only the reports with sequence number
s were circulating. As we noted in 9.1.1 Distance-Vector Update Rules, any application of the next_hop-increase rule must trace back to a broken link, and thus must involve an odd sequence number. Thus, the
loop must have formed from the sequence-number-s reports by the application of the first two rules only.
But this violates the claim in Exercise 10.
There is one drawback to DSDV: nodes may sometimes briefly switch to routes that are longer than optimum
(though still correct). This is because a router is required to use the route with the newest sequence number,
even if that route is longer than the existing route. If A and B are two neighbors of router R, and B is closer
to destination D but slower to report, then every time D's sequence number is incremented R will receive
A's longer route first, and switch to using it, and B's shorter route shortly thereafter.
DSDV implementations usually address this by having each router R keep track of the time interval between
the first arrival at R of a new route to a destination D with a given sequence number, and the arrival of the
best route with that sequence number. During this interval following the arrival of the first report with a new
sequence number, R will use the new route, but will refrain from including the route in the reports it sends
to its neighbors, anticipating that a better route will soon arrive.
This works best when the hopcount cost metric is being used, because in this case the best route is likely
to arrive first (as the news had to travel the fewest hops), and at the very least will arrive soon after the first
route. However, if the network's cost metric is unrelated to the hop count, then the time interval between
first-route and best-route arrivals can involve multiple update cycles, and can be substantial.
9.4.2 EIGRP
EIGRP, or the Enhanced Interior Gateway Routing Protocol, is a once-proprietary Cisco distance-vector
protocol that was released as an Internet Draft in February 2013. As with DSDV, it eliminates the risk of
routing loops, even ephemeral ones. It is based on the distributed update algorithm (DUAL) of [JG93].
EIGRP is an actual protocol; we present here only the general algorithm. Our discussion follows [CH99].
Each router R keeps a list of neighbor routers NR, as with any distance-vector algorithm. Each R also
maintains a data structure known (somewhat misleadingly) as its "topology table". It contains, for each
destination D and each N in NR, an indication of whether N has reported the ability to reach D and, if so, the
reported cost c(D,N). The router also keeps, for each N in NR, the cost cN of the link from R to N. Finally,
the forwarding-table entry for any destination can be marked "passive", meaning safe to use, or "active",
meaning updates are in process and the route is temporarily unavailable.
Initially, we expect that for each router R and each destination D, R's next_hop to D in its forwarding table
is the neighbor N for which the following total cost is a minimum:
c(D,N) + cN
Now suppose R receives a distance-vector report from neighbor N1 that it can reach D with cost c(D,N1).
This is processed in the usual distance-vector way, unless it represents an increased cost and N1 is R's
next_hop to D; this is the third case in 9.1.1 Distance-Vector Update Rules. In this case, let C be R's
current cost to D, and let us say that neighbor N of R is a feasible next_hop (feasible successor in Cisco's
terminology) if N's cost to D (that is, c(D,N)) is strictly less than C. R then updates its route to D to use the
feasible neighbor N for which c(D,N) + cN is a minimum. Note that this may not in fact be the shortest path;
it is possible that there is another neighbor M for which c(D,M) + cM is smaller, but c(D,M) ≥ C. However,
because N's path to D is loop-free, and because c(D,N) < C, this new path through N must also be loop-free;
this is sometimes summarized by the statement "one cannot create a loop by adopting a shorter route".
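The feasibility condition lends itself to a compact sketch. The function below is illustrative only (all names are invented here); `reports[N]` stands for c(D,N) as last reported by neighbor N, and `link_cost[N]` for the link cost cN.

```python
def best_feasible_next_hop(C, reports, link_cost):
    """Return the feasible next_hop minimizing c(D,N) + cN, or None.

    C is R's current cost to destination D.  A neighbor N is feasible
    (a "feasible successor") if its own reported cost to D, c(D,N),
    is strictly less than C; this is what rules out loops.
    """
    feasible = [N for N in reports if reports[N] < C]
    if not feasible:
        return None          # no feasible successor: R must invoke DUAL
    return min(feasible, key=lambda N: reports[N] + link_cost[N])
```

As in the text, a neighbor M offering a smaller total cost c(D,M) + cM is still rejected if c(D,M) ≥ C.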
If no neighbor N of R is feasible, which would be the case in the D–A–B example of 9.2 Distance-Vector
Slow-Convergence Problem, then R invokes the DUAL algorithm. This is sometimes called a diffusion
algorithm as it invokes a diffusion-like spread of table changes proceeding away from R.
Let C in this case denote the new cost from R to D as based on N1's report. R marks destination D as
active (which suppresses forwarding to D) and sends a special query to each of its neighbors, in the form
of a distance-vector report indicating that its cost to D has now increased to C. The algorithm terminates
when all R's neighbors reply back with their own distance-vector reports; at that point R marks its entry for
D as passive again.
Some neighbors may be able to process R's report without further diffusion to other nodes, remain passive,
and reply back to R immediately. However, other neighbors may, like R, now become active and continue
the DUAL algorithm. In the process, R may receive other queries that elicit its distance-vector report; as
long as R is active it will report its cost to D as C. We omit the argument that this process, and thus the
network, must eventually converge.
that every router sees every LSP, and also that no LSPs circulate repeatedly. (The acronym LSP is used by
a link-state implementation known as IS-IS; the preferred acronym used by the Open Shortest Path First
(OSPF) implementation is LSA, where A is for "advertisement".) LSPs are sent immediately upon link-state
changes, like the triggered updates of distance-vector protocols, except that there is no race between "bad news"
and "good news".
It is possible for ephemeral routing loops to exist; for example, if one router has received an LSP but another
has not, they may have an inconsistent view of the network and thus route to one another. However, as soon
as the LSP has reached all routers involved, the loop should vanish. There are no race conditions, as with
distance-vector routing, that can lead to persistent routing loops.
The link-state flooding algorithm avoids the usual problems of broadcast in the presence of loops by having
each node keep a database of all LSP messages. The originator of each LSP includes its identity, information
about the link that has changed status, and also a sequence number. Other routers need only keep in their
databases the LSP packet with the largest sequence number; older LSPs can be discarded. When a router
receives an LSP, it first checks its database to see if that LSP is old, or is current but has been received before;
in these cases, no further action is taken. If, however, an LSP arrives with a sequence number not seen
before, then in typical broadcast fashion the LSP is retransmitted over all links except the arrival interface.
As an example, consider the following arrangement of routers:
Suppose the A–E link status changes. A sends LSPs to C and B. Both these will forward the LSPs to D;
suppose B's arrives first. Then D will forward the LSP to C; the LSP traveling C→D and the LSP traveling
D→C might even cross on the wire. D will ignore the second LSP copy that it receives from C, and C will
ignore the second copy it receives from D.
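The flooding decision just described, drop old or duplicate LSPs and record and re-forward new ones, might be sketched as follows (an illustrative sketch; the variable names are invented here):

```python
def receive_lsp(db, origin, seq, arrival_iface, interfaces):
    """Process an LSP from router `origin` with sequence number `seq`.

    db maps each originating router to the largest sequence number seen
    from it.  Returns the interfaces on which to re-flood the LSP: all
    except the arrival interface for a new LSP, none for an old or
    duplicate one.
    """
    if origin in db and seq <= db[origin]:
        return []                                  # old or duplicate: drop
    db[origin] = seq                               # record the newest LSP
    return [i for i in interfaces if i != arrival_iface]
```

In the A–E example above, D re-floods B's copy of the LSP but returns the empty list for the later copy arriving from C.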
It is important that LSP sequence numbers not wrap around. (Protocols that do allow a numeric field to wrap
around usually have a clear-cut idea of the "active range" that can be used to conclude that the numbering has
wrapped rather than restarted; this is harder to do in the link-state context.) OSPF uses lollipop
sequence-numbering here: sequence numbers begin at -2^31 and increment to 2^31-1. At this point they wrap around
back to 0. Thus, as long as a sequence number is less than zero, it is guaranteed unique; at the same time,
routing will not cease if more than 2^31 updates are needed. Other link-state implementations use 64-bit
sequence numbers.
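One plausible comparison rule for lollipop sequence numbers is sketched below: values on the negative "stem" are ordered linearly and are older than any value on the non-negative "loop", while two loop values are compared circularly. This is an illustration of the idea, not OSPF's exact rule.

```python
N = 2**31    # size of the non-negative "loop" portion

def newer(a, b):
    """Return True if lollipop sequence number b is newer than a."""
    if a < 0:
        return b > a                 # stem: linear order; any loop value is newer
    if b < 0:
        return False                 # a is on the loop, b on the stem: b is older
    return 0 < (b - a) % N < N // 2  # loop values: circular comparison
```

The circular comparison means that 0 counts as newer than 2^31-1, so numbering can continue indefinitely once it reaches the loop.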
Actual link-state implementations often give link-state records a maximum lifetime; entries must be periodically renewed.
We start with current = A. At the end of the first stage, ⟨B,B,3⟩ is moved into R, T is {⟨D,D,12⟩}, and
current is B. The second stage adds ⟨C,B,5⟩ to T, and then moves this to R; current then becomes C. The
third stage introduces the route (from A) ⟨D,B,10⟩; this is an improvement over ⟨D,D,12⟩ and so replaces it
in T; at the end of the stage this route to D is moved to R.
A link-state source node S computes the entire path to a destination D. But as far as the actual path that a
packet sent by S will take to D, S has direct control only as far as the first hop N. While the accurate-cost
rule we considered in distance-vector routing will still hold, the actual path taken by the packet may differ
from the path computed at the source, in the presence of alternative paths of the same length. For example,
S may calculate a path S→N→A→D, and yet a packet may take path S→N→B→D, so long as the N→A→D and
N→B→D paths have the same length.
Link-state routing allows calculation of routes on demand (results are then cached), or larger-scale calculation.
Link-state also allows routes to be calculated with quality-of-service taken into account, via straightforward
extension of the algorithm above.
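The walkthrough above can be reproduced with a short Shortest Path First sketch. The graph below is one consistent with that walkthrough (links A–B:3, A–D:12, B–C:2, C–D:5); R holds finalized routes and T the tentative ones, as in the text.

```python
import heapq

def spf(graph, source):
    """Dijkstra's Shortest Path First, as used in link-state routing.

    graph[u] = {v: cost}.  Returns R, mapping each destination to
    (cost, first_hop from source).
    """
    R = {}
    T = [(c, v, v) for v, c in graph[source].items()]   # (cost, node, first_hop)
    heapq.heapify(T)
    while T:
        d, v, hop = heapq.heappop(T)
        if v in R:
            continue                      # a shorter route was already finalized
        R[v] = (d, hop)                   # v becomes `current`
        for w, c in graph[v].items():
            if w not in R and w != source:
                heapq.heappush(T, (d + c, w, hop))
    return R
```

On the graph above this yields exactly the routes of the walkthrough: ⟨B,B,3⟩, ⟨C,B,5⟩ and ⟨D,B,10⟩.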
routers knew nothing about. Instead, the forwarding table is split up into multiple ⟨dest, next_hop⟩ (or ⟨dest,
QoS, next_hop⟩) tables. One of these tables is the main table, and is the table that is updated by routing-update
protocols interacting with neighbors. Before a packet is forwarded, administratively supplied rules
are consulted to determine which table to apply; these rules are allowed to consult other packet attributes.
The collection of tables and rules is known as the routing policy database.
As a simple example, in the situation above the main table would have an entry ⟨default, L1⟩ (more precisely,
it would have the IP address of the far end of the L1 link instead of L1 itself). There would also be another
table, perhaps named "slow", with a single entry ⟨default, L2⟩. If a rule is created to have a packet routed
using the "slow" table, then that packet will be forwarded via L2. Here is one such linux rule, applying to
traffic from host [Link]:
ip rule add from [Link] table slow
Now suppose we want to route traffic to port 25 (the SMTP port) via L2. This is harder; linux provides no
support here for routing based on port numbers. However, we can instead use the iptables mechanism to
"mark" all packets destined for port 25, and then create a routing-policy rule to have such marked traffic
use the "slow" table. The mark is known as the forwarding mark, or fwmark; its value is 0 by default. The
fwmark is not actually part of the packet; it is associated with the packet only while the latter remains
within the kernel.
iptables --table mangle --append PREROUTING \
    --protocol tcp --destination-port 25 --jump MARK --set-mark 1
ip rule add fwmark 1 table slow
9.7 Epilog
At this point we have concluded the basics of IP routing, involving routing within large (relatively) homogeneous organizations such as multi-site corporations or Internet Service Providers. Every router involved
must agree to run the same protocol, and must agree to a uniform assignment of link costs.
At the very largest scales, these requirements are impractical. The next chapter is devoted to this issue of
very-large-scale IP routing, eg on the global Internet.
9.8 Exercises
1. Suppose the network is as follows, where distance-vector routing update is used. Each link has cost 1,
and each router has entries in its forwarding table only for its immediate neighbors (so A's table contains
⟨B,B,1⟩, ⟨D,D,1⟩ and B's table contains ⟨A,A,1⟩, ⟨C,C,1⟩).
(network diagram: the four routers A, B, C and D connected in a ring, A–B–C–D–A)
(a). Suppose each node creates a report from its initial configuration and sends that to each of its neighbors.
What will each node's forwarding table be after this set of exchanges? The exchanges, in other words, are
all conducted simultaneously; each node first sends out its own report and then processes the reports
arriving from its two neighbors.
(b). What will each node's table be after the simultaneous-and-parallel exchange process of part (a) is
repeated a second time?
Hint: you do not have to go through each exchange in detail; the only information added by an exchange is
additional reachability information.
2. Now suppose the configuration of routers has the link weights shown below.
(network diagram with link weights; the surviving labels include routers A, D and C and the weights 2 and 12)
(a). As in the previous exercise, give each node's forwarding table after each node exchanges with its
immediate neighbors simultaneously and in parallel.
(b). How many iterations of such parallel exchanges will it take before C learns to reach F via B; that is,
before it creates the entry ⟨F,B,11⟩? Count the answer to part (a) as the first iteration.
3. Suppose a router R has the following distance-vector table:

destination   cost   next hop
A             5      R1
B             6      R1
C             7      R2
D             8      R2
E             9      R3
R now receives the following report from R1; the cost of the R–R1 link is 1.
destination   cost
A             4
B             7
C             7
D             6
E             8
F             8
Give R's updated table after it processes R1's report. For each entry that changes, give a brief explanation,
in the style of 9.1.5 Example 4.
3.5. At the start of Example 3 (9.1.4 Example 3), we changed C's routing table so that it reached D via A
instead of via E: C's entry ⟨D,E,2⟩ was changed to ⟨D,A,2⟩. This meant that C had a valid route to D at the
start.
How might the scenario of Example 3 play out if C's table had not been altered? Give a sequence of reports
that leads to correct routing between D and E.
4. In the following exercise, A-D are routers and the attached networks N1-N6, which are the ultimate
destinations, are shown explicitly. Routers still exchange distance-vector reports with neighboring routers,
as usual. If a router has a direct connection to a network, you may report the next_hop as "direct", eg, from
A's table, ⟨N1,direct,0⟩
(network diagram: routers A through D with attached networks N1 through N6)
(a). Give the initial tables for A through D, before any distance-vector exchanges.
(b). Give the tables after each router A-D exchanges with its immediate neighbors simultaneously and in
parallel.
(c). At the end of (b), what networks are not known by what routers?
5. Suppose A, B, C, D and E are connected as follows. Each link has cost 1, and so each forwarding table
is uniquely determined; B's table is ⟨A,A,1⟩, ⟨C,C,1⟩, ⟨D,A,2⟩, ⟨E,C,2⟩. Distance-vector routing update is
used.
Now suppose the D–E link fails, and so D updates its entry for E to ⟨E,-,∞⟩.
6. Consider the network in [Link] Split Horizon:, using distance-vector routing updates.
(diagram: as in the split-horizon example, two routers each have the forwarding-table entry ⟨D,A,2⟩)
(a). What reports (a pair should suffice) will lead to the formation of the routing loop?
(b). What (single) report will eliminate the possibility of the routing loop?
7. Suppose the network of 9.2 Distance-Vector Slow-Convergence Problem is changed to the following.
Distance-vector update is used; again, the A–D link breaks.
(diagram: A, with entry ⟨D,D,1⟩, and B, with entry ⟨D,E,2⟩)
(a). Explain why B's report back to A, after A reports ⟨D,-,∞⟩, is now valid.
(b). Explain why hold down ([Link] Hold Down) will delay the use of the new route A→B→E→D.
8. Suppose the routers are A, B, C, D, E and F, and all link costs are 1. The distance-vector forwarding
tables for A and F are below. Give the network with the fewest links that is consistent with these tables.
Hint: any destination reached at cost 1 is directly connected; if X reaches Y via Z at cost 2, then Z and Y
must be directly connected.
A's table

dest   cost   next_hop
B      1      B
C      1      C
D      2      C
E      2      C
F      3      B
F's table

dest   cost   next_hop
A      3      E
B      2      D
C      2      D
D      1      D
E      1      E
9. (a) Suppose routers A and B somehow end up with respective forwarding-table entries ⟨D,B,n⟩ and
⟨D,A,m⟩, thus creating a routing loop. Explain why the loop may be removed more quickly if A and B both
use poison reverse with split horizon, versus if A and B use split horizon only.
(b). Suppose the network looks like the following. The A–B link is extremely slow.
(diagram: A and B, joined by the slow link, each connect to C; C connects to D)
Suppose A and B send reports to each other advertising their routes to D, and immediately afterwards the
C–D link breaks and C reports to A and B that D is unreachable. After those unreachability reports are
processed, A's and B's reports sent to each other before the break finally arrive. Explain why the network is
now in the state described in part (a).
10. Suppose the distance-vector algorithm is run on a network and no links break (so by the last paragraph
of 9.1.1 Distance-Vector Update Rules the next_hop-increase rule is never applied).
(a). Prove that whenever A is B's next_hop to destination D, then A's cost to D is strictly less than B's.
Hint: assume that if this claim is true, then it remains true after any application of the rules in
9.1.1 Distance-Vector Update Rules. If the lower-cost rule is applied to B after receiving a report from A,
resulting in a change to B's cost to D, then one needs to show A's cost is less than B's, and also B's new
cost is less than that of any neighbor C that uses B as its next_hop to D.
(b). Use (a) to prove that no routing loops ever form.
11. Give a scenario illustrating how a (very temporary!) routing loop might form in link-state routing.
12. Use the Shortest Path First algorithm to find the shortest path from A to E in the network below. Show
the sets R and T, and the node current, after each step.
13. Suppose you take a laptop, plug it into an Ethernet LAN, and connect to the same LAN via Wi-Fi. From
laptop to LAN there are now two routes. Which route will be preferred? How can you tell which way traffic
is flowing? How can you configure your OS to prefer one path or another?
10 LARGE-SCALE IP ROUTING
In the previous chapter we considered two classes of routing-update algorithms: distance-vector and link-state.
Each of these approaches requires that participating routers have agreed not just to a common protocol,
but also to a common understanding of how link costs are to be assigned. We will address this further below
in 10.6 Border Gateway Protocol, BGP, but the basic problem is that if one site prefers the hop-count
approach, assigning every link a cost of 1, while another site prefers to assign link costs in proportion to
their bandwidth, then path cost comparisons between the two sites simply cannot be done. In general, we
cannot even "translate" costs from one site to another, because the paths themselves depend on the cost
assignment strategy.
The term routing domain is used to refer to a set of routers under common administration, using a common
link-cost assignment. Another term for this is autonomous system. While use of a common routing-update
protocol within the routing domain is not an absolute requirement (for example, some subnets may
internally use distance-vector while the site's backbone routers use link-state), we can assume that all
routers have a uniform view of the site's topology and cost metrics.
One of the things included in the term "large-scale IP routing" is the coordination of routing between
multiple routing domains. Even in the earliest Internet there were multiple routing domains, if for no other
reason than that how to measure link costs was (and still is) too unsettled to set in stone. However, another
component of large-scale routing is support for hierarchical routing, above the level of subnets; we turn to
this next.
By the year 2000, CIDR had essentially eliminated the Class A/B/C mechanism from the backbone Internet,
and had more-or-less completely changed how backbone routing worked. You purchased an address block
from a provider or some other IP address allocator, and it could be whatever size you needed, from /27 to
/15.
What CIDR enabled is IP routing based on an address prefix of any length; the Class A/B/C mechanism of
course used fixed prefix lengths of 8, 16 and 24 bits. Furthermore, CIDR allows different routers, at different
levels of the backbone, to route on prefixes of different lengths.
CIDR was formally introduced by RFC 1518 and RFC 1519. For a while there were strategies in place to
support compatibility with non-CIDR-aware routers; these are now obsolete. In particular, it is no longer
appropriate for large-scale routers to fall back on the Class A/B/C mechanism in the absence of CIDR
information; if the latter is missing, the routing should fail.
The basic strategy of CIDR is to consolidate multiple networks going to the same destination into a single
entry. Suppose a router has four class Cs all to the same destination:
[Link]/24 foo
[Link]/24 foo
[Link]/24 foo
[Link]/24 foo
The router can replace all these with the single entry
[Link]/22 foo
It does not matter here if foo represents a single ultimate destination or if it represents four sites that just
happen to be routed to the same next_hop.
It is worth looking closely at the arithmetic to see why the single entry uses /22. This means that the first 22
bits must match [Link]; this is all of the first and second bytes and the first six bits of the third byte. Let
us look at the third byte of the network addresses above in binary:
200.7.000000 00.0/24 foo
200.7.000000 01.0/24 foo
200.7.000000 10.0/24 foo
200.7.000000 11.0/24 foo
The /24 means that the network addresses stop at the end of the third byte. The four entries above cover
every possible combination of the last two bits of the third byte; for an address to match one of the entries
above it suffices to begin 200.7 and then to have 0-bits as the first six bits of the third byte. This is another
way of saying the address must match [Link]/22.
Most implementations actually use a bitmask, eg [Link].00 (in hex) rather than the number 22; note 0xFC
= 1111 1100 with 6 leading 1-bits, so [Link].00 has 8+8+6=22 1-bits followed by 10 0-bits.
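The aggregation arithmetic can be checked mechanically; Python's ipaddress module will collapse the four /24s (written out here as 200.7.0.0 through 200.7.3.0, per the binary expansion above) into the single /22:

```python
import ipaddress

# the four /24 networks whose third bytes are 000000-00 through 000000-11
nets = [ipaddress.ip_network('200.7.%d.0/24' % i) for i in range(4)]

# collapse_addresses merges adjacent networks into the smallest covering set
merged = list(ipaddress.collapse_addresses(nets))
# merged is the single network 200.7.0.0/22, whose netmask 255.255.252.0
# has 8+8+6 = 22 leading 1-bits
```

Removing any one of the four /24s from `nets` would leave a "hole", and no single covering prefix would result.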
The IP delivery algorithm of 7.5 The Classless IP Delivery Algorithm still works with CIDR, with the
understanding that the router's forwarding table can now have a network-prefix length associated with any
entry. Given a destination D, we search the forwarding table for network-prefix destinations B/k until we
find a match; that is, equality of the first k bits. In terms of masks, given a destination D and a list of table
entries ⟨prefix,mask⟩ = ⟨B[i],M[i]⟩, we search for i such that (D & M[i]) = B[i].
It is possible to have multiple matches, and responsibility for avoiding this is much too distributed to be
declared illegal by IETF mandate. Instead, CIDR introduced the longest-match rule: if destination D
matches both B1/k1 and B2/k2, with k1 < k2, then the longer match B2/k2 is to be used. (Note that if
D matches two distinct entries B1/k1 and B2/k2, then either k1 < k2 or k2 < k1.)
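The delivery algorithm with the longest-match rule can be sketched using Python's ipaddress module. The table entries below are made-up illustrations; what matters is that a matching /24 entry beats a covering /22 entry.

```python
import ipaddress

def longest_match(table, dest):
    """Return the next_hop of the longest prefix in `table` matching dest.

    table: list of (network_string, next_hop) pairs.  A destination may
    match several entries; per the CIDR longest-match rule, the entry
    with the largest prefix length wins.
    """
    addr = ipaddress.ip_address(dest)
    best, best_len = None, -1
    for net, next_hop in table:
        net = ipaddress.ip_network(net)
        if addr in net and net.prefixlen > best_len:
            best, best_len = next_hop, net.prefixlen
    return best
```

Real routers implement this lookup with specialized data structures (or hardware), but the rule applied is the same.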
D: [Link]/16
E: [Link]/16
F: [Link]/16
G: [Link]/16
The routing model is that packets are first routed to the appropriate provider, and then to the customer.
While this model may not in general guarantee the shortest end-to-end path, it does in this case because
each provider has a single point of interconnection to the others. Here is the network diagram:
With this diagram, P0's forwarding table looks something like this:

destination     next_hop
[Link]/16       A
[Link]/16       B
[Link]/20       C
[Link]/8        P1
[Link]/8        P2
(table fragment: only the next_hop column, listing P0, P2, D, E and C, survives here)
This does work, but all C's inbound traffic except for that originating in P1 will now be routed through
C's ex-provider P0, which as an ex-provider may not be on the best of terms with C. Also, the routing is
inefficient: C's traffic from P2 is routed P2→P0→P1 instead of the more direct P2→P1.
A better solution is for all providers other than P1 to add the route ⟨[Link]/20, P1⟩. While traffic to
[Link]/8 otherwise goes to P0, this particular sub-block is instead routed by each provider to P1. The
important case here is P2, as a stand-in for all other providers and their routers: P2 routes [Link]/8 traffic
to P0 except for the block [Link]/20, which goes to P1.
Having every other provider in the world need to add an entry for C is going to cost some money, and, one
way or another, C will be the one to pay. But at least there is a choice: C can consent to renumbering (which
is not difficult if they have been diligent in using DHCP and perhaps NAT too), or they can pay to keep their
old address block.
As for the second diagram above, with the various private links (shown as dashed lines), it is likely that the
longest-match rule is not needed for these links to work. A's "private link to P1" might only mean that

• A can send outbound traffic via P1
• P1 forwards A's traffic to A via the private link

P2, in other words, is still free to route to A via P0. P1 may not advertise its route to A to anyone else.
The globally shortest path between A and B is via the r2–s2 crossover, with total length 6+1+5=12. However,
traffic from A to B will be routed by P1 to its closest crossover to P2, namely the r3–s3 link. The total path is
2+1+8+5=16. Traffic from B to A will be routed by P2 via the r1–s1 crossover, for a length of 2+1+7+6=16.
This routing strategy is sometimes called hot-potato routing; each provider tries to get rid of any traffic (the
potatoes) as quickly as possible, by routing to the closest exit point.
Not only are the paths taken inefficient, but the A→B and B→A paths are now asymmetric. This can be
a problem if forward and reverse timings are critical, or if one of P1 or P2 has significantly more bandwidth
or less congestion than the other. In practice, however, route asymmetry is of little consequence.
As for the route inefficiency itself, this also is not necessarily a significant problem; the primary reason
routing-update algorithms focus on the shortest path is to guarantee that all computed paths are loop-free.
As long as each half of a path is loop-free, and the halves do not intersect except at their common midpoint,
these paths too will be loop-free.
The BGP MED value ([Link] MULTI_EXIT_DISC) offers an optional mechanism for P1 to agree that
A→B traffic should take the r1–s1 crossover. This might be desired if P1's network were better and
customer A was willing to pay extra to keep its traffic within P1's network as long as possible.
than other (shorter) paths. It is much easier to string terrestrial cable than undersea cable. However, within
a continent physical distance does not always matter as much as might be supposed. Furthermore, a large
geographically spread-out provider can always divide up its address blocks by region, allowing internal
geographical routing to the correct region.
Here is a diagram of IP address allocation as of 2006: [Link]
The BGP speakers must maintain a database of all routes received, not just of the routes actually used.
However, the speakers exchange with neighbors only the routes they (and thus their AS) use themselves;
this is a firm BGP rule.
The current BGP standard is RFC 4271.
10.6.1 AS-paths
At its most basic level, BGP involves the exchange of lists of reachable destinations, like distance-vector
routing without the distance information. But that strategy, alone, cannot avoid routing loops. BGP solves
the loop problem by having routers exchange, not just destination information, but also the entire path used
to reach each destination. Paths including each router would be too cumbersome; instead, BGP abbreviates
the path to the list of ASs traversed; this is called the AS-path. This allows routers to make sure their routes
do not traverse any AS more than once, and thus do not have loops.
As an example of this, consider the network below, in which we consider Autonomous Systems also to be
destinations. Initially, we will assume that each AS discovers its immediate neighbors. AS3 and AS5 will
then each advertise to AS4 their routes to AS2, but AS4 will have no reason at this level to prefer one route
to the other (BGP does use the shortest AS-path as part of its tie-breaking rule, but, before falling back on
that rule, AS4 is likely to have a commercial preference for which of AS3 and AS5 it uses to reach AS2).
Also, AS2 will advertise to AS3 its route to reach AS1; that advertisement will contain the AS-path
⟨AS2,AS1⟩. Similarly, AS3 will advertise this route to AS4 and then AS4 will advertise it to AS5. When
AS5 in turn advertises this AS1-route to AS2, it will include the entire AS-path ⟨AS5,AS4,AS3,AS2,AS1⟩,
and AS2 would know not to use this route because it would see that it is a member of the AS-path. Thus,
BGP is spared the kind of slow-convergence problem that traditional distance-vector approaches were subject to.
It is theoretically possible that the shortest path (in the sense, say, of the hopcount metric) from one host to
another traverses some AS twice. If so, BGP will not allow this route.
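The loop-avoidance rule, prepend your own AS number when advertising and reject any route whose AS-path already contains you, is simple to sketch (an illustrative sketch; actual BGP messages carry much more than the AS-path):

```python
def advertise(my_as, as_path):
    """Prepend our AS number when passing a route on to a neighbor."""
    return [my_as] + as_path

def accept(my_as, as_path):
    """Reject any route whose AS-path already contains our own AS;
    this is what prevents inter-AS routing loops."""
    return my_as not in as_path
```

In the example above, AS2 rejects the AS1-route advertised by AS5 because it finds itself in ⟨AS5,AS4,AS3,AS2,AS1⟩.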
AS-paths potentially add considerably to the size of the AS database. The number of paths a site must keep
track of is proportional to the number of ASs, because there will be one AS-path to each destination AS.
(Actually, an AS may have to record many times that many AS-paths, as an AS may hear of AS-paths that
it elects not to use.) Typically there are several thousand ASs in the world. Let A be the number of ASs.
Typically the average length of an AS-path is about log(A), although this depends on connectivity. The
amount of memory required by BGP is
C·A·log(A) + K·N,
where C and K are constants.
The other major goal of BGP is to allow some degree of administrative input to what, for interior routing,
is largely a technical calculation (though an interior-routing administrator can set link costs). BGP is the
interface between large ISPs, and can be used to implement contractual agreements made regarding which
ISPs will carry other ISPs' traffic. If ISP2 tells ISP1 it has a good route to destination D, but ISP1 chooses
not to send traffic to ISP2, BGP can be used to implement this.
Despite the exchange of AS-path information, temporary routing loops may still exist. This is because BGP
may first decide to use a route and only then export the new AS-path; the AS on the other side may realize
there is a problem as soon as the AS-path is received but by then the loop will have at least briefly been in
existence. See the first example below in 10.6.8 Examples of BGP Instability.
BGPs predecessor was EGP, which guaranteed loop-free routes by allowing only a single route to any AS,
thus forcing the Internet into a tree topology, at least at the level of Autonomous Systems. The AS graph
could contain no cycles or alternative routes, and hence there could be no redundancy provided by alternative
paths. EGP also thus avoided having to make decisions as to the preferred path; there was never more than
one choice. EGP was sometimes described as a reachability protocol; its only concern was whether a given
network was reachable.
AS-sequence=AS2
AS-set={AS3,AS4}
AS2 thus both achieves the desired aggregation and also accurately reports the AS-path length.
The AS-path can in general be an arbitrary list of AS-sequence and AS-set parts, but in cases of simple
aggregation such as the example here, there will be one AS-sequence followed by one AS-set.
RFC 6472 now recommends against using AS-sets entirely, and recommends that aggregation as above be
avoided.
As an example of import filtering, a site might elect to ignore all routes from a particular neighbor, or to
ignore all routes whose AS-path contains a particular AS, or to ignore temporarily all routes from a neighbor
that has demonstrated too much recent route instability (that is, rapidly changing routes). Import filtering
can also be done in the best-path-selection stage. Finally, while it is not commonly useful, import filtering
can involve rather strange criteria; for example, in 10.6.8 Examples of BGP Instability we will consider
examples where AS1 prefers routes with AS-path ⟨AS3,AS2⟩ to the strictly shorter path ⟨AS2⟩.
The next stage is best-path selection, for which the first step is to eliminate AS-paths with loops. Even if the
neighbors have been diligent in not advertising paths with loops, an AS will still need to reject routes that
contain itself in the associated AS-path.
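This loop-rejection step can be sketched as follows; the class and method names are hypothetical, and an AS-path is simplified here to a list of AS numbers:

```java
import java.util.Arrays;
import java.util.List;

public class AsPathLoopCheck {
    // Reject a received route if our own AS number already appears in its AS-path;
    // accepting it would create (or perpetuate) a routing loop.
    static boolean acceptable(int myAS, List<Integer> asPath) {
        return !asPath.contains(myAS);
    }

    public static void main(String[] args) {
        // AS1 receives two advertisements for some destination:
        List<Integer> clean  = Arrays.asList(2, 7, 9);  // AS-path <AS2,AS7,AS9>
        List<Integer> looped = Arrays.asList(2, 1, 9);  // AS-path <AS2,AS1,AS9> contains AS1
        System.out.println(acceptable(1, clean));   // true: may proceed to best-path selection
        System.out.println(acceptable(1, looped));  // false: rejected outright
    }
}
```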
The next step in the best-path-selection stage, generally the most important in BGP configuration, is to assign
a local_preference, or weight, to each route received. An AS may have policies that add a certain amount to
the local_preference for routes that use a certain AS, etc. Very commonly, larger sites will have preferences
based on contractual arrangements with particular neighbors. Provider ASs, for example, will in general
prefer routes that go through their customers, as these are cheaper. A smaller ISP that connects to two
or more larger ones might be paying to route almost all its outbound traffic through a particular one of the
two; its local_preference values will then implement this choice. After BGP calculates the local_preference
value for every route, the routes with the best local_preference are then selected.
Domains are free to choose their local_preference rules however they wish. Some choices may lead to
instability, below, so domains are encouraged to set their rules in accordance with some standard principles,
also below.
In the event of ties (two routes to the same destination with the same local_preference), a first tie-breaker rule is to prefer the route with the shorter AS-path. While this superficially resembles a shortest-path algorithm, the real work should have been done in administratively assigning local_preference values.
Local_preference values are communicated internally via the LOCAL_PREF path attribute, below. They
are not shared with other Autonomous Systems.
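The selection order so far (highest local_preference first, shorter AS-path as tie-breaker) can be sketched as a comparator; the class, route fields and numeric values here are hypothetical illustrations, not any real BGP implementation:

```java
import java.util.*;

public class BestPath {
    static class Route {
        String viaAS; int localPref; int asPathLen;
        Route(String viaAS, int localPref, int asPathLen) {
            this.viaAS = viaAS; this.localPref = localPref; this.asPathLen = asPathLen;
        }
    }

    // Higher local_preference wins; the shorter AS-path breaks ties.
    static Route best(List<Route> routes) {
        return Collections.max(routes, Comparator
            .comparingInt((Route r) -> r.localPref)
            .thenComparingInt(r -> -r.asPathLen));   // negate: shorter path = larger key
    }

    public static void main(String[] args) {
        List<Route> routes = Arrays.asList(
            new Route("AS2", 100, 2),   // default preference, short path
            new Route("AS3", 110, 4),   // administratively preferred despite longer path
            new Route("AS4", 100, 3));
        System.out.println(best(routes).viaAS);   // AS3: local_preference dominates path length
    }
}
```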
The final significant step of the route-selection phase is to apply the Multi_Exit_Discriminator value; we
postpone this until below. A site may very well choose to ignore this value entirely. There may then be
additional trivial tie-breaker rules; note that if a tie-breaker rule assigns significant traffic to one AS over
another, then it may have significant economic consequences and shouldn't be considered trivial. If this
situation is detected, it would probably be addressed in the local-preferences phase.
After the best-path-selection stage is complete, the BGP speaker has now selected the routes it will use. The
final stage is to decide which routes will be exported to which neighbors. Only routes the BGP speaker will use (that is, routes that have made it to this point) can be exported; a site cannot route to destination D through AS1 but export a route claiming D can be reached through AS2.
It is at the export-filtering stage that an AS can enforce no-transit rules. If it does not wish to carry transit
traffic to destination D, it will not advertise D to any of its AS-neighbors.
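A minimal sketch of export filtering enforcing such a no-transit rule might look like the following; the destinations and the noTransit set are hypothetical:

```java
import java.util.*;

public class ExportFilter {
    // Sketch: from the selected routes, advertise only those destinations for
    // which we are willing to carry transit traffic.
    static Map<String, List<Integer>> export(Map<String, List<Integer>> selected,
                                             Set<String> noTransit) {
        Map<String, List<Integer>> advertised = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : selected.entrySet())
            if (!noTransit.contains(e.getKey()))
                advertised.put(e.getKey(), e.getValue());
        return advertised;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> selected = new HashMap<>();
        selected.put("D", Arrays.asList(2, 0));   // reached via AS-path <AS2,AS0>
        selected.put("E", Arrays.asList(3, 0));   // reached via AS-path <AS3,AS0>
        Set<String> noTransit = Collections.singleton("D");
        // Only E is advertised; neighbors never learn we can reach D.
        System.out.println(export(selected, noTransit).keySet());
    }
}
```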
The export stage can lead to anomalies. Suppose, for example, that AS1 reaches D and AS5 via AS2, and
announces this to AS4.
Later AS1 switches to reaching D via AS3, but AS1 is forbidden by policy to announce AS3-paths to AS4.
Then AS1 must simply withdraw the announcement to AS4 that it could reach D at all, even though the route
to D via AS2 is still there.
[Link] LOCAL_PREF
If one BGP speaker in an AS has been configured with local_preference values, used in the best-path-selection phase above, it uses the LOCAL_PREF path attribute to share those preferences with all other
BGP speakers at a site.
[Link] MULTI_EXIT_DISC
The Multi-Exit Discriminator, or MED, attribute allows one AS to learn something of the internal structure
of another AS, should it elect to do so. Using the MED information provided by a neighbor has the potential
to cause an AS to incur higher costs, as it may end up carrying traffic for longer distances internally; MED
values received from a neighboring AS are therefore only recognized when there is an explicit administrative
decision to do so.
Specifically, if an autonomous system AS1 has multiple links to neighbor AS2, then AS1 can, when advertising an internal destination D to AS2, have each of its BGP speakers provide associated MED values so
that AS2 can know which link AS1 would prefer that AS2 use to reach D. This allows AS2 to route traffic
to D so that it is carried primarily by AS2 rather than by AS1. The alternative is for AS2 to use only the
closest gateway to AS1, which means traffic is likely carried primarily by AS1.
MED values are considered late in the best-path-selection process; in this sense the use of MED values is a
tie-breaker when two routes have the same local_preference.
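A sketch of the MED comparison itself, assuming the receiving AS has administratively enabled it: among otherwise-tied routes from the same neighbor, the lowest advertised MED wins. The router names match the example below, but the value for R3 is an assumption made purely for illustration:

```java
import java.util.*;

public class MedTieBreak {
    // Among routes tied on local_preference and AS-path length, received from the
    // same neighbor AS, pick the entry point whose advertised MED is lowest.
    static String lowestMed(Map<String, Integer> medByLink) {
        return Collections.min(medByLink.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        Map<String, Integer> med = new HashMap<>();
        med.put("R1", 200);
        med.put("R2", 150);
        med.put("R3", 100);   // hypothetical value, for illustration
        System.out.println(lowestMed(med));  // R3: the neighbor prefers inbound traffic here
    }
}
```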
As an example, consider the following network (from 10.4.3 Provider-Based Hierarchical Routing, with
providers now replaced by Autonomous Systems); the numeric values on links are their relative costs. We
will assume that border routers R1, R2 and R3 are also AS1's BGP speakers.
In the absence of the MED, AS1 will send traffic from A to B via the R3–S3 link, and AS2 will return the traffic via S1–R1. These are the links that are closest to R and S, respectively, representing AS1's and AS2's desire to hand off the outbound traffic as quickly as possible.
However, AS1's R1, R2 and R3 can provide MED values to AS2 when advertising destination A, indicating a preference for AS2→AS1 traffic to use the rightmost link:
R1: destination A has MED 200
R2: destination A has MED 150
the effect of allowing the original AS to configure itself without involving the receiving AS in the process. Communities are often used, for example, by (large) customers of an ISP to request specific routing
treatment.
A customer would have to find out from the provider what communities the provider defines, and what their
numeric codes are. At that point the customer can place itself into the provider's community at will.
Here are some of the community values once supported by a no-longer-extant ISP that we shall call AS1.
The full community value would have included AS1's AS-number.
value   action
90      set local_preference used by AS1 to 90
100     set local_preference used by AS1 to 100, the default
105     set local_preference used by AS1 to 105
110     set local_preference used by AS1 to 110
990     the route will not leave AS1's domain; equivalent to NO_EXPORT
991     route will only be exported to AS1's other customers
If A and B fully advertise link1, by exporting to their respective ISPs routes to each other, then ISP1 (paid by A) may end up carrying much of B's traffic or ISP2 (paid by B) may end up carrying much of A's traffic. Economically, these options are not desirable unless fully agreed to by both parties. The primary issue here is the use of the ISP1–A link by B, and the ISP2–B link by A; use of the shared link1 might be a secondary issue depending on the relative bandwidths and A and B's understandings of appropriate uses for link1.
Three common options A and B might agree to regarding link1 are no-transit, backup, and load-balancing.
For the no-transit option, A and B simply do not export the route to their respective ISPs at all. This is done
via export filtering. If ISP1 does not know A can reach B, it will not send any of Bs traffic to A.
For the backup option, the intent is that traffic to A will normally arrive via ISP1, but if the ISP1 link is down then A's traffic will be allowed to travel through ISP2 and B. To achieve this, A and B can export their link1 route to each other, but arrange for ISP1 and ISP2 respectively to assign this route a low local_preference value. As long as ISP1 hears of a route to B from its upstream provider, it will reach B that way, and will not advertise the existence of the link1 route to B; ditto ISP2. However, if the ISP2 route to B fails, then ISP1's upstream provider will stop advertising any route to B, and so ISP1 will begin to use the link1 route to B and begin advertising it to the Internet. The link1 route will be the primary route to B until ISP2's service is restored.
A and B must convince their respective ISPs to assign the link1 route a low local_preference; they cannot
mandate this directly. However, if their ISPs recognize community attributes that, as above, allow customers
to influence their local_preference value, then A and B can use this to create the desired local_preference.
For outbound traffic, A and B will need a way to send through one another if their own ISP link is down. One approach is to consider their default-route path (eg to 0.0.0.0/0) to be a concrete destination within BGP. ISP1 advertises this to A, using A's interior routing protocol, but so does B, and A has configured things so B's route has a higher cost. Then A will route to 0.0.0.0/0 through ISP1 (that is, will use ISP1 as its default route) as long as it is available, and will switch to B when it is not.
For inbound load balancing, there is no easy fix, in that if ISP1 and ISP2 both export routes to A, then A has
lost all control over how other sites will prefer one to the other. A may be able to make one path artificially
appear more expensive, and keep tweaking this cost until the inbound loads are comparable. Outbound
load-balancing is up to A and B.
Another basic policy question is which of the two available paths site (or regional AS) A uses to reach site
D, in the following diagram. B and C are Autonomous Systems.
How can A express preference for B over C, assuming B and C both advertise to A their routes to D?
Generally A will use a local_preference setting to make the carrier decision for A→D traffic, though it is D that makes the decision for the D→A traffic. It is possible (though not customary) for one of the transit providers to advertise to A that it can reach D, but not advertise to D that it can reach A.
Here is a similar diagram, showing two transit-providing Autonomous Systems B and C connecting at
Internet exchange points IXP1 and IXP2.
B and C each have routers within each IXP. B would probably like to make sure C does not attempt to save on its long-haul transit costs by forwarding A→D traffic over to B at IXP1, and D→A traffic over to B at IXP2. B can avoid this problem by not advertising to C that it can reach A and D. In general, transit
providers are often quite careful about advertising reachability to any other AS for whom they do not intend
to provide transit service, because to do so may implicitly mean getting stuck with that traffic.
If B and C were both to try to get away with this, a routing loop would be created within IXP1! But in that case in B's next advertisement to C at IXP1, B would state that it reaches D via AS-path ⟨C⟩ (or ⟨C,D⟩ if D were a full-fledged AS), and C would do similarly; the loop would not continue for long.
Following these rules creates a simplified BGP world. Special cases for special situations have the potential
to introduce non-convergence or instability.
The so-called tier-1 providers are those that are not customers of anyone; these represent the top-level
backbone providers. Each tier-1 AS must, as a rule, peer with every other tier-1 AS.
A consequence of the use of the above classification and attendant export rules is the no-valley theorem [LG01]: if every AS has BGP policies consistent with the scheme above, then when we consider the full AS-path from A to B, there is at most one peer-peer link. Those to the left of the peer-peer link are (moving from left to right) either customer→provider links or sibling→sibling links; that is, they are non-downwards (ie upwards or level). To the right of the peer-peer link, we see provider→customer or sibling→sibling links; that is, these are non-upwards. If there is no peer-peer link, then we can still divide the AS-path into a non-downwards first part and a non-upwards second part.
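The no-valley condition on a single AS-path can be expressed as a short check; the link labels here ("up", "down", "peer", "flat") are hypothetical shorthand for the four relationship types, taken in the direction of travel:

```java
public class NoValleyCheck {
    // Link labels along an AS-path, in travel direction:
    //   "up"   = customer→provider     "down" = provider→customer
    //   "peer" = peer-peer             "flat" = sibling-sibling
    // Valley-free: a non-downwards prefix, at most one peer-peer link at the top,
    // then a non-upwards suffix.
    static boolean valleyFree(String[] links) {
        int phase = 0;   // 0 = non-downwards part, 1 = non-upwards part
        for (String l : links) {
            if (l.equals("peer")) {
                if (phase == 1) return false;  // peer link must sit at the top
                phase = 1;
            } else if (l.equals("down")) {
                phase = 1;
            } else if (l.equals("up") && phase == 1) {
                return false;  // climbing again after descending: a valley
            }
            // "flat" (sibling-sibling) is allowed in either part
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(valleyFree(new String[]{"up","up","peer","down","down"})); // true
        System.out.println(valleyFree(new String[]{"up","down","up"}));               // false
        System.out.println(valleyFree(new String[]{"down","peer"}));                  // false
    }
}
```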
The above constraints are not quite sufficient to guarantee convergence of the BGP system to a stable set
of routes. To ensure convergence in the case without sibling relationships, it is shown in [GR01] that the
following simple local_preference rule suffices:
If AS1 gets two routes r1 and r2 to a destination D, and the first AS of the r1 route is a customer
of AS1, and the first AS of r2 is not, then r1 will be assigned a higher local_preference value
than r2.
More complex rules exist that allow for cases when the local_preference values can be equal; one such rule
states that strict inequality is only required when r2 is a provider route. Other straightforward rules handle
the case of sibling relationships, eg by requiring that siblings have local_preference rules consistent with the
use of their shared connection only for backup.
As a practical matter, unstable BGP arrangements appear rare on the Internet; most actual relationships and
configurations are consistent with the rules above.
That is, ⟨AS2,AS0⟩ is preferred to the direct path ⟨AS0⟩ (one way to express this preference might be to prefer routes for which the AS-path begins with AS2; perhaps the AS1–AS0 link is more expensive). Similarly, we assume AS2 prefers paths to D in the order ⟨AS1,AS0⟩, ⟨AS0⟩. Both AS1 and AS2 start out using path ⟨AS0⟩; they advertise this to each other. As each receives the other's advertisement, they apply their preference order and therefore each switches to routing D's traffic to the other; that is, AS1 switches to the route with AS-path ⟨AS2,AS0⟩ and AS2 switches to ⟨AS1,AS0⟩. This, of course, causes a routing loop! However, as soon as they export these paths to one another, they will detect the loop in the AS-path and reject the new route, and so both will switch back to ⟨AS0⟩ as soon as they announce to each other the change in what they use.
This oscillation may continue indefinitely, as long as both AS1 and AS2 switch away from ⟨AS0⟩ at the same moment. If, however, AS1 switches to ⟨AS2,AS0⟩ while AS2 continues to use ⟨AS0⟩, then AS2 is stuck and the situation is stable. In practice, therefore, eventual convergence to a stable state is likely. AS1 and AS2 might choose not to export their D-route to each other to avoid this instability.
Example 2: No stable state exists. This example is from [VGE00]. Assume that the destination D is attached
to AS0, and that AS0 in turn connects to AS1, AS2 and AS3 as in the following diagram:
AS1–AS3 each have a direct route to AS0, but we assume each prefers the AS-path that takes their clockwise neighbor; that is, AS1 prefers ⟨AS3,AS0⟩ to ⟨AS0⟩; AS3 prefers ⟨AS2,AS0⟩ to ⟨AS0⟩, and AS2 prefers ⟨AS1,AS0⟩ to ⟨AS0⟩. This is a peculiar, but legal, example of input filtering.
Suppose all adopt ⟨AS0⟩, and advertise this, and AS1 is the first to look at the incoming advertisements. AS1 switches to the route ⟨AS3,AS0⟩, and announces this.
At this point, AS2 sees that AS1 uses ⟨AS3,AS0⟩; if AS2 switches to AS1 then its path would be ⟨AS1,AS3,AS0⟩ rather than ⟨AS1,AS0⟩, and so it does not make the switch.
But AS3 does switch: it prefers ⟨AS2,AS0⟩ and this is still available. Once it makes this switch, and advertises it, AS1 sees that the route it had been using, ⟨AS3,AS0⟩, has become ⟨AS3,AS2,AS0⟩. At this point AS1 switches back to ⟨AS0⟩.
Now AS2 can switch to using ⟨AS1,AS0⟩, and does so. After that, AS3 finds it is now using ⟨AS2,AS1,AS0⟩ and it switches back to ⟨AS0⟩. This allows AS1 to switch to the longer route, and then AS2 switches back to the direct route, and then AS3 gets the longer route, then AS2 again, etc, forever rotating clockwise.
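The endless clockwise rotation can be simulated directly; this sketch (all names hypothetical) hard-codes the three preference rules, lets AS1, AS2 and AS3 react in turn to the most recent advertisements, and stops when a state repeats, confirming that no stable assignment is ever reached:

```java
import java.util.*;

public class BgpOscillation {
    // Returns the sequence of routing states visited until one repeats (with the
    // same AS about to move), ie until the system is provably cycling forever.
    static List<String> run() {
        // via[i]==0: ASi uses the direct path <AS0>; via[i]==n: ASi uses <ASn,AS0>.
        int[] via = {0, 0, 0, 0};
        int[] prefer = {0, 3, 1, 2};   // AS1 prefers via AS3, AS2 via AS1, AS3 via AS2
        List<String> states = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        int as = 1;
        while (true) {
            int n = prefer[as];
            // <ASn,AS0> is preferred to <AS0>, but is only on offer while ASn itself
            // uses the direct route; otherwise ASi falls back to <AS0>.
            via[as] = (via[n] == 0) ? n : 0;
            String state = Arrays.toString(Arrays.copyOfRange(via, 1, 4));
            states.add("AS" + as + "->" + state);
            if (!seen.add(state + "@" + as)) break;   // same state, same mover: a cycle
            as = as % 3 + 1;   // AS1, AS2, AS3 take turns reacting
        }
        return states;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
        System.out.println("state repeated: the system oscillates forever");
    }
}
```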
10.7 Epilog
CIDR was a deceptively simple idea. At first glance it is a straightforward extension of the subnet concept,
moving the net/host division point to the left as well as to the right. But it has ushered in true hierarchical
routing, most often provider-based. While CIDR was originally offered as a solution to some early crises in
IPv4 address-space allocation, it has been adopted into the core of IPv6 routing as well.
Interior routing, using either distance-vector or link-state protocols, is neat and mathematical. Exterior routing, with BGP, is messy and arbitrary. Perhaps the most surprising thing about BGP is that the Internet
works as well as it does, given the complexity of provider interconnections. The business side of routing
almost never has an impact on ordinary users. To an extent, BGP works well because providers voluntarily limit the complexity of their filtering preferences, but that seems to be largely because the business
relationships of real-world ISPs do not seem to require complex filtering.
10.8 Exercises
1. Consider the following IP forwarding table that uses CIDR. IP address bytes are in hexadecimal, so each
hex digit corresponds to four address bits.
destination     next_hop
[Link]/12       A
81.3c.0.0/16    B
81.3c.50.0/20   C
[Link]/12       D
[Link]/14       E

2. Consider the following IP forwarding table, with address bytes again in hexadecimal:

destination     next_hop
00.0.0.0/2      A
40.0.0.0/2      B
80.0.0.0/2      C
c0.0.0.0/2      D
(a). To what next_hop would each of the following be routed? 63.b1.82.15, 9e.00.15.01, [Link]
(b). Explain why every IP address is routed somewhere, even though there is no default entry.
3. Give an IPv4 forwarding table using CIDR that will route all Class A addresses to next_hop A, all
Class B addresses to next_hop B, and all Class C addresses to next_hop C.
4. Suppose a router using CIDR has the following entries. Address bytes are in decimal except for the third
byte, which is in binary.
destination             next_hop
37.119.0000 0000.0/18   A
37.119.0100 0000.0/18   A
37.119.1000 0000.0/18   A
37.119.1100 0000.0/18   B
These four entries cannot be consolidated into a single /16 entry, because they don't all go to the same
next_hop. How could they be consolidated into two entries?
5. Suppose P, Q and R are ISPs with respective CIDR address blocks (with bytes in decimal) [Link]/8, [Link]/8 and [Link]/8. P has customers A and B and assigns them address blocks as follows:
A: [Link]/16
B: [Link]/16
Q has customers C and D and assigns them address blocks as follows:
C: [Link]/16
D: [Link]/16
(a). Give forwarding tables for P, Q and R assuming they connect to each other and to each of their own
customers.
(b). Now suppose A switches from provider P to provider Q, and takes its address block with it. Give the
forwarding tables for P, Q and R; the longest-match rule will be needed to resolve conflicts.
(c). Now suppose in addition to A switching from P to Q, C switches from provider Q to provider R. Give
the forwarding tables.
6. Suppose P, Q and R are ISPs as in the previous problem. P and R do not connect directly; they route traffic
to one another via Q. In addition, customer B is multi-homed and has a secondary connection to provider R;
customer D is also multi-homed and has a secondary connection to provider P. R and P use these secondary
connections to send to B and D respectively; however, these secondary connections are not advertised to
other providers. Give forwarding tables for P, Q and R.
7. Consider the following network of providers P-S, all using BGP. The providers are the horizontal lines;
each provider is its own AS.
(a). What routes to network NS will P receive, assuming there is no export filtering? For each route, list the
AS-path.
(b). What routes to network NQ will P receive? For each route, list the AS-path.
(c). Suppose R uses export filtering so as not to advertise to P any of its routes except those that involve S
in their AS-path. What routes to network NR will P receive, with AS-paths?
8. Consider the following network of Autonomous Systems AS1 through AS6, which double as destinations.
When AS1 advertises itself to AS2, for example, the AS-path it provides is ⟨AS1⟩.
[Figure: Autonomous Systems AS1 through AS6; AS2 and AS5 are the left-hand neighbors of AS3 and AS6 respectively, and the AS3–AS6 link is shown dashed.]
(a). If neither AS3 nor AS6 exports their AS3–AS6 link to their neighbors AS2 and AS5 to the left, what routes will AS2 receive to reach AS5? Specify routes by AS-path.
(b). What routes will AS2 receive to reach AS6?
(c). Suppose AS3 exports to AS2 its link to AS6, but AS6 continues not to export the AS3–AS6 link to AS5. How will AS5 now reach AS2? How will AS2 now reach AS6? Assume that there are no local preferences in use in BGP best-path selection, and that the shortest AS-path wins.
9. Suppose that Internet routing in the US used geographical routing, and the first 12 bits of every IP
address represent a geographical area similar in size to a telephone area code. Megacorp gets the prefix
[Link]/16, based geographically in Chicago, and allocates subnets from this prefix to its offices in all 50
states. Megacorp routes all its internal traffic over its own network.
(a). Assuming all Megacorp traffic must enter and exit in Chicago, what is the route of traffic to and from
the San Diego office to a client also in San Diego?
(b). Now suppose each office has its own link to a local ISP, but still uses its [Link]/16 IP addresses.
Now what is the route of traffic between the San Diego office and its neighbor?
(c). Suppose Megacorp gives up and gets a separate geographical prefix for each office. What must it do to
ensure that its internal traffic is still routed over its own network?
10. Suppose we try to use BGPs strategy of exchanging destinations plus paths as an interior routing-update
strategy, perhaps replacing distance-vector routing. No costs or hop-counts are used, but routers attach to
each destination a list of the routers used to reach that destination. Routers can also have route preferences,
such as prefer my link to B whenever possible.
(a). Consider the network of 9.2 Distance-Vector Slow-Convergence Problem:
[Figure: the linear network D–A–B of 9.2.]
The D–A link breaks, and B offers A what it thinks is its own route to D. Explain how exchanging path
information prevents a routing loop here.
(b). Suppose the network is as below, and initially each router knows about itself and its immediately
adjacent neighbors. What sequence of router announcements can lead to A reaching F via A→D→E→B→C→F, and what individual router preferences would be necessary? (Initially, for example, A would reach B directly; what preference might make it prefer A→D→E→B?)
(c). Explain why this method is equivalent to using the hopcount metric with either distance-vector or
link-state routing, if routers are not allowed to have preferences and if the router-path length is used as a
tie-breaker.
11 UDP TRANSPORT
The standard transport protocols riding above the IP layer are TCP and UDP. As we saw in Chapter 1, UDP
provides simple datagram delivery to remote sockets, that is, to ⟨host,port⟩ pairs. TCP provides a much
richer functionality for sending data, but requires that the remote socket first be connected. In this chapter,
we start with the much-simpler UDP, including the UDP-based Trivial File Transfer Protocol.
We also review some fundamental issues any transport protocol must address, such as lost final packets and
packets arriving late enough to be subject to misinterpretation upon arrival. These fundamental issues will
be equally applicable to TCP connections.
The port numbers are what makes UDP into a real transport protocol: with them, an application can now
connect to an individual server process (that is, the process owning the port number in question), rather
than simply to a host.
UDP is unreliable, in that there is no UDP-layer attempt at timeouts, acknowledgment and retransmission;
applications written for UDP must implement these. As with TCP, a UDP ⟨host,port⟩ pair is known as a socket (though UDP ports are considered a separate namespace from TCP ports). UDP is also unconnected, or stateless; if an application has opened a port on a host, any other host on the Internet may deliver packets to that ⟨host,port⟩ socket without preliminary negotiation.
UDP packets use the 16-bit Internet checksum (5.4 Error Detection) on the data. While it is seldom done
now, the checksum can be disabled and the field set to the all-0-bits value, which never occurs as an actual
ones-complement sum.
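For reference, here is a sketch of the 16-bit Internet checksum computation; the data words are taken from the worked example in RFC 1071. Per RFC 768, a computed checksum of 0 is transmitted as 0xFFFF (its ones'-complement equivalent), which is why a transmitted all-0 field can unambiguously mean "checksum disabled":

```java
public class InternetChecksum {
    // 16-bit Internet checksum: ones'-complement sum of 16-bit words, complemented.
    // A computed value of 0 is substituted with 0xFFFF before transmission.
    static int checksum(int[] words) {
        int sum = 0;
        for (int w : words) {
            sum += w & 0xFFFF;
            if (sum > 0xFFFF) sum = (sum & 0xFFFF) + 1;  // end-around carry
        }
        int csum = ~sum & 0xFFFF;
        return (csum == 0) ? 0xFFFF : csum;
    }

    public static void main(String[] args) {
        // Worked example from RFC 1071: bytes 00 01 f2 03 f4 f5 f6 f7
        int[] data = {0x0001, 0xf203, 0xf4f5, 0xf6f7};
        System.out.printf("%04x%n", checksum(data));   // prints "220d"
    }
}
```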
UDP packets can be dropped due to queue overflows either at an intervening router or at the receiving host.
When the latter happens, it means that packets are arriving faster than the receiver can process them. Higher-level protocols that define ACK packets (eg UDP-based RPC, below) typically include some form of flow
control to prevent this.
UDP is popular for local transport, confined to one LAN. In this setting it is common to use UDP as the
transport basis for a Remote Procedure Call, or RPC, protocol. The conceptual idea behind RPC is that
one host invokes a procedure on another host; the parameters and the return value are transported back and
forth by UDP. We will consider RPC in greater detail below, in 11.4 Remote Procedure Call (RPC); for
now, the point of UDP is that on a local LAN we can fall back on rather simple mechanisms for timeout and
retransmission.
UDP is well-suited for request-reply semantics beyond RPC; one can use TCP to send a message and get
a reply, but there is the additional overhead of setting up and tearing down a connection. DNS uses UDP,
largely for this reason. However, if there is any chance that a sequence of request-reply operations will be
performed in short order then TCP may be worth the overhead.
UDP is also popular for real-time transport; the issue here is head-of-line blocking. If a TCP packet is lost,
then the receiving host queues any later data until the lost data is retransmitted successfully, which can take
several RTTs; there is no option for the receiving application to request different behavior. UDP, on the
other hand, gives the receiving application the freedom simply to ignore lost packets. This approach is very
successful for voice and video, where small losses simply degrade the received signal slightly, but where
larger delays are intolerable. This is the reason the Real-time Transport Protocol, or RTP, is built on top
of UDP rather than TCP. It is common for VoIP telephone calls to use RTP and UDP.
11.1.1 QUIC
Sometimes UDP is used simply because it allows new or experimental protocols to run entirely as user-space
applications; no kernel updates are required, as would be the case with TCP changes. Google has created
a protocol named QUIC (Quick UDP Internet Connections, [Link]/quic) in this category, though
QUIC also takes advantage of UDP's freedom from head-of-line blocking. For example, one of QUIC's
goals includes supporting multiplexed streams in a single connection (eg for the multiple components of a
web page). A lost packet blocks its own stream until it is retransmitted, but the other streams can continue
without waiting. Because QUIC supports error-correcting codes (5.4.2 Error-Correcting Codes), a lost
packet might not require any waiting at all; this is another feature that would be difficult to add to TCP.
QUIC also eliminates the extra RTT needed for setting up a TCP connection.
QUIC provides support for advanced congestion control, currently (2014) including a UDP analog of TCP CUBIC (15.11 TCP CUBIC). QUIC does this at the application layer, but new congestion-control mechanisms within TCP often require client operating-system changes even when the mechanism lives primarily
at the server end. QUIC represents a promising approach to using UDP's flexibility to support innovative
or experimental transport-layer features. The downside of QUIC is its nonstandard programming interface,
but note that Google can achieve widespread web utilization of QUIC simply by distributing the client side
in its Chrome browser.
address will form the socket address to which clients connect. Clients must discover that port number or
have it written into their application code. Clients too will have a port number, but it is largely invisible.
On the server side, simplex-talk must do the following:
• ask for a designated port number
• create a socket, the sending/receiving endpoint
• bind the socket to the socket address, if this is not done at the point of socket creation
• receive packets sent to the socket
• for each packet received, print its sender and its content
The client side has a similar list:
• look up the server's IP address, using DNS
• create an anonymous socket; we don't care what the client's port number is
• read a line from the terminal, and send it to the socket address ⟨server_IP,port⟩
[Link] The Server
We will start with the server side, presented here in Java. We will use port 5432; the socket-creation and
port-binding operations are combined into the single operation new DatagramSocket(destport).
Once created, this socket will receive packets from any host that addresses a packet to it; there is no need
for preliminary connection. We also need a DatagramPacket object that contains the packet data and source ⟨IP_address,port⟩ for arriving packets. The server application does not acknowledge anything sent to it, or in fact send any response at all.
The server application needs no parameters; it just starts. (That said, we could make the port number a
parameter, to allow easy change. The port we use here, 5432, has also been adopted by PostgreSQL for TCP
connections.) The server accepts both IPv4 and IPv6 connections; we return to this below.
Though it plays no role in the protocol, we will also have the server time out every 15 seconds and display
a message, just to show how this is done; implementations of real protocols essentially always must arrange, when attempting to receive a packet, to time out after a certain interval with no response. The file below is
at udp_stalks.java.
/* simplex-talk server, UDP version */
import java.net.*;
import java.io.*;

public class stalks {

    static public int destport = 5432;
    static public int bufsize = 512;
    static public final int timeout = 15000; // time in milliseconds

    static public void main(String args[]) {
        DatagramSocket s;               // UDP uses DatagramSockets

        try {
            s = new DatagramSocket(destport);
        }
        catch (SocketException se) {
            System.err.println("cannot create socket with port " + destport);
            return;
        }

        try {
            s.setSoTimeout(timeout);    // set timeout in milliseconds
        } catch (SocketException se) {
            System.err.println("socket exception: timeout not set!");
        }

        // create DatagramPacket object for receiving data:
        DatagramPacket msg = new DatagramPacket(new byte[bufsize], bufsize);

        while(true) { // read loop
            try {
                msg.setLength(bufsize);     // max received packet size
                s.receive(msg);             // the actual receive operation
                System.err.println("message from <" +
                    msg.getAddress().getHostAddress() + "," + msg.getPort() + ">");
            } catch (SocketTimeoutException ste) {  // receive() timed out
                System.err.println("Response timed out!");
                continue;
            } catch (IOException ioe) {             // should never happen!
                System.err.println("Bad receive");
                break;
            }

            String str = new String(msg.getData(), 0, msg.getLength());
            System.out.print(str);          // newline must be part of str
        }
        s.close();
    } // end of main
}
in which case only packets sent to the host and port through the host's specific IP address local_addr
would be delivered. It does not matter here whether IP forwarding on the host has been enabled. In the
original C socket library, this binding of a port to (usually) a server socket was done with the bind() call.
To allow connections via any of the host's IP addresses, the special IP address INADDR_ANY is passed to
bind().
When a host has multiple IP addresses, the standard socket library does not provide a way to find out to which of these an arriving UDP packet was actually sent. Normally, however, this is not a major difficulty. If
a host has only one interface on an actual network (ie not counting loopback), and only one IP address for
that interface, then any remote clients must send to that interface and address. Replies (if any, which there
are not with stalk) will also come from that address.
Multiple interfaces do not necessarily create an ambiguity either; the easiest such case to experiment with
involves use of the loopback and Ethernet interfaces (though one would need to use an application that,
unlike stalk, sends replies). If these interfaces have respective IPv4 addresses [Link] and [Link],
and the client is run on the same machine, then connections to the server application sent to [Link] will
be answered from [Link], and connections sent to [Link] will be answered from [Link]. The
IP layer sees these as different subnets, and fills in the IP source-address field according to the appropriate
subnet. The same applies if multiple Ethernet interfaces are involved, or if a single Ethernet interface is
assigned IP addresses for two different subnets, eg [Link] and [Link].
Life is slightly more complicated if a single interface is assigned multiple IP addresses on the same subnet,
eg [Link] and [Link]. Regardless of which address a client sends its request to, the server's reply
will generally always come from one designated address for that subnet, eg [Link]. Thus, it is possible
that a legitimate UDP reply will come from a different IP address than that to which the initial request was
sent.
If this behavior is not desired, one approach is to create multiple server sockets, and to bind each of the
host's network IP addresses to a different server socket.
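A sketch of this multiple-socket approach; the address list here is hypothetical, and a real server would enumerate its own interface addresses:

```java
import java.net.*;

public class MultiBind {
    // Bind a UDP socket to one specific local IP address (port 0 = any free port).
    static DatagramSocket bindTo(String addr, int port) throws Exception {
        return new DatagramSocket(port, InetAddress.getByName(addr));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical list; on a multihomed host this would contain each of the
        // host's own addresses, eg "10.0.0.37" and "10.0.1.37".
        String[] localAddrs = {"127.0.0.1"};
        for (String a : localAddrs) {
            DatagramSocket s = bindTo(a, 5432);
            // This socket now receives only packets addressed to address a, and
            // replies sent on it will carry a as their source address.
            System.out.println("bound " + s.getLocalAddress().getHostAddress()
                               + ":" + s.getLocalPort());
            s.close();
        }
    }
}
```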
[Link] The Client
Next is the Java client version udp_stalkc.java. The client (any client) must provide the name of the host to which it wishes to send; as with the port number this can be hard-coded into the application but is more commonly specified by the user. The version here uses host localhost as a default but accepts any
other hostname as a command-line argument. The call to InetAddress.getByName(desthost) invokes the DNS system, which looks up name desthost and, if successful, returns an IP address. (InetAddress.getByName() also accepts addresses in numeric form, eg [Link], in which case DNS is not necessary.) When we create the socket we do not designate a port in the call to new DatagramSocket(); this means any port will do for the client. When we create the DatagramPacket
object, the first parameter is a zero-length array as the actual data array will be provided within the loop.
A certain degree of messiness is introduced by the need to create a BufferedReader object to handle
terminal input.
// simplex-talk CLIENT in java, UDP version
import java.net.*;
import java.io.*;

public class stalkc {

    static public BufferedReader bin;