FBOSS: Building Switch Software at Scale

Sean Choi* (Stanford University), Boris Burkov, Alex Eckert, Tian Fang, Saman Kazemkhani, Rob Sherwood, Ying Zhang, and Hongyi Zeng (Facebook, Inc.)

* Work done while at Facebook, Inc.
ABSTRACT
The conventional software running on network devices, such as switches and routers, is typically vendor-supplied, proprietary and closed-source; as a result, it tends to contain extraneous features that a single operator will most likely not fully utilize. Furthermore, cloud-scale data center networks often have software and operational requirements that may not be well addressed by the switch vendors.

In this paper, we present our ongoing experiences in overcoming the complexity and scaling issues that we face when designing, developing, deploying and operating an in-house software built to manage and support a set of features required for data center switches of a large scale Internet content provider. We present FBOSS, our own data center switch software, which is designed on the basis of our switch-as-a-server and deploy-early-and-iterate principles. We treat software running on data center switches like any other software service that runs on a commodity server. We also build and deploy only a minimal number of features and iterate on them. These principles allow us to rapidly iterate, test, deploy and manage FBOSS at scale. Over the last five years, our experiences show that FBOSS's design principles allow us to quickly build a stable and scalable network. As evidence, we have successfully grown the number of FBOSS instances running in our data center by over 30x over a two year period.

CCS CONCEPTS
• Networks → Data center networks; Programming interfaces; Routers;

KEYWORDS
FBOSS, Facebook, Switch Software Design, Data Center Networks, Network Management, Network Monitoring

ACM Reference Format:
Sean Choi, Boris Burkov, Alex Eckert, Tian Fang, Saman Kazemkhani, Rob Sherwood, Ying Zhang, and Hongyi Zeng. 2018. FBOSS: Building Switch Software at Scale. In SIGCOMM '18: SIGCOMM 2018, August 20–25, 2018, Budapest, Hungary. ACM, New York, NY, USA, 15 pages. [Link]

1 INTRODUCTION
The world's desire to produce, consume, and distribute online content is increasing at an unprecedented rate. Commensurate with this growth are equally unprecedented technical challenges in scaling the underlying networks. Large Internet content providers are forced to innovate upon all aspects of their technology stack, including hardware, kernel, compiler, and various distributed systems building blocks. A driving factor is that, at scale, even a relatively modest efficiency improvement can have large effects. For us, our data center networks power a cloud-scale Internet content provider with billions of users, interconnecting hundreds of thousands of servers. Thus, it is natural and necessary to innovate on the software that runs on switches.¹

Conventional switches typically come with software written by vendors. The software includes drivers for managing dedicated packet forwarding hardware (e.g., ASICs, FPGAs, or NPUs), routing protocols (e.g., BGP, OSPF, STP, MLAG), monitoring and debugging features (e.g., LLDP, BFD, OAM), configuration interfaces (e.g., conventional CLI, SNMP, NetConf, OpenConfig), and a long tail of other features needed to run a modern switch. Implicit in the vendor model is the assumption that networking requirements are correlated between customers. In other words, vendors are successful because they can create a small number of products and reuse them across many customers. However, our network size and the rate of growth of the network (Figure 1) are unlike most other data center networks. Thus, they imply that our requirements are quite different from those of most customers.

¹ We use "switch" for general packet switching devices such as switches and routers. Our data center networks are fully Layer 3 routed, similar to what is described in [36].

Figure 1: Growth in the number of switches in our data center over a two year period, as measured by the number of total FBOSS deployments (roughly a 30x increase over 24 months).

One of the main technical challenges in running large networks is managing the complexity of excess networking features. Vendors supply common software intended for their entire customer base, thus their software includes the union of all features requested by all customers over the lifetime of the product. However, more features lead to more code and more code interactions, which ultimately lead to increased bugs, security holes, operational complexity, and downtime. To mitigate these issues, many data centers are designed for simplicity and only use a carefully selected subset of networking features. For example, Microsoft's SONiC focuses on building a "lean stack" in switches [33].

Another of our network scaling challenges is enabling a high rate of innovation while maintaining network stability. It is important to be able to test and deploy new ideas at scale in a timely manner. However, inherent in the vendor-supplied software model is that changes and features are prioritized by how well they correlate across all of their customers. A common example we cite is IPv6 forwarding, which was implemented by one of our vendors very quickly due to widespread customer demand. However, an important feature for our operational workflow was fine-grained monitoring of IPv6, which we quickly implemented for our own operational needs. Had we left this feature to the demands of the customer market and to be implemented by the vendors, we would not have had this feature until over four years later, which was when the feature actually arrived in the market.

In recent years, the practice of building network switch components has become more open. First, network vendors emerged that do not build their own packet forwarding chips. Instead they rely on third-party silicon vendors, commonly known as "merchant silicons". Then, merchant silicon vendors along with box/chassis manufacturers have emerged that create a new, disaggregated ecosystem where networking hardware can be purchased without any software. As a result, it is now possible for end-customers to build a complete custom switch software stack from scratch.

Thanks to this trend, we started an experiment of building our in-house designed switch software five years ago. Our server fleet already runs thousands of different software services. We wanted to see if we could run switch software in a similar way to how we run software services. This model is quite different from how conventional networking software is managed. Table 1 summarizes the differences between the two high-level approaches using a popular analogy [17].

The result is Facebook Open Switching System (FBOSS), which is now powering a significant portion of our data center infrastructure. In this paper, we report on five years of experiences on building, deploying and managing FBOSS. The main goals of this paper are:
(1) Provide context about the internal workings of the software running on switches, including challenges, design trade-offs, and opportunities for improvement, both in the abstract for all network switch software and in our specific pragmatic design decisions.
(2) Describe the design, automated tooling for deployment and monitoring, and remediation methods of FBOSS.
(3) Provide experiences and illustrative problems encountered in managing a cloud-scale data center switch software.
(4) Encourage new research in the more accessible/open field of switch software and provide a vehicle, an open source version of FBOSS [4], for existing network research to be evaluated on real hardware.

The rest of the paper closely follows the structure of Table 1 and is structured as follows: We first provide a couple of design principles that guide FBOSS's development and deployment (Section 2). Then, we briefly describe the major hardware components that most data center switch software needs to manage (Section 3) and summarize the specific design decisions made in our system (Section 4). We then describe the corresponding deployment and management goals and lessons (Section 5, Section 6). We describe three operational challenges (Section 7) and then discuss how we have successfully overcome them. We further discuss various topics that led to our final design (Section 8) and provide a road map for future work (Section 9).

2 DESIGN PRINCIPLES
We designed FBOSS with two high-level design principles: (1) Deploy and evolve the software on our switches the same way as we do on our servers (Switch-as-a-Server). (2) Use early deployment and fast iteration to force ourselves to have a minimally complex network that only uses features that are strictly needed (Deploy-Early-and-Iterate). These principles have been echoed in the industry; a few other customized switch software efforts, such as Microsoft ACS [8]/SONiC [33], are based on similar motivations.

Switch Software vs. General Software:
Hardware (S3): Closed, custom embedded systems with limited CPU/Mem resources vs. open, general-purpose servers with spare and fungible CPU/Mem resources.
Release Cycle (S5): Planned releases every 6-12 months vs. continuous deployment.
Testing (S5): Manual testing and slow production roll out vs. continuous integration in production.
Resiliency Goals (S5): High device-level availability with a target of 99.999% device-level uptime vs. high system-level availability with redundant service instances.
Upgrades (S6): Scheduled downtime and a manual update process vs. uninterrupted service and automated deployment.
Configuration (S6): Decentralized and manually managed with custom backup solutions vs. centrally controlled, automatically generated and distributed, with version controlled backups.
Monitoring (S6): Custom scripting on top of SNMP counters vs. a rich ecosystem of data collection, analytics and monitoring software libraries and tools.

Table 1: Comparison of development and operation patterns between conventional switch software and general software services, based on a popular analogy [17].

However, one thing to note is that our design principles are specific to our own infrastructure. The data center network at Facebook has multiple internal components, such as Robotron [48], FbNet [46], Scuba [15] and Gorilla [42], that are meticulously built to work with one another, and FBOSS is no different. Thus, our design has the specific goal of easing the integration of FBOSS into our existing infrastructure, which ultimately means that it may not generalize to any data center. Given this, we specifically focus on describing the effects of these design principles in terms of our software architecture, deployment, monitoring, and management.

2.1 Switch-as-a-Server
A motivation behind this principle comes from our experiences in building large scale software services. Even though many of the same technical and scaling challenges apply equally to switch software as to general distributed software systems, historically, they have been addressed quite differently. For us, the general software model has been more successful in terms of reliability, agility, and operational simplicity. We deploy thousands of software services that are not feature-complete or bug-free. However, we carefully monitor our services and once any abnormality is found, we quickly make a fix and deploy the change. We found this practice to be highly successful in building and scaling our services.

For example, database software is an important part of our business. Rather than using a closed, proprietary vendor-provided solution that includes unnecessary features, we started an open source distributed database project and modified it heavily for internal use. Given that we have full access to the code, we can precisely customize the software for the desired feature set and thereby reduce complexity. Also, we make daily modifications to the code and, using the industry practices of continuous integration and staged deployment, are able to rapidly test and evaluate the changes in production. In addition, we run our databases on commodity servers, rather than running them on custom hardware, so that both the software and the hardware can be easily controlled and debugged. Lastly, since the code is open source, we make our changes available back to the world and benefit from discussions and bug fixes produced by external contributors.

Our experiences with general software services showed that this principle is largely successful in terms of scalability, code reuse, and deployment. Therefore, we designed FBOSS based on the same principle. However, since data center networks have different operational requirements than a general software service, there are a few caveats to naively adopting this principle; these are discussed in Section 8.

2.2 Deploy-Early-and-Iterate
Our initial production deployments were intentionally lacking in features. Bucking conventional network engineering wisdom, we went into production without implementing a long list of "must have" features, including control plane policing, ARP/NDP expiration, IP fragmentation/reassembly, or Spanning Tree Protocol (STP). Instead of implementing these features, we prioritized building the infrastructure and tooling to efficiently and frequently update the switch software, e.g., the warm boot feature (Section 7.1).

Keeping with our motivation to evolve the network quickly and reduce complexity, we hypothesized that we could dynamically derive the actual minimal network requirements by iteratively deploying switch software into production, observing what breaks, and quickly rolling out the fixes. By starting small and relying on application-level fault tolerance, a small initial team of developers was able to go from nothing to code running in production in an order of magnitude fewer person-years than in typical switch software development.

Perhaps more importantly, using this principle, we were able to derive and build the simplest possible network for our environment and have a positive impact on the production network sooner. For example, when we discovered that the lack of control plane policing was causing BGP session time-outs, we quickly implemented and deployed it to fix the problem. By having a positive impact on the production network early, we were able to make a convincing case for additional engineers and more help. To date, we still do not implement IP fragmentation/reassembly, STP, or a long list of widely believed "must have" features.

3 HARDWARE PLATFORM
To provide FBOSS's design context, we first review what typical switch hardware contains. Some examples are a switch application-specific integrated circuit (ASIC), a port subsystem, a Physical Layer subsystem (PHY), a CPU board, complex programmable logic devices, and event handlers. The internals of a typical data center switch are shown in Figure 2 [24].

Figure 2: A typical data center switch architecture, including the power supply, fans, temperature sensors, x86 CPU, SSD, BMC, switch ASIC, CPLD, and QSFP ports.

Figure 3: Average CPU utilization of FBOSS across various types of switches (Wedge40, Wedge100, Wedge100S) in one of Facebook's data centers.

3.1 Components
Switch ASIC. The switch ASIC is the most important hardware component on a switch. It is a specialized integrated circuit for fast packet processing, capable of switching packets at up to 12.8 terabits per second [49]. Switches can augment the switch ASIC with other processing units, such as FPGAs [53] or x86 CPUs, albeit at far lower performance [52]. A switch ASIC has multiple components: memory, typically either CAM, TCAM or SRAM [19], that stores information that needs to be quickly accessed by the ASIC; a parse pipeline, consisting of a parser and a deparser, which locates, extracts, and saves the interesting data from the packet, and rebuilds the packet before egressing it [19]; and match-action units, which specify how the ASIC should process the packets based on the data inside the packet, the configured packet processing logic, and the data inside the ASIC memory.

PHY. The PHY is responsible for connecting the link-layer device, such as the ASIC, to the physical medium, such as an optical fiber, and translating analog signals from the link to digital Ethernet frames. In certain switch designs, the PHY can be built within the ASIC. At high speeds, electrical signal interference is so significant that it causes packet corruption inside a switch. Therefore, complex noise reduction techniques, such as PHY tuning [43], are needed. PHY tuning controls various parameters such as preemphasis, variable power settings, or the type of Forward Error Correction algorithm to use.

Port Subsystem. The port subsystem is responsible for reading port configurations, detecting the type of ports installed, initializing the ports, and providing interfaces for the ports to interact with the PHY. Data center switches house multiple Quad Small Form-factor Pluggable (QSFP) ports. A QSFP port is a compact, hot-pluggable transceiver used to interface switch hardware to a cable, enabling data rates up to 100Gb/s. The type and the number of QSFP ports are determined by the switch specifications and the ASIC.

FBOSS interacts with the port subsystem by assigning dynamic lane mappings and adapting to port change events. Dynamic lane mapping refers to mapping the multiple lanes in each of the QSFPs to appropriate port virtual IDs. This allows changing of port configurations without having to restart the switch. FBOSS monitors the health of the ports, and once any abnormality is detected, FBOSS performs remediation steps, such as reviving the port or rerouting the traffic to a live port.
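To make dynamic lane mapping more concrete, the following C++ sketch groups the electrical lanes of a QSFP module into logical port IDs for different breakout modes. The LaneKey and LaneMap types, the assumption of four lanes per module, and the remapQsfp function are hypothetical illustrations written for this paper, not FBOSS's actual port subsystem code.

#include <cstdint>
#include <iostream>
#include <map>
#include <tuple>

// Hypothetical sketch: each QSFP module exposes four electrical lanes that can
// be grouped into one 100G port or split into four 25G breakout ports.
struct LaneKey {
  uint16_t qsfpModule;  // front panel QSFP index
  uint8_t lane;         // lane within the module, 0..3
  bool operator<(const LaneKey& o) const {
    return std::tie(qsfpModule, lane) < std::tie(o.qsfpModule, o.lane);
  }
};

using PortId = uint32_t;
using LaneMap = std::map<LaneKey, PortId>;

// Regroup the four lanes of one module into numPorts logical ports
// (1 => one 100G port, 4 => four 25G ports) without restarting the switch.
void remapQsfp(LaneMap& map, uint16_t module, int numPorts, PortId basePort) {
  int lanesPerPort = 4 / numPorts;
  for (uint8_t lane = 0; lane < 4; ++lane) {
    map[{module, lane}] = basePort + lane / lanesPerPort;
  }
}

int main() {
  LaneMap map;
  remapQsfp(map, 0, 1, 100);  // module 0 as a single 100G port
  remapQsfp(map, 1, 4, 104);  // module 1 as four 25G breakout ports
  for (const auto& [key, port] : map) {
    std::cout << "qsfp " << key.qsfpModule << " lane " << int(key.lane)
              << " -> port " << port << "\n";
  }
}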

CPU Board. There exists a CPU board within a switch that runs a microserver [39]. A CPU board closely resembles a commodity server, containing a commodity x86 CPU, RAM and a storage medium. In addition to these standard parts, a CPU board has a PCI-E interconnect to the switch ASIC that enables quick driver calls to the ASIC. The presence of an x86 CPU enables installation of commodity Linux to provide general OS functionalities. CPUs within switches are conventionally underpowered compared to server-grade CPUs. However, FBOSS is designed under the assumption that the CPUs in the switches are as powerful as server-grade CPUs, so that the switch can run as many required server services as possible. Fortunately, we designed and built our data center switches in-house, giving us flexibility to choose our own CPUs that fit within our design constraints. For example, our Wedge 100 switch houses a Quad Core Intel E3800 CPU. We over-provision the CPU so that the switch CPU runs under 40% utilization, to account for any bursty events that could shut down the switch. Such a design choice can be seen across the various types of switches that we deploy, as seen in Figure 3. The size allocated for the CPU board limited us from including an even more powerful CPU [24].

Miscellaneous Board Managers. A switch offloads miscellaneous functions from the CPU and the ASIC to various components to improve overall system performance. Two examples of such components are the Complex Programmable Logic Device (CPLD) and the Baseboard Management Controller (BMC). The CPLDs are responsible for status monitoring, LED control, fan control and managing front panel ports. The BMC is a specialized system-on-chip that has its own CPU, memory, storage, and interfaces to connect to sensors and CPLDs. The BMC manages power supplies and fans. It also provides system management functions such as remote power control, serial over LAN, out-of-band monitoring and error logging, and a pre-OS environment for users to install an OS onto the microserver. The BMC is controlled by custom software such as OpenBMC [25].

The miscellaneous board managers introduce additional complexities for FBOSS. For example, FBOSS retrieves QSFP control signals from the CPLDs, a process that requires complex interactions with the CPLD drivers.

3.2 Event Handlers
Event handlers enable the switch to notify any external entities of its internal state changes. The mechanics of a switch event handler are very similar to those of any other hardware-based event handlers, and the handlers can be invoked in either a synchronous or asynchronous fashion. We discuss two switch-specific event handlers: the link event handler and the slow path packet handler.

Link Event Handler. The link event handler notifies the ASIC and FBOSS of any events that occur in the QSFP ports or the port subsystem. Such events include link up and down events and changes in link configurations. The link status handler is usually implemented with a busy polling method where the switch software has an active thread that constantly monitors the PHY for link status and then calls the user-supplied callbacks when changes are detected. FBOSS provides a callback to the link event handler and syncs its local view of the link states when the callback is activated.

Slow Path Packet Handler. Most switches allow packets to egress to a designated CPU port, the slow path. Similar to the link status handler, the slow path packet handler constantly polls the CPU port. Once a packet is received at a CPU port, the slow path packet handler notifies the switch software of the captured packet and activates the supplied callback. The callback is supplied with various information, which may include the actual packet that is captured. This allows the slow path packet handler to greatly extend a switch's feature set, as it enables custom packet processing without having to change the data plane's functionality. For example, one can sample a subset of the packets for in-band monitoring or modify the packets to include custom information. However, as indicated by its name, the slow path packet handler is too slow to perform custom packet processing at line rate. Thus it is only suitable for use cases that involve using only a small sample of the packets that the switch receives.

4 FBOSS
To manage the switches as described in Section 3, we developed FBOSS, vendor-agnostic switch software that can run on a standard Linux distribution. FBOSS is currently deployed to both ToR and aggregation switches in our production data centers. FBOSS's code base is publicly available as an open source project and it is supported by a growing community. As of January 2018, a total of 91 authors have contributed to the project and the codebase now spans 609 files and 115,898 lines of code. To give a sense of how many lines of code a feature may take to implement, implementing link aggregation in FBOSS required 5,932 lines of newly added code. Note that this can be highly variable depending on the feature of interest, and some features may not be easily divisible from one another. Figure 4 shows the growth of the open source project since its inception. The big jump in the size of the codebase that occurred in September of 2017 is a result of adding a large number of hardcoded parameters for FBOSS to support a particular vendor NIC.

Figure 4: Growth of the FBOSS open source project, in lines of code, from March 2015 to September 2017.

FBOSS is responsible for managing the switch ASIC and providing a higher level remote API that translates down to specific ASIC SDK methods. The external processes include management, control, routing, configuration, and monitoring processes. Figure 5 illustrates FBOSS, the other software processes and the hardware components in a switch. Note that in our production deployment, FBOSS shares the same Linux environment (e.g., OS version, packaging system) as our server fleet, so that we can utilize the same system tools and libraries on both servers and switches.

Figure 5: Switch software and hardware components.

4.1 Architecture
FBOSS consists of multiple interconnected components that we categorize as follows: Switch Software Development Kit (SDK), HwSwitch, hardware abstraction layer, SwSwitch, state observers, local config generator, a Thrift [2] management interface, and QSFP service. The FBOSS agent is the main process that runs most of FBOSS's functionalities. The switch SDK is bundled and compiled with the FBOSS agent, but is provided externally by the switch ASIC vendor. All of the other components besides the QSFP service, which runs as its own independent process, reside inside the FBOSS agent. We discuss each component in detail, except the local config generator, which we will discuss in Section 6.

Figure 6: Architecture overview of FBOSS (the FBOSS agent containing the state observers, local config generator, Thrift management interface, SwSwitch, and HwSwitch on top of the switch SDK, alongside the QSFP service).

Switch SDK. A switch SDK is ASIC vendor-provided software that exposes APIs for interacting with low-level ASIC functions. These APIs include ASIC initialization, installing forwarding table rules, and listening to event handlers.

HwSwitch. The HwSwitch represents an abstraction of the switch hardware. The interfaces of HwSwitch provide generic abstractions for configuring switch ports, sending and receiving packets on these ports, and registering callbacks for state changes on the ports and packet input/output events that occur on these ports. Aside from the generic abstractions, ASIC-specific implementations are pushed to the hardware abstraction layer, allowing switch-agnostic interaction with the switch hardware. While not a perfect abstraction, FBOSS has been ported to two ASIC families and more ports are in progress. An example of a HwSwitch implementation can be found here [14].

Hardware Abstraction Layer. FBOSS allows users to easily add an implementation that supports a specific ASIC by extending the HwSwitch interface. This also allows easy support for multiple ASICs without making changes to the main FBOSS code base. The custom implementation must support the minimal set of functionalities that are specified in the HwSwitch interface. However, given that HwSwitch only specifies a small number of features, FBOSS allows the custom implementation to include additional features. For example, the open-source version of FBOSS implements custom features such as specifying link aggregation, adding an ASIC status monitor, and configuring ECMP.

SwSwitch. The SwSwitch provides the hardware-independent logic for switching and routing packets, and interfaces with the HwSwitch to transfer the commands down to the switch ASIC. Some examples of the features that SwSwitch provides are interfaces for L2 and L3 tables, ACL entries, and state management.
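The division of labor between the hardware-independent SwSwitch and the ASIC-specific HwSwitch implementations can be pictured with the simplified C++ sketch below. The interface shown is deliberately tiny and the class and method names (HwSwitchIf, VendorXHwSwitch, programRoute) are illustrative only; the real HwSwitch and SwSwitch interfaces in the open source tree are much richer.

#include <cstdint>
#include <iostream>
#include <memory>
#include <string>

// Simplified stand-in for a route programming request.
struct Route {
  std::string prefix;    // e.g. "10.0.0.0/24"
  uint32_t nextHopPort;  // egress port id
};

// Hardware abstraction: everything the switching logic needs from an ASIC,
// and nothing vendor specific.
class HwSwitchIf {
 public:
  virtual ~HwSwitchIf() = default;
  virtual void init() = 0;
  virtual void programRoute(const Route& r) = 0;
};

// One hypothetical ASIC-specific implementation, living in the hardware
// abstraction layer and (in a real port) calling into the vendor SDK.
class VendorXHwSwitch : public HwSwitchIf {
 public:
  void init() override { std::cout << "vendor-X SDK init\n"; }
  void programRoute(const Route& r) override {
    std::cout << "programming " << r.prefix << " -> port " << r.nextHopPort << "\n";
  }
};

// Hardware-independent switching logic: decides *what* to program,
// while the HwSwitch implementation decides *how*.
class SwSwitch {
 public:
  explicit SwSwitch(std::unique_ptr<HwSwitchIf> hw) : hw_(std::move(hw)) {
    hw_->init();
  }
  void addRoute(const Route& r) { hw_->programRoute(r); }

 private:
  std::unique_ptr<HwSwitchIf> hw_;
};

int main() {
  SwSwitch sw(std::make_unique<VendorXHwSwitch>());
  sw.addRoute({"10.0.0.0/24", 42});
}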

State Observers. The SwSwitch makes it possible to implement low-level control protocols such as ARP, NDP, LACP, and LLDP by keeping the protocols apprised of state changes². The protocols are notified of state changes via a mechanism called state observation. Specifically, any object at the time of its initialization may register itself as a State Observer. By doing so, every future state change invokes a callback provided by the object. The callback provides the state change in question, allowing the object to react accordingly. For example, NDP registers itself as a State Observer so that it may react to port change events. In this way, the state observation mechanism allows protocol implementations to be decoupled from issues pertaining to state management.

² The other functions include control packet transmission and reception and programming of the switch ASIC and hardware.

Thrift Management Interface. We run our networks in a split control configuration. Each FBOSS instance contains a local control plane, running protocols such as BGP or OpenR [13], on a microserver that communicates with a centralized network management system through a Thrift management interface. The types of messages that are sent between them are of the form seen in Figure 7. The full open-source specification of the FBOSS Thrift interface is also available [5]. Given that the interfaces can be modified to fit our needs, Thrift provides us with a simple and flexible way to manage and operate the network, leading to increased stability and high availability. We discuss the details of the interactions between the Thrift management interface and the centralized network management system in Section 6.

struct L2EntryThrift {
  1: string mac,
  2: i32 port,
  3: i32 vlanID,
}
list<L2EntryThrift> getL2Table()
  throws (1: [Link] error)

Figure 7: Example of a Thrift interface definition to retrieve L2 entries from the switch.

QSFP Service. The QSFP service manages a set of QSFP ports. This service detects QSFP insertion or removal, reads QSFP product information (e.g., manufacturer), controls QSFP hardware functions (i.e., changes power configuration), and monitors the QSFPs. FBOSS initially had the QSFP service within the FBOSS agent. However, as the service continued to evolve, we had to restart the FBOSS agent and the switch to apply the changes. Thus, we separated the QSFP service into a separate process to improve FBOSS's modularity and reliability. As a result, the FBOSS agent is more reliable, as any restarts or bugs in the QSFP service do not affect the agent directly. However, since the QSFP service is a separate process, it needs separate tools for packaging, deployment, and monitoring. Also, careful process synchronization between the QSFP service and the FBOSS agent is now required.

4.2 State Management
FBOSS's software state management mechanism is designed for high concurrency, fast reads, and easy and safe updates. The state is modeled as a versioned copy-on-write tree [37]. The root of the tree is the main switch state class, and each child of the root represents a different category of the switch state, such as ports or VLAN entries. When an update happens to one branch of the tree, every node in the branch, all the way to the root, is copied and updated if necessary. Figure 8 illustrates a switch state update process invoked by an update on a VLAN ARP table entry. We can see that only the nodes and the links starting from the modified ARP table up to the root are recreated. While the creation of the new tree occurs, the FBOSS agent still interacts with the prior states without needing to capture any locks on the state. Once the copy-on-write process completes for the entire tree, FBOSS reads from the new switch state.

Figure 8: Illustration of FBOSS's switch state update through the copy-on-write tree mechanism.

There are multiple benefits to this model. First, it allows for easy concurrency, as there are no read locks. Reads can still continue to happen while a new state is created, and the states are only created or destroyed and never modified. Secondly, versioning of states is much simpler. This allows easier debugging, logging, and validation of each state and its transitions. Lastly, since we log all the state transitions, it is possible to perform a restart and then restore the state to its pre-restart form. There are also some disadvantages to this model. Since every state change results in a new switch state object, the update process requires more processing. Secondly, the implementation of switch states is more complex than simply obtaining locks and updating a single object.

Hardware Specific State. The hardware states are the states that are kept inside the ASIC itself. Whenever a hardware state needs to be updated in software, the software must call the switch SDK to retrieve the new states. The FBOSS HwSwitch obtains both read and write locks on the corresponding parts of the hardware state until the update completes. The choice of lock based state updates may differ based on the SDK implementation.
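A minimal C++ sketch of the copy-on-write update described in Section 4.2 is shown below: adding an ARP entry clones only the ARP table, its VLAN, and the root, while readers holding a pointer to the old root keep seeing a consistent, unmodified snapshot. The structures here (SwitchState, Vlan, ArpTable) are simplified stand-ins for illustration, not the actual FBOSS classes.

#include <iostream>
#include <map>
#include <memory>
#include <string>

// Simplified copy-on-write switch state: root -> VLANs -> ARP tables.
struct ArpTable {
  std::map<std::string, std::string> ipToMac;  // ip -> mac
};

struct Vlan {
  int id = 0;
  std::shared_ptr<const ArpTable> arp = std::make_shared<ArpTable>();
};

struct SwitchState {
  std::map<int, std::shared_ptr<const Vlan>> vlans;
};

// Produce a *new* state with one extra ARP entry. Only the ARP table, its
// VLAN, and the root are copied; every other node is shared with the old
// state, so readers of the old root see an untouched snapshot.
std::shared_ptr<const SwitchState> addArpEntry(
    const std::shared_ptr<const SwitchState>& old,
    int vlanId, const std::string& ip, const std::string& mac) {
  auto newArp = std::make_shared<ArpTable>(*old->vlans.at(vlanId)->arp);
  newArp->ipToMac[ip] = mac;

  auto newVlan = std::make_shared<Vlan>(*old->vlans.at(vlanId));
  newVlan->arp = newArp;

  auto newState = std::make_shared<SwitchState>(*old);
  newState->vlans[vlanId] = newVlan;
  return newState;
}

int main() {
  auto vlan1 = std::make_shared<Vlan>();
  vlan1->id = 1;
  auto s0 = std::make_shared<SwitchState>();
  s0->vlans[1] = vlan1;

  auto s1 = addArpEntry(s0, 1, "10.0.0.2", "aa:bb:cc:dd:ee:ff");
  // The old snapshot is untouched; the new one has the extra entry.
  std::cout << "old entries: " << s0->vlans.at(1)->arp->ipToMac.size()
            << ", new entries: " << s1->vlans.at(1)->arp->ipToMac.size() << "\n";
}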

Figure 9: Culprits of switch outages over a month. Software issues account for about 60% of outages (e.g., miscellaneous software issues, microserver reboots, kernel panics, and unresponsive microservers), while hardware issues account for about 40% (e.g., miscellaneous hardware issues, PCI-E timeouts, SSD issues, bus degradation, and loss of power).

5 TESTING AND DEPLOYMENT
Switch software is conventionally developed and released by switch vendors and is closed and proprietary. Therefore, a new release of the switch software can take months, with lengthy development and manual QA test cycles. In addition, given that software update cycles are infrequent, an update usually contains a large number of changes that can introduce new bugs that did not exist previously. In contrast, typical large scale software deployment processes are automated, fast, and contain a smaller set of changes per update. Furthermore, feature deployments are coupled with automated and incremental testing mechanisms to quickly check and fix bugs. Our outage records (Figure 9) show that about 60% of the switch outages are caused by faulty software. This is similar to the known rate of software failures in data center devices, which is around 51% [27]. To minimize the occurrences and impact of these outages, FBOSS adopts agile, reliable and scalable large scale software development and testing schemes.

Instead of using an existing automatic software deployment framework like Chef [3] or Jenkins [6], FBOSS employs its own deployment software called fbossdeploy. One of the main reasons for developing our own deployment software is to allow for a tighter feedback loop with existing external monitors. We have several existing external monitors that continuously check the health of the network. These monitors check for attributes such as link failures, slow BGP convergence times, network reachability and more. While existing deployment frameworks that are built for deploying generic software are good at preventing the propagation of software related bugs, such as deadlocks or memory leaks, they are not built to detect and prevent network-wide failures, as these failures may be hard to detect from a single node. Therefore, fbossdeploy is built to react quickly to network-wide failures, such as reachability failures, that may occur during deployment.

The FBOSS deployment process is very similar to other continuous deployment processes [22] and is split into three distinct parts: continuous canary, daily canary and staged deployment. Each of these parts serves a specific purpose to ensure a reliable deployment. We currently operate roughly at a monthly deployment cycle, which includes both canaries and staged deployment, to ensure high operational stability.

Continuous Canary. The continuous canary is a process that automatically deploys all newly committed code in the FBOSS repository to a small number of switches that are running in production, around 1-2 switches per type of switch, and monitors the health of the switch and the adjacent switches for any failures. Once a failure is detected, continuous canary will immediately revert the latest deployment and restore the last stable version of the code. Continuous canary is able to quickly catch errors related to switch initialization, such as issues with warm boot, configuration errors and unpredictable race conditions.

Daily Canary. The daily canary is a process that follows continuous canary to test the new commit at a longer timescale with more switches. Daily canary runs once a day and deploys the latest commit that has passed the continuous canary. Daily canary deploys the commit to around 10 to 20 switches per type of switch. Daily canary runs throughout the day to capture bugs that slowly surface over time, such as memory leaks or performance regressions in critical threads. This is the final phase before a network-wide deployment.

Staged Deployment. Once daily canary completes, a human operator intervenes to push the latest code to all of the switches in production. This is the only step of the entire deployment process that involves a human operator, and it takes roughly a day to complete entirely. The operator runs a deployment script with the appropriate parameters to slowly push the latest code into a subset of the switches at a time. Once the number of failed switches exceeds a preset threshold, usually around 0.5% of the entire switch fleet, the deployment script stops and asks the operator to investigate the issues and take appropriate actions. The reasons for keeping the final step manual are as follows: First, a single server is fast enough to deploy the code to all of the switches in the data center, meaning that the deployment process is not bottlenecked by one machine deploying the code. Secondly, it gives fine grained monitoring over the unpredicted bugs that may not be caught by the existing monitors. For example, we fixed unpredicted and persistent reachability losses, such as inadvertently changed interface IP or port speed configurations, and transient outages such as port flaps, that we found during staged deployment. Lastly, we are still improving our testing, monitoring and deployment system. Thus, once the test coverage and automated remediation are within a comfortable range, we plan to automate the last step as well.
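The staged-deployment gate can be summarized by a small control loop like the hypothetical C++ sketch below: push the build to the fleet in batches and pause for operator investigation once cumulative failures cross a preset threshold (about 0.5% of the fleet). The batch size and the deployTo stand-in are assumptions made for illustration, not fbossdeploy's actual logic.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Stand-in for "install the new build on one switch and run health checks on
// it and its neighbors"; returns false when any check fails.
bool deployTo(const std::string& sw) {
  (void)sw;
  return true;
}

int main() {
  std::vector<std::string> fleet(10000, "switch");  // placeholder fleet
  const std::size_t batchSize = 100;
  const double failureThreshold = 0.005;            // 0.5% of the fleet

  std::size_t failures = 0;
  for (std::size_t i = 0; i < fleet.size(); i += batchSize) {
    for (std::size_t j = i; j < std::min(i + batchSize, fleet.size()); ++j) {
      if (!deployTo(fleet[j])) ++failures;
    }
    if (failures > failureThreshold * fleet.size()) {
      std::cout << "too many failures (" << failures
                << "); pausing rollout for operator investigation\n";
      return 1;
    }
  }
  std::cout << "rollout complete with " << failures << " failures\n";
}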

6 MANAGEMENT
In this section, we present how FBOSS interacts with the management system and discuss the advantages of FBOSS's design from a network management perspective. Figure 10 shows a high-level overview of the interactions.

Figure 10: FBOSS interacts with a central network management system via the Thrift management interface.

6.1 Configurations
FBOSS is designed to be used in a highly controlled data center network with a central network manager. This greatly simplifies the process of generation and deployment of network configurations across a large number of switches.

Configuration Design. The configuration of network devices is highly standardized in data center environments. Given a specific topology, each device is automatically configured by using templates and auto-generated configuration data. For example, the IP address configuration for a switch is determined by the type of the switch (e.g., ToR or aggregation) and its upstream/downstream neighbors in the cluster.

Configuration Generation and Deployment. The configuration data is generated by our network management system called Robotron [48] and is distributed to each switch. The local config generator in the FBOSS agent then consumes the configuration data and creates an active config file. If any modification is made to the data file, a new active config file is generated and the old configuration is stored as a staged config file. There are multiple advantages to this configuration process. First, it disallows multiple entities from modifying the configuration concurrently, which limits inconsistencies in the configuration. Secondly, it makes the configuration reproducible and deterministic, since the configurations are versioned and the FBOSS agent always reads the latest configuration upon restart. And lastly, it avoids manual configuration errors. On the other hand, there are also disadvantages to our fully automated configuration system: it lacks a complex human interactive CLI, which makes manual debugging difficult; also, there is no support for incremental configuration changes, which makes each configuration change require a restart of the FBOSS agent.

6.2 Draining
Draining is the process of safely removing an aggregation switch from service. ToR switches are generally not drained, unless all of the services under the ToR switch are drained as well. Similarly, undraining is the process of restoring the switch's previous configuration and bringing it back into service. Due to frequent feature updates and deployments performed on a switch, draining and undraining a switch is one of the major operational tasks that is performed frequently. However, draining is conventionally a difficult operational task, due to tight timing requirements and simultaneous configuration changes across multiple software components on the switch [47]. In comparison, FBOSS's draining/undraining operation is made much simpler thanks to the automation and the version control mechanism in the configuration management design. Our method of draining a switch is as follows: (1) The FBOSS agent retrieves the drained BGP configuration data from a central configuration database. (2) The central management system triggers the draining process via the Thrift management interface. (3) The FBOSS agent activates the drained config and restarts the BGP daemon with the drained config. As for the undraining process, we repeat the above steps, but with an undrained configuration. Then, as a final added step, the management system pings the FBOSS agent and queries the switch statistics to ensure that the undraining process is successful. Draining is an example where FBOSS's Thrift management interface and the centrally managed configuration snapshots significantly simplify an operational task.

6.3 Monitoring and Failure Handling
Traditionally, data center operators use standardized network management protocols, such as SNMP [21], to collect switch statistics, such as CPU/memory utilization, link load, packet loss, and miscellaneous system health, from the vendor network devices. In contrast, FBOSS allows external systems to collect switch statistics through two different interfaces: a Thrift management interface and Linux system logs. The Thrift management interface serves queries in the form specified in the Thrift model. This interface is mainly used to monitor high-level switch usage and link statistics. Given that FBOSS runs as a Linux process, we can also directly access the system logs of the switch microserver. These logs are specifically formatted to log the category of events and failures. This allows the management system to monitor low-level system health and hardware failures. Given the statistics that it collects, our monitoring system, called FbFlow [46], stores the data to a database, either Scuba [15] or Gorilla [42], based on the type of the data. Once the data is stored, it enables our engineers to query and analyze the data at a high level over a long time period. Monitoring data, and graphs such as Figure 3, can easily be obtained from the monitoring system.

To go with the monitoring system, we also implemented an automated failure remediation system. The main purpose of the remediation system is to automatically detect and recover from software or hardware failures. It also provides deeper insights for human operators to ease the debugging process. The remediation process is as follows. Once a failure is detected, the remediation system automatically categorizes each failure to a set of known root causes, applies remediations if needed, and logs the details of the outage to a datastore. The automatic categorization and remediation of failures allows us to focus our debugging efforts on undiagnosed errors rather than repeatedly debugging the same known issues. Also, the extensive logs help us derive insights such as isolating a rare failure to a particular hardware revision or kernel version.

In summary, our approach has the following advantages:
Flexible Data Model. Traditionally, supporting a new type of data to collect or modifying an existing data model requires modifications and standardization of the network management protocols and then time for vendors to implement the standards. In contrast, since we control the device, the monitoring data dissemination via FBOSS and the data collection mechanism through the management system, we can easily define and modify the collection specification. We explicitly define the fine-grained counters we need and instrument the devices to report those counters.
Improved Performance. Compared to conventional monitoring approaches, FBOSS has better performance as the data transfer protocol can be customized to reduce both collection time and network load.
Remediation with Detailed Error Logs. Our system allows the engineers to focus on building remediation mechanisms for unseen bugs, which consequently improves network stability and debugging efficiency.

7 EXPERIENCES
While our experience of operating a data center network with custom switch software and hardware has been mostly satisfactory, we have faced outages that were previously unseen and are unique to our development and deployment model.

7.1 Side Effect of Infrastructure Reuse
For improved efficiency, our data centers deploy a network topology with a single ToR switch, which implies that the ToR switches are a single point of failure for the hosts in the rack. As a result, frequent FBOSS releases made on the ToR switches need to be non-disruptive to ensure availability of the services running on those hosts. To accomplish this, we use an ASIC feature called "warm boot". Warm boot allows FBOSS to restart without affecting the forwarding tables within the ASIC, effectively allowing the data plane to continue to forward traffic while the control plane is being restarted. Although this feature is highly attractive and has allowed us to achieve our desired release velocity, it also greatly complicates the state management between FBOSS, the routing daemons, the switch SDK and the ASIC. Thus, we share a case where warm boot and our code reuse practices have resulted in a major outage.

Despite the fact that we have a series of testing and monitoring processes for new code deployments, it is inevitable for bugs to leak into data center-wide deployments. The most difficult type of bugs to debug are the ones that appear rarely and inconsistently. For example, our BGP daemon has a graceful restart feature to prevent warm boots from affecting the neighbor devices when BGP sessions are torn down by FBOSS restarts or failures [38]. The graceful restart has a timeout before declaring BGP sessions broken, which effectively puts a time constraint on the total time a warm boot operation can take. In one of our deployments, we found that the Kerberos [7] library, which FBOSS and many other software services use to secure communication between servers, caused outages for a small fraction of switches in our data center. We realized that the reason for the outages was that the library often took a long time to join the FBOSS agent thread. Since the timing and availability constraints for other software services are more lenient than FBOSS's warm boot requirements, the existing monitors were not built to detect such rare performance regressions.

Takeaway: Simply reusing widely-used code, libraries or infrastructure that are tuned for generic software services may not work out of the box with switch software.
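To make the timing constraint concrete, the hypothetical C++ sketch below wraps a control-plane restart and checks whether it fit inside the BGP graceful-restart window; if it did not, neighbors would declare the sessions dead. The durations, the restart stand-in, and the function name are assumptions made for illustration and are not taken from FBOSS.

#include <chrono>
#include <functional>
#include <iostream>
#include <thread>

using namespace std::chrono;

// Hypothetical warm-boot wrapper: the data plane keeps forwarding while the
// control plane restarts, but the restart must finish before the neighbors'
// BGP graceful-restart timers expire.
bool warmBootWithinBudget(const std::function<void()>& restartAgent,
                          milliseconds gracefulRestartWindow) {
  auto start = steady_clock::now();
  restartAgent();  // state save, agent restart, state restore
  auto elapsed = duration_cast<milliseconds>(steady_clock::now() - start);
  if (elapsed >= gracefulRestartWindow) {
    std::cout << "warm boot took " << elapsed.count()
              << " ms, exceeding the " << gracefulRestartWindow.count()
              << " ms graceful-restart window; BGP sessions may flap\n";
    return false;
  }
  std::cout << "warm boot finished in " << elapsed.count() << " ms\n";
  return true;
}

int main() {
  // Simulated restart that is slowed down by an unrelated library call
  // (e.g., a long thread join), as in the outage described above.
  auto slowRestart = [] { std::this_thread::sleep_for(milliseconds(120)); };
  warmBootWithinBudget(slowRestart, milliseconds(100));
}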

7.2 Side Effect of Rapid Deployment
During the first few months of our initial FBOSS deployment, we occasionally encountered unknown cascading outages of multiple switches. The outage would start with a single device and would spread to nearby devices, resulting in very high packet loss within a cluster. Sometimes the network would recover on its own, sometimes not. We realized that the outages were more likely to occur if a deployment went awry, yet they were quite difficult to debug because we had deployed a number of new changes simultaneously, as it was our initial FBOSS deployment.

We eventually noticed that the loss was usually limited to a multiple of 16 devices. This pointed towards a configuration in our data center called backup groups. Prior to deploying FBOSS, the most common type of failure within our data center was the failure of a single link leading to a black-holing of traffic [36]. In order to handle such failures, a group (illustrated on the left side of Figure 11) of ToR switches is designated to provide backup routes if the most direct route to a destination becomes unavailable. The backup routes are pre-computed and statically configured for faster failover.

Figure 11: Overview of cascading outages seen by a failed ToR switch within a backup group.

We experienced an outage where a failure of a ToR resulted in a period where packets ping-ponged between the backup ToRs and the aggregation switches, which incorrectly assumed that the backup routes were available. This resulted in a loop in the backup routes. The right side of Figure 11 illustrates the creation of path loops. The loop eventually resulted in huge CPU spikes on all the backup switches. The main reason for the CPU spikes was that FBOSS was not correctly removing the failed routes from the forwarding table and was also generating TTL expired ICMP packets for all packets that had ping-ponged back and forth 255 times. Given that we had not seen this behavior before, we had no control plane policing in place and sent all packets with a TTL of 0 to the FBOSS agent. The rate at which the FBOSS agent could process these packets was far lower than the rate at which we were receiving the frames, so we would fall further and further behind and starve out the BGP keep-alive and withdraw messages we needed for the network to converge. Eventually BGP peerings would expire, but since we were already in the looping state, this often made matters worse and caused the starvation to last indefinitely. We added a set of control plane fixes and the network became stable even through multiple ToR failures.

Takeaway: A feature that works well for conventional networks may not work well for networks deploying FBOSS. This is a side effect of rapid deployment, as entire switch outages occur more frequently than in conventional networks. Thus, one must be careful in adopting features that are known to be stable in conventional networks.

7.3 Resolving Interoperability Issues
Although we developed and deployed switches that are built in-house, we still need the switches and FBOSS to interoperate with different types of network devices for various reasons. We share an experience where the design of FBOSS allowed an interoperability issue to be quickly resolved.

When configuring link aggregation between FBOSS and a particular line of vendor devices, we discovered that flapping the logical aggregate interface on the vendor device could disable all IP operations on that interface. A cursory inspection revealed that, while the device had, as expected, engaged in Duplicate Address Detection (DAD) [50] for the aggregate interface's address, it had unexpectedly detected a duplicate address in the corresponding subnet. This behavior was isolated to a race condition between LACP and DAD's probe, wherein an artifact of the hardware support for link aggregation could cause DAD's Neighbor Solicitation packet to be looped back to the vendor switch. In accordance with the DAD specification, the vendor device had interpreted the looped back Neighbor Solicitation packet as another node engaging in DAD for the same address, which the DAD specification mandates should cause the switch to disable IP operation on the interface on which DAD has been invoked. We also found that interconnecting the same vendor device with a different vendor's switch would exhibit the same symptom.

Flapping of interfaces is a step performed by our network operators during routine network maintenance. To ensure that the maintenance could still be performed in a non-disruptive manner, we modified the FBOSS agent to avoid the scenario described above. In contrast, in response to our report of this bug to the vendor whose switch exhibited the same behavior as ours, the vendor recommended that the other vendors implement an extension to DAD. By having entire control over our switch software, we were able to quickly provide what was necessary for our network.

8 DISCUSSION

Existing Switch Programming Standards. Over time, many software standards have been proposed to open up various aspects of the software on the switch. On the academic side, there are decades of approaches to opening up various aspects of switches, including active networking [35], FORCES [32], PCE [26], and OpenFlow [40]. On the industry side, upstart vendors have tried to compete with incumbents on being more open (e.g., JunOS's SDK access program, Arista's SDK program) and the incumbents have responded with their own open initiatives (e.g., I2RS, Cisco's OnePK). On both the academic and industry sides, there are also numerous control plane and management plane protocols that similarly try to make the switch software more programmable and configurable. Each of these attempts has its own set of trade-offs and subset of supported hardware. Thus, one could argue that some synthesis of these standards could be "the one perfect API" that gives us the functionality we want. So, why didn't we just use or improve upon one of these existing standards?

The problem is that these existing standards are all "top down": they are additional software and protocols layered on top of the existing vendor software rather than replacements for it. That means that if we ever wanted to change the underlying, unexposed software, we would still be limited by what our vendors were willing to support and by their timelines. By controlling the entire software stack "bottom up", we control all the possible states and code on the switch and can expose any API we want, on our own schedule. Even more importantly, we can experiment with the APIs we expose and evolve them over time for our specific needs, allowing us to quickly meet our production needs.

FBOSS as a Building Block for Larger Switches. While originally developed for ToR, single-ASIC style switches, we have adapted FBOSS as a building block to run larger, multi-ASIC chassis switches as well. We have designed and deployed our own chassis-based switch with removable line cards that supports 128x100Gbps links with full bisection connectivity. Internally, this switch is composed of eight line cards, each with its own CPU and ASIC, connected in a logical Clos topology to four fabric cards, also with their own CPU and ASIC.

We run an instance of FBOSS on each of the twelve (eight line cards plus four fabric cards) CPUs and have them peer via BGP internally to the switch, logically creating a single high-capacity switch that runs the aggregation layers of our data centers. While appearing to be a new hardware design, the data plane of our switches closely follows conventional vendor-sourced chassis architectures. The main difference is that we do not deploy additional servers to act as supervisor cards and instead leverage our larger data center automation tooling and monitoring. While this design does not provide the same single logical switch abstraction that is provided by conventional vendor switches, it allows us to jump to larger switch form factors with no software architectural changes.

Implicit and Circular Dependency. One subtle but important problem we discovered when trying to run our switches like a server was hidden, implicit circular dependencies on the network. Specifically, all servers in our fleet run a standard set of binaries and libraries for logging, monitoring, and so on. By design, we wanted to run this existing software on our switches. Unfortunately, in some cases, the software built for the servers implicitly depended on the network, and when the FBOSS code depended on it, we created a circular dependency that prevented our network from initializing. Worse yet, these situations would only arise during other error conditions (e.g., when a daemon crashes) and were hard to debug. In one specific case, we initially deployed FBOSS onto switches using the same task scheduling and monitoring software used by other software services in our fleet, but we found that this software required access to the production network before it would run. As a result, we had to decouple our code from it and write our own custom task scheduling software to specifically manage FBOSS deployments. While this was an easier case to debug, each software package evolves and is maintained independently, so there is a constant threat of well-meaning but server-focused developers adding a subtle implicit dependency on the network. Our current solution is to continue to fortify our testing and deployment procedures.

9 FUTURE WORK

Partitioning FBOSS Agent. The FBOSS agent is currently a single monolithic binary consisting of multiple features. Similar to how the QSFP service was separated out to improve switch reliability, we plan to further partition the FBOSS agent into smaller binaries that run independently. For example, if the state observers exist as external processes that communicate with the FBOSS agent, an event that overwhelms the state observers no longer brings the FBOSS agent down with it.
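To make the intended failure isolation concrete, the following is a minimal sketch of that decoupling; the class name, the use of an in-process thread, and the drop-on-overflow policy are illustrative assumptions rather than the planned FBOSS design. The agent hands state updates to each observer through a bounded, non-blocking queue, so a slow or crashed observer loses updates instead of back-pressuring the agent.

    import queue
    import threading

    class StateObserverProxy:
        """Forwards switch-state updates to one external observer process."""

        def __init__(self, send_fn, max_pending=1000):
            # send_fn would be an RPC call (e.g., over Thrift) into the
            # observer process.
            self._queue = queue.Queue(maxsize=max_pending)
            self._send = send_fn
            self.dropped = 0  # exported as a counter so drops stay visible
            threading.Thread(target=self._drain, daemon=True).start()

        def publish(self, state_delta):
            # Called from the agent's state-update path; must never block.
            try:
                self._queue.put_nowait(state_delta)
            except queue.Full:
                self.dropped += 1

        def _drain(self):
            while True:
                delta = self._queue.get()
                try:
                    self._send(delta)
                except Exception:
                    # A wedged or crashed observer is its own failure domain;
                    # the agent keeps programming hardware and processing
                    # routes regardless.
                    self.dropped += 1

The agent would hold one such proxy per observer; whether publish() ever reaches the observer has no bearing on the agent's own forwarding or route-processing path.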
Novel Experiments. One of our main goals for FBOSS is to allow more and faster experimentation. We are currently experimenting with custom routing protocols, stronger slow path isolation (e.g., to deal with buggy experiments), micro-burst detection, macro-scale traffic monitoring, big data analytics of low-level hardware statistics to infer failure detection, and a host of other design elements. By making FBOSS open source and our research more public, we hope to aid researchers with tools and ideas to directly implement novel research ideas on production-ready software and hardware.

Programmable ASIC Support. FBOSS is designed to easily support multiple types of ASICs simultaneously. In fact, FBOSS has successfully iterated through different versions of ASICs without any major design changes. With the recent advent of programmable ASICs, we believe that it will be useful for FBOSS to support programmable ASICs [19] and the languages used to program them, such as P4 [18].

10 RELATED WORK

Existing Switch Software. There are various proprietary switch software implementations, often referred to as "Network OS", such as Cisco NX-OS [12] or Juniper JunOS [41], yet FBOSS is quite different from them. For example, FBOSS allows full access to the switch's Linux environment, giving users the flexibility to run custom processes for management or configuration. In comparison, conventional switch software is generally accessed through its own proprietary interfaces.

There is also various open-source switch software that runs on Linux, such as Open Network Linux (ONL) [30], OpenSwitch [11], Cumulus Linux [20] and Microsoft SONiC [33]. FBOSS is probably most comparable to SONiC: both are the result of running switch software at scale to serve ever-increasing data center network needs, and both have a similar architecture (hardware abstraction layer, state management module, etc.). One major difference between SONiC and FBOSS is that FBOSS is not a separate Linux distribution, but uses the same Linux OS and libraries as our large server fleet. This allows us to truly reuse many best practices for monitoring, configuring, and deploying server software. In general, open source communities around switch software are starting to grow, which is promising for FBOSS.

Finally, there are recent proposals to completely eliminate switch software [31, 51] from a switch. They provide new insights into the role of switch software and the future of data center switch design.

Centralized Network Control. In the recent Software-Defined Networking (SDN) movement, many systems (e.g., [28, 34]), sometimes also referred to as "Network OS", are built to realize centralized network control. While we rely on centralized configuration management and distributed BGP daemons, FBOSS is largely orthogonal to these efforts. In functionality, FBOSS is more comparable to software switches such as Open vSwitch [44], even if the implementation and performance characteristics are quite different. In fact, similar to how Open vSwitch uses OpenFlow, FBOSS's Thrift API can, in theory, interface with a central controller to provide more SDN-like functionality.

Large-scale Software Deployment. fbossdeploy is influenced by other cloud-scale [16] continuous integration frameworks that support continuous canary [45]. Some notable examples are Chef [3], Jenkins [6], Travis CI [10] and Ansible [1]. Unlike these frameworks, fbossdeploy is designed specifically for deploying switch software. It is capable of monitoring the network to perform network-specific remediations during the deployment process. In addition, fbossdeploy can deploy the switch software in a manner that considers the global network topology.

Network Management Systems. There are many network management systems built to interact with vendor-specific devices. For example, HP OpenView [23] has interfaces to control various vendors' switches. IBM Tivoli Netcool [29] handles various network events in real time for efficient troubleshooting and diagnosis. OpenConfig [9] recently proposed a unified vendor-agnostic configuration interface. Instead of using a standardized management interface, FBOSS provides programmable APIs that can be integrated with other network management systems in a vendor-agnostic manner.

11 CONCLUSION

This paper presents a retrospective on five years of developing, deploying, operating, and open sourcing switch software built for large-scale production data centers. When building and deploying our switch software, we departed from conventional methods and adopted techniques widely used to ensure scalability and resiliency when building and deploying general purpose software. We built a set of modular abstractions that keeps the software from being tied down to a specific set of features or hardware. We built a continuous deployment system that allows the software to be changed incrementally and rapidly, tested automatically, and deployed incrementally and safely. We built a custom management system that allows for simpler configuration management, monitoring and operations. Our approach has provided significant benefits that enabled us to quickly and incrementally grow our network size and features, while reducing software complexity.

ACKNOWLEDGMENT

Many people in the Network Systems team at Facebook have contributed to FBOSS over the years and toward this paper. In particular, we would like to acknowledge Sonja Keserovic, Srikanth Sundaresan and Petr Lapukhov for their extensive help with the paper. We also would like to thank Robert Soulé and Nick McKeown for providing ideas to initiate the paper. We would like to acknowledge Facebook for the resources it provided for us. And finally, we are also indebted to Omar Baldonado, our shepherd Hitesh Ballani, as well as the anonymous SIGCOMM reviewers for their comments and suggestions on earlier drafts.
REFERENCES
[1] [n. d.]. Ansible is Simple IT Automation. [Link]
[2] [n. d.]. Apache Thrift. [Link]
[3] [n. d.]. Chef. [Link]
[4] [n. d.]. FBOSS Open Source. [Link]
[5] [n. d.]. FBOSS Thrift Management Interface. [Link] facebook/fboss/blob/master/fboss/agent/if/[Link].
[6] [n. d.]. Jenkins. [Link]
[7] [n. d.]. Kerberos: The Network Authentication Protocol. [Link] edu/kerberos/.
[8] [n. d.]. Microsoft showcases the Azure Cloud Switch. [Link] microsoft-showcases-the-azure-cloud-switch-acs/.
[9] [n. d.]. OpenConfig. [Link]
[10] [n. d.]. Travis CI. [Link]
[11] 2016. OpenSwitch. [Link]
[12] 2017. Cisco NX-OS Software. [Link] ios-nx-os-software/nx-os-software/[Link].
[13] 2018. Facebook Open Routing Group. [Link] groups/openr/about/.
[14] 2018. HwSwitch implementation for Mellanox Switch. [Link] com/facebook/fboss/pull/67.
[15] Lior Abraham, John Allen, Oleksandr Barykin, Vinayak Borkar, Bhuwan Chopra, Ciprian Gerea, Daniel Merl, Josh Metzler, David Reiss, Subbu Subramanian, Janet L. Wiener, and Okay Zed. 2013. Scuba: Diving into Data at Facebook. Proc. VLDB Endow. 6, 11 (Aug. 2013), 1057–1067. [Link]
[16] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. 2010. A View of Cloud Computing. Commun. ACM 53, 4 (April 2010), 50–58. https://[Link]/10.1145/1721654.1721672
[17] Randy Bias. 2016. The History of Pets vs Cattle and How to Use the Analogy Properly. [Link] the-history-of-pets-vs-cattle/.
[18] Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, and David Walker. 2014. P4: Programming Protocol-independent Packet Processors. SIGCOMM Comput. Commun. Rev. 44, 3 (July 2014), 87–95. [Link]
[19] Pat Bosshart, Glen Gibb, Hun-Seok Kim, George Varghese, Nick McKeown, Martin Izzard, Fernando Mujica, and Mark Horowitz. 2013. Forwarding Metamorphosis: Fast Programmable Match-Action Processing in Hardware for SDN. In SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM). 99–110. [Link]
[20] Cumulus. [n. d.]. Cumulus Linux. [Link] products/cumulus-linux/.
[21] Harrington D., R. Presuhn, and Wijnen B. 2002. An Architecture for Describing Simple Network Management Protocol (SNMP) Management Frameworks. [Link]
[22] Sebastian Elbaum, Gregg Rothermel, and John Penix. 2014. Techniques for Improving Regression Testing in Continuous Integration Development Environments. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014). ACM, New York, NY, USA, 235–245. https://[Link]/10.1145/2635868.2635910
[23] HP Enterprise. [n. d.]. HP Openview. [Link] en-us/products/application-lifecycle-management/overview.
[24] Facebook. 2017. Wedge 100S 32x100G Specification. [Link] [Link]/products/facebook-wedge-100s-32x100g/.
[25] Tian Fang. 2015. Introducing OpenBMC: an open software framework for next-generation system management. [Link] posts/1601610310055392.
[26] A. Farrel, J.-P. Vasseur, and J. Ash. 2006. A Path Computation Element (PCE)-Based Architecture. Technical Report. Internet Engineering Task Force.
[27] Phillipa Gill, Navendu Jain, and Nachiappan Nagappan. 2011. Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications. In Proceedings of the ACM SIGCOMM 2011 Conference (SIGCOMM '11). ACM, New York, NY, USA, 350–361. [Link]
[28] Natasha Gude, Teemu Koponen, Justin Pettit, Ben Pfaff, Martín Casado, Nick McKeown, and Scott Shenker. 2008. NOX: Towards an Operating System for Networks. SIGCOMM Comput. Commun. Rev. 38, 3 (July 2008), 105–110. [Link]
[29] IBM. [n. d.]. Tivoli Netcool/OMNIbus. [Link] software/products/en/ibmtivolinetcoolomnibus.
[30] Big Switch Networks Inc. 2013. Open Network Linux. https://[Link]/.
[31] Xin Jin, Nathan Farrington, and Jennifer Rexford. 2016. Your Data Center Switch is Trying Too Hard. In Proceedings of the Symposium on SDN Research (SOSR '16). ACM, New York, NY, USA, Article 12, 6 pages. [Link]
[32] D Joachimpillai and JH Salim. 2004. Forwarding and Control Element Separation (forces). [Link]
[33] Yousef Khalidi. 2017. SONiC: The networking switch software that powers the Microsoft Global Cloud. [Link]
[34] Teemu Koponen, Martin Casado, Natasha Gude, Jeremy Stribling, Leon Poutievski, Min Zhu, Rajiv Ramanathan, Yuichiro Iwata, Hiroaki Inoue, Takayuki Hama, and Scott Shenker. 2010. Onix: A Distributed Control Platform for Large-scale Production Networks. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 351–364. [Link]
[35] David L. Tennenhouse and David J. Wetherall. 2000. Towards an Active Network Architecture. 26 (07 2000), 14.
[36] P. Lapukhov, A. Premji, and Mitchell J. 2016. Use of BGP for Routing in Large-Scale Data Centers. [Link]
[37] Ville Lauriokari. 2009. Copy-On-Write 101. [Link] copy-on-write-101-part-1-what-is-it/.
[38] K. Lougheed, Cisco Systems, and Y. Rkhter. 1989. A Border Gateway Protocol (BGP). [Link]
[39] R. P. Luijten, A. Doering, and S. Paredes. 2014. Dual function heat-spreading and performance of the IBM/ASTRON DOME 64-bit microserver demonstrator. In 2014 IEEE International Conference on IC Design Technology. 1–4. [Link] 6838613
[40] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jennifer Rexford, Scott Shenker, and Jonathan Turner. 2008. OpenFlow: Enabling Innovation in Campus Networks. SIGCOMM Comput. Commun. Rev. 38, 2 (March 2008), 69–74. https://[Link]/10.1145/1355734.1355746
[41] Juniper Networks. 2017. Junos OS. [Link] products-services/nos/junos/.
[42] Tuomas Pelkonen, Scott Franklin, Justin Teller, Paul Cavallaro, Qi Huang, Justin Meza, and Kaushik Veeraraghavan. 2015. Gorilla: A Fast, Scalable, In-memory Time Series Database. Proc. VLDB Endow. 8, 12 (Aug. 2015), 1816–1827. [Link]
[43] A.D. Persson, C.A.C. Marcondes, and D.P. Johnson. 2013. Method and system for network stack tuning. [Link] US8467390 US Patent 8,467,390.
[44] Ben Pfaff, Justin Pettit, Teemu Koponen, Ethan Jackson, Andy Zhou, Jarno Rajahalme, Jesse Gross, Alex Wang, Joe Stringer, Pravin Shelar, Keith Amidon, and Martin Casado. 2015. The Design and Implementation of Open vSwitch. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15). USENIX Association, Oakland, CA, 117–130. [Link] technical-sessions/presentation/pfaff
[45] Danilo Sato. 2014. Canary Release. [Link] [Link].
[46] Brandon Schlinker, Hyojeong Kim, Timothy Cui, Ethan Katz-Bassett, Harsha V. Madhyastha, Italo Cunha, James Quinn, Saif Hasan, Petr Lapukhov, and Hongyi Zeng. 2017. Engineering Egress with Edge Fabric: Steering Oceans of Content to the World. In Proceedings of the ACM SIGCOMM 2017 Conference (SIGCOMM '17). ACM, New York, NY, USA, 418–431. [Link]
[47] Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, and Amin Vahdat. 2015. Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network. SIGCOMM Comput. Commun. Rev. 45, 4 (Aug. 2015), 183–197. [Link]
[48] Yu-Wei Eric Sung, Xiaozheng Tie, Starsky H.Y. Wong, and Hongyi Zeng. 2016. Robotron: Top-down Network Management at Facebook Scale. In Proceedings of the ACM SIGCOMM 2016 Conference (SIGCOMM '16). ACM, New York, NY, USA, 426–439. https://[Link]/10.1145/2934872.2934874
[49] David Szabados. 2017. Broadcom Ships Tomahawk 3, Industry's Highest Bandwidth Ethernet Switch Chip at 12.8 Terabits per Second. [Link] irol-newsArticle&ID=2323373.
[50] S Thomson, Narten T., and Jinmei T. 2007. IPv6 Stateless Address Autoconfiguration. [Link]
[51] F. Wang, L. Gao, S. Xiaozhe, H. Harai, and K. Fujikawa. 2017. Towards reliable and lightweight source switching for datacenter networks. In IEEE INFOCOM 2017 - IEEE Conference on Computer Communications. 1–9. [Link]
[52] Jun Xiao. 2017. New Approach to OVS Datapath Performance. http://[Link]/support/boston2017/[Link].
[53] Xilinx. [n. d.]. Lightweight Ethernet Switch. [Link] com/applications/wireless-communications/wireless-connectivity/[Link].