DPDK Prog Guide-19.11
Release 19.11.10
1 Introduction 1
1.1 Documentation Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Related Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Overview 3
2.1 Development Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Environment Abstraction Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Core Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.1 Ring Manager (librte_ring) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.2 Memory Pool Manager (librte_mempool) . . . . . . . . . . . . . . . . . . . . 4
2.3.3 Network Packet Buffer Management (librte_mbuf) . . . . . . . . . . . . . . . 4
2.3.4 Timer Manager (librte_timer) . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Ethernet* Poll Mode Driver Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Packet Forwarding Algorithm Support . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.6 librte_net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.4 Malloc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4.1 Cookies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4.2 Alignment and NUMA Constraints . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4.3 Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4.4 Internal Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4 Service Cores 24
4.1 Service Core Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Enabling Services on Cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3 Service Core Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5 RCU Library 25
5.1 What is Quiescent State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2 Factors affecting the RCU mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.3 RCU in DPDK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.4 How to use this library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6 Ring Library 29
6.1 References for Ring Implementation in FreeBSD* . . . . . . . . . . . . . . . . . . . . 29
6.2 Lockless Ring Buffer in Linux* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.3 Additional Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.3.1 Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.4 Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.5 Anatomy of a Ring Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.5.1 Single Producer Enqueue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.5.2 Single Consumer Dequeue . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.5.3 Multiple Producers Enqueue . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.5.4 Modulo 32-bit Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
7 Stack Library 40
7.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
7.1.1 Lock-based Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
7.1.2 Lock-free Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8 Mempool Library 42
8.1 Cookies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.2 Stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.3 Memory Alignment Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.4 Local Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.5 Mempool Handlers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.6 Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
9 Mbuf Library 46
9.1 Design of Packet Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
9.2 Buffers Stored in Memory Pools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
9.3 Constructors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
9.4 Allocating and Freeing mbufs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
9.5 Manipulating mbufs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
9.6 Meta Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
9.7 Direct and Indirect Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
9.8 Debug . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
9.9 Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
11.9.4 Unsupported actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
11.9.5 Flow rules priority . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
11.10 Future evolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
15.4.1 Enqueue / Dequeue Burst APIs . . . . . . . . . . . . . . . . . . . . . . . . . . 130
15.4.2 Operation Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
15.4.3 Operation Management and Allocation . . . . . . . . . . . . . . . . . . . . . . 131
15.4.4 BBDEV Inbound/Outbound Memory . . . . . . . . . . . . . . . . . . . . . . 131
15.4.5 BBDEV Turbo Encode Operation . . . . . . . . . . . . . . . . . . . . . . . . 132
15.4.6 BBDEV Turbo Decode Operation . . . . . . . . . . . . . . . . . . . . . . . . 134
15.4.7 BBDEV LDPC Encode Operation . . . . . . . . . . . . . . . . . . . . . . . . 136
15.4.8 BBDEV LDPC Decode Operation . . . . . . . . . . . . . . . . . . . . . . . . 139
15.5 Sample code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
15.5.1 BBDEV Device API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
17.3.1 Operation Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
17.3.2 Operation Management and Allocation . . . . . . . . . . . . . . . . . . . . . . 165
17.3.3 Passing source data as mbuf-chain . . . . . . . . . . . . . . . . . . . . . . . . 165
17.3.4 Operation Status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
17.3.5 Operation status after enqueue / dequeue . . . . . . . . . . . . . . . . . . . . . 166
17.3.6 Produced, Consumed And Operation Status . . . . . . . . . . . . . . . . . . . 166
17.4 Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
17.5 Compression API Hash support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
17.6 Compression API Stateless operation . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
17.6.1 priv_xform in Stateless operation . . . . . . . . . . . . . . . . . . . . . . . . . 167
17.6.2 Stateless and OUT_OF_SPACE . . . . . . . . . . . . . . . . . . . . . . . . . 169
17.6.3 Hash in Stateless . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
17.6.4 Checksum in Stateless . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
17.7 Compression API Stateful operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
17.7.1 Stream in Stateful operation . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
17.7.2 Stateful and OUT_OF_SPACE . . . . . . . . . . . . . . . . . . . . . . . . . . 172
17.7.3 Hash in Stateful . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
17.7.4 Checksum in Stateful . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
17.8 Burst in compression API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
17.8.1 Enqueue / Dequeue Burst APIs . . . . . . . . . . . . . . . . . . . . . . . . . . 172
17.9 Sample code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
17.9.1 Compression Device API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
21 Timer Library 197
21.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
21.2 Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
21.3 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
25.2.4 Use Case: IPv4 Forwarding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
25.2.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
32.4 Supported GSO Packet Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
32.4.1 TCP/IPv4 GSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
32.4.2 UDP/IPv4 GSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
32.4.3 VxLAN GSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
32.4.4 GRE GSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
32.5 How to Segment a Packet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
37.2.1 Init and Config . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
37.2.2 Setting up Queues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
37.2.3 Setting up Ports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
37.2.4 Linking Queues and Ports . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
37.2.5 Starting the EventDev . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
37.2.6 Ingress of New Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
37.2.7 Forwarding of Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
37.2.8 Egress of Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
37.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
41.2 API Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
41.2.1 Create an adapter instance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
41.2.2 Querying adapter capabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
41.2.3 Adding queue pair to the adapter instance . . . . . . . . . . . . . . . . . . . . 291
41.2.4 Configure the service function . . . . . . . . . . . . . . . . . . . . . . . . . . 291
41.2.5 Set event request/response information . . . . . . . . . . . . . . . . . . . . . . 291
41.2.6 Start the adapter instance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
41.2.7 Get adapter statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
45.1 Design Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
45.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
45.3 Port Library Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
45.3.1 Port Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
45.3.2 Port Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
45.4 Table Library Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
45.4.1 Table Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
45.4.2 Table Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
45.4.3 Hash Table Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
45.5 Pipeline Library Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
45.5.1 Connectivity of Ports and Tables . . . . . . . . . . . . . . . . . . . . . . . . . 360
45.5.2 Port Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
45.5.3 Table Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
45.6 Multicore Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
45.6.1 Shared Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
45.7 Interfacing with Accelerators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
49.3 Supported features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
49.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
56 Performance Optimization Guidelines 397
56.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
60 Glossary 405
CHAPTER ONE
INTRODUCTION
This document provides software architecture information, development environment information and
optimization guidelines.
For programming examples and for instructions on compiling and running each sample application, see the DPDK Sample Applications User Guide.
For general information on compiling and running applications, see the DPDK Getting Started Guide.
Programmer’s Guide, Release 19.11.10
CHAPTER TWO
OVERVIEW
This section gives a global overview of the architecture of Data Plane Development Kit (DPDK).
The main goal of the DPDK is to provide a simple, complete framework for fast packet processing in
data plane applications. Users may use the code to understand some of the techniques employed, to
build upon for prototyping or to add their own protocol stacks. Alternative ecosystem options that use
the DPDK are available.
The framework creates a set of libraries for specific environments through the creation of an Environment
Abstraction Layer (EAL), which may be specific to a mode of the Intel® architecture (32-bit or 64-bit),
Linux* user space compilers or a specific platform. These environments are created through the use of
make files and configuration files. Once the EAL library is created, the user may link with the library
to create their own applications. Other libraries, outside of EAL, including the Hash, Longest Prefix
Match (LPM) and rings libraries are also provided. Sample applications are provided to help show the
user how to use various features of the DPDK.
The DPDK implements a run-to-completion model for packet processing, where all resources must be allocated prior to calling data plane applications, running as execution units on logical processing cores.
The model does not support a scheduler and all devices are accessed by polling. The primary reason for
not using interrupts is the performance overhead imposed by interrupt processing.
In addition to the run-to-completion model, a pipeline model may also be used by passing packets or
messages between cores via the rings. This allows work to be performed in stages and may allow more
efficient use of code on cores.
See the DPDK Getting Started Guide for information on setting up the development environment.
[Figure: Core Components Architecture ("X uses Y" diagram). rte_mbuf (manipulation of packet buffers carrying network data) uses rte_mempool (handles a pool of objects, using a ring to store them; allows bulk enqueue/dequeue and a per-CPU cache), which in turn uses rte_ring (a fixed-size lockless FIFO for storing objects in a table). rte_timer provides timer facilities, based on the HPET interface provided by the EAL. All components are built on rte_malloc and rte_eal + libc.]
This library provides an API to allocate and free mbufs and to manipulate the packet buffers that carry network packets.
Network Packet Buffer Management is described in Mbuf Library.
2.6 librte_net
The librte_net library is a collection of IP protocol definitions and convenience macros. It is based on
code from the FreeBSD* IP stack and contains protocol numbers (for use in IP headers), IP-related
macros, IPv4/IPv6 header structures and TCP, UDP and SCTP header structures.
CHAPTER THREE
ENVIRONMENT ABSTRACTION LAYER
The Environment Abstraction Layer (EAL) is responsible for gaining access to low-level resources such
as hardware and memory space. It provides a generic interface that hides the environment specifics from
the applications and libraries. It is the responsibility of the initialization routine to decide how to allocate
these resources (that is, memory space, devices, timers, consoles, and so on).
Typical services expected from the EAL are:
• DPDK Loading and Launching: The DPDK and its application are linked as a single application
and must be loaded by some means.
• Core Affinity/Assignment Procedures: The EAL provides mechanisms for assigning execution
units to specific cores as well as creating execution instances.
• System Memory Reservation: The EAL facilitates the reservation of different memory zones, for
example, physical memory areas for device interactions.
• Trace and Debug Functions: Logs, dump_stack, panic and so on.
• Utility Functions: Spinlocks and atomic counters that are not provided in libc.
• CPU Feature Identification: Determine at runtime if a particular feature, for example, Intel® AVX
is supported. Determine if the current CPU supports the feature set that the binary was compiled
for.
• Interrupt Handling: Interfaces to register/unregister callbacks to specific interrupt sources.
• Alarm Functions: Interfaces to set/remove callbacks to be run at a specific time.
Note: Initialization of objects, such as memory zones, rings, memory pools, lpm tables and hash tables,
should be done as part of the overall application initialization on the master lcore. The creation and
initialization functions for these objects are not multi-thread safe. However, once initialized, the objects
themselves can safely be used in multiple threads simultaneously.
Note: Memory reservations done using the APIs provided by rte_malloc are also backed by pages from
the hugetlbfs filesystem.
[Figure: EAL Initialization in a Linux Application Environment. main() calls rte_eal_init(), which runs rte_eal_memory_init(), rte_eal_logs_init(), rte_eal_pci_init() and so on, then launches per_lcore_app_init() on each lcore with rte_eal_remote_launch() and waits for completion with rte_eal_mp_wait_lcore(). Each lcore runs per_lcore_app_init() and then waits; finally the application itself is launched on every lcore with rte_eal_remote_launch(app).]
This way, the memory allocator will ensure that, whatever memory mode is in use, either the reserved memory will satisfy the requirements, or the allocation will fail.
There is no need to preallocate any memory at startup using the -m or --socket-mem command-line parameters. However, it is still possible to do so, in which case the preallocated memory will be "pinned" (that is, never released by the application back to the system). It will still be possible to allocate more hugepages and deallocate those, but the preallocated pages will not be freed. If neither -m nor --socket-mem is specified, no memory is preallocated, and all memory is allocated at runtime, as needed.
Another option available in dynamic memory mode is the --single-file-segments command-line option. This option puts pages in single files (one per memseg list), as opposed to creating a file per page. This is normally not needed, but can be useful for use cases like userspace vhost, where there is a limited number of page file descriptors that can be passed to VirtIO.
If the application (or DPDK-internal code, such as device drivers) wishes to receive notifications about newly allocated memory, it is possible to register for memory event callbacks via the rte_mem_event_callback_register() function. This will call a callback function any time DPDK's memory map has changed.
If the application (or DPDK-internal code, such as device drivers) wishes to be notified about memory allocations above a specified threshold (and have a chance to deny them), allocation validator callbacks are also available via the rte_mem_alloc_validator_callback_register() function.
A default validator callback is provided by the EAL. It can be enabled with the --socket-limit command-line option and provides a simple way to limit the maximum amount of memory that can be used by a DPDK application.
Warning: Memory subsystem uses DPDK IPC internally, so memory allocations/callbacks and IPC
must not be mixed: it is not safe to allocate/free memory inside memory-related or IPC callbacks,
and it is not safe to use IPC inside memory-related callbacks. See chapter Multi-process Support for
more details about DPDK IPC.
• 32-bit support
Additional restrictions are present when running in 32-bit mode. In dynamic memory mode, by default a maximum of 2 gigabytes of VA space will be preallocated, and all of it will be on the master lcore's NUMA node unless the --socket-mem flag is used.
In legacy mode, VA space will only be preallocated for segments that were requested (plus padding, to
keep IOVA-contiguousness).
• Maximum amount of memory
All possible virtual memory space that can ever be used for hugepage mapping in a DPDK process is
preallocated at startup, thereby placing an upper limit on how much memory a DPDK application can
have. DPDK memory is stored in segment lists, each segment is strictly one physical page. It is possible
to change the amount of virtual memory being preallocated at startup by editing the following config
variables:
• CONFIG_RTE_MAX_MEMSEG_LISTS controls how many segment lists DPDK can have
• CONFIG_RTE_MAX_MEM_MB_PER_LIST controls how many megabytes of memory each segment list can address
• CONFIG_RTE_MAX_MEMSEG_PER_LIST controls how many segments each segment list can have
• CONFIG_RTE_MAX_MEMSEG_PER_TYPE controls how many segments each memory type can have (where a "type" is defined as a "page size + NUMA node" combination)
• CONFIG_RTE_MAX_MEM_MB_PER_TYPE controls how many megabytes of memory each memory type can address
• CONFIG_RTE_MAX_MEM_MB places a global maximum on the amount of memory DPDK can
reserve
Normally, these options do not need to be changed.
Note: Preallocated virtual memory is not to be confused with preallocated hugepage memory! All
DPDK processes preallocate virtual memory at startup. Hugepages can later be mapped into that preal-
located VA space (if dynamic memory mode is enabled), and can optionally be mapped into it at startup.
Note: lcore refers to a logical execution unit of the processor, sometimes called a hardware thread.
Shared variables are the default behavior. Per-lcore variables are implemented using Thread Local
Storage (TLS) to provide per-thread local storage.
3.1.7 Logs
A logging API is provided by EAL. By default, in a Linux application, logs are sent to syslog and
also to the console. However, the log function can be overridden by the user to use a different logging
mechanism.
Note: In DPDK PMD, the only interrupts handled by the dedicated host thread are those for link status
change (link up and link down notification) and for sudden device removal.
• RX Interrupt Event
The receive and transmit routines provided by each PMD are not required to execute in a polling loop. To avoid wasteful idle polling at low throughput, it is useful to pause the polling and wait until a wake-up event happens. The RX interrupt is the first choice for such a wake-up event, but probably will not be the only one.
EAL provides the event APIs for this event-driven thread mode. Taking Linux as an example, the
implementation relies on epoll. Each thread can monitor an epoll instance in which all the wake-up
events’ file descriptors are added. The event file descriptors are created and mapped to the interrupt
vectors according to the UIO/VFIO spec. From FreeBSD’s perspective, kqueue is the alternative way,
but not implemented yet.
EAL initializes the mapping between event file descriptors and interrupt vectors, while each device
initializes the mapping between interrupt vectors and queues. In this way, EAL actually is unaware of
the interrupt cause on the specific vector. The eth_dev driver takes responsibility to program the latter
mapping.
Note: Per-queue RX interrupt events are only allowed in VFIO, which supports multiple MSI-X vectors. In UIO, the RX interrupt shares the same vector with other interrupt causes. In this case, when the RX interrupt and the LSC (link status change) interrupt are both enabled (intr_conf.lsc == 1 && intr_conf.rxq == 1), only the former is usable.
3.1.10 Blacklisting
The EAL PCI device blacklist functionality can be used to mark certain NIC ports as blacklisted, so
they are ignored by the DPDK. The ports to be blacklisted are identified using the PCIe* description
(Domain:Bus:Device.Function).
On FreeBSD, RTE_IOVA_PA is always the default. On Linux, the IOVA mode is detected based on a
2-step heuristic detailed below.
For the first step, EAL asks each bus its requirement in terms of IOVA mode and decides on a preferred
IOVA mode.
• if all buses report RTE_IOVA_PA, then the preferred IOVA mode is RTE_IOVA_PA,
• if all buses report RTE_IOVA_VA, then the preferred IOVA mode is RTE_IOVA_VA,
• if all buses report RTE_IOVA_DC (no bus expressed a preference), then the preferred mode is RTE_IOVA_DC,
• if the buses disagree (at least one wants RTE_IOVA_PA and at least one wants RTE_IOVA_VA),
then the preferred IOVA mode is RTE_IOVA_DC (see below with the check on Physical Addresses
availability),
If the buses have expressed no preference on which IOVA mode to pick, then a default is selected using
the following logic:
• if physical addresses are not available, RTE_IOVA_VA mode is used
• if /sys/kernel/iommu_groups is not empty, RTE_IOVA_VA mode is used
• otherwise, RTE_IOVA_PA mode is used
If the buses disagreed on their preferred IOVA mode, some of them will not work because of this decision.
The second step checks if the preferred mode complies with the Physical Addresses availability since
those are only available to root user in recent kernels. Namely, if the preferred mode is RTE_IOVA_PA
but there is no access to Physical Addresses, then EAL init fails early, since later probing of the devices
would fail anyway.
Note: The RTE_IOVA_VA mode is preferred as the default in most cases for the following reasons:
• All drivers are expected to work in RTE_IOVA_VA mode, irrespective of physical address avail-
ability.
• By default, the mempool first asks for IOVA-contiguous memory using RTE_MEMZONE_IOVA_CONTIG. This is slow in RTE_IOVA_PA mode and may affect the application boot time.
• It is easy to enable large amounts of IOVA-contiguous memory for use cases that need it when running with IOVA in VA mode.
It is expected that all PCI drivers work in both RTE_IOVA_PA and RTE_IOVA_VA modes.
If a PCI driver does not support RTE_IOVA_PA mode, the RTE_PCI_DRV_NEED_IOVA_AS_VA flag
is used to dictate that this PCI driver can only work in RTE_IOVA_VA mode.
When the KNI kernel module is detected, RTE_IOVA_PA mode is preferred as a performance penalty
is expected in RTE_IOVA_VA mode.
Using this option, the associated CPUs can be assigned for each given lcore ID. It is also compatible with the pattern of the corelist ('-l') option.
with non-EAL pthreads, the put/get operations will bypass the default mempool cache and
there is a performance penalty because of this bypass. Only user-owned external caches can
be used in a non-EAL context in conjunction with rte_mempool_generic_put() and
rte_mempool_generic_get() that accept an explicit cache parameter.
• rte_ring
rte_ring supports multi-producer enqueue and multi-consumer dequeue. However, it is non-preemptive; this has the knock-on effect of making rte_mempool non-preemptable as well.
This means that use cases involving preemptible pthreads should consider using rte_ring carefully.
1. It CAN be used for preemptible single-producer and single-consumer use case.
2. It CAN be used for non-preemptible multi-producer and preemptible single-consumer use
case.
3. It CAN be used for preemptible single-producer and non-preemptible multi-consumer use
case.
4. It MAY be used by preemptible multi-producer and/or preemptible multi-consumer pthreads whose scheduling policies are all SCHED_OTHER (cfs), SCHED_IDLE or SCHED_BATCH. Users SHOULD be aware of the performance penalty before using it.
5. It MUST not be used by multi-producer/consumer pthreads whose scheduling policies are SCHED_FIFO or SCHED_RR.
Alternatively, applications can use the lock-free stack mempool handler. When considering this
handler, note that:
– It is currently limited to the aarch64 and x86_64 platforms, because it uses an instruction
(16-byte compare-and-swap) that is not yet available on other platforms.
– It has worse average-case performance than the non-preemptive rte_ring, but software caching (e.g. the mempool cache) can mitigate this by reducing the number of stack accesses.
• rte_timer
Running rte_timer_manage() on a non-EAL pthread is not allowed. However, resetting/stopping the timer from a non-EAL pthread is allowed.
• rte_log
In non-EAL pthreads, there is no per thread loglevel and logtype, global loglevels are used.
• misc
The debug statistics of rte_ring, rte_mempool and rte_timer are not supported in a non-EAL
pthread.
cd /sys/fs/cgroup/cpu/pkt_io
echo 100000 > cpu.cfs_period_us
echo 50000 > cpu.cfs_quota_us
3.4 Malloc
The EAL provides a malloc API to allocate any-sized memory.
The objective of this API is to provide malloc-like functions to allow allocation from hugepage memory
and to facilitate application porting. The DPDK API Reference manual describes the available functions.
Typically, these kinds of allocations should not be done in data plane processing because they are slower
than pool-based allocation and make use of locks within the allocation and free paths. However, they
can be used in configuration code.
Refer to the rte_malloc() function description in the DPDK API Reference manual for more information.
3.4.1 Cookies
When CONFIG_RTE_MALLOC_DEBUG is enabled, the allocated memory contains overwrite protection fields to help identify buffer overflows.
For allocating/freeing data at runtime, in the fast-path of an application, the memory pool library should
be used instead.
Structure: malloc_heap
The malloc_heap structure is used to manage free space on a per-socket basis. Internally, there is one
heap structure per NUMA node, which allows us to allocate memory to a thread based on the NUMA
node on which this thread runs. While this does not guarantee that the memory will be used on that
NUMA node, it is no worse than a scheme where the memory is always allocated on a fixed or random
node.
The key fields of the heap structure and their function are described below (see also diagram above):
• lock - the lock field is needed to synchronize access to the heap. Given that the free space in the
heap is tracked using a linked list, we need a lock to prevent two threads manipulating the list at
the same time.
• free_head - this points to the first element in the list of free nodes for this malloc heap.
• first - this points to the first element in the heap.
• last - this points to the last element in the heap.
Fig. 3.2: Example of a malloc heap and malloc elements within the malloc library
Structure: malloc_elem
The malloc_elem structure is used as a generic header structure for various blocks of memory. It is used in three different ways, all shown in the diagram above:
1. As a header on a block of free or allocated memory (the normal case)
2. As a padding header inside a block
3. As an end-of-memseg marker
Note: If the usage of a particular field in one of the above three usages is not described, the field can be
assumed to have an undefined value in that situation, for example, for padding headers only the “state”
and “pad” fields have valid values.
• heap - this pointer is a reference back to the heap structure from which this block was allocated.
It is used for normal memory blocks when they are being freed, to add the newly-freed block to
the heap’s free-list.
• prev - this pointer points to the previous header element/block in memory. When freeing a block,
this pointer is used to reference the previous block to check if that block is also free. If so, and the
two blocks are immediately adjacent to each other, then the two free blocks are merged to form a
single larger block.
• next - this pointer points to the next header element/block in memory. When freeing a block, it is
used in the same way to check whether the next block is also free, and to merge the two adjacent
free blocks into a single larger block.
• free_list - this is a structure pointing to previous and next elements in this heap’s free list. It is
only used in normal memory blocks; on malloc() to find a suitable free block to allocate and
on free() to add the newly freed element to the free-list.
• state - This field can have one of three values: FREE, BUSY or PAD. The former two are to indicate
the allocation state of a normal memory block and the latter is to indicate that the element structure
is a dummy structure at the end of the start-of-block padding, i.e. where the start of the data within
a block is not at the start of the block itself, due to alignment constraints. In that case, the pad
header is used to locate the actual malloc element header for the block.
• pad - this holds the length of the padding present at the start of the block. In the case of a normal
block header, it is added to the address of the end of the header to give the address of the start of
the data area, i.e. the value passed back to the application on a malloc. Within a dummy header
inside the padding, this same value is stored, and is subtracted from the address of the dummy
header to yield the address of the actual block header.
• size - the size of the data block, including the header itself.
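The header fields above can be pictured with a simplified, self-contained sketch. This is an illustrative layout, not the exact DPDK definition (the real structure also embeds the free_list links and pads the header to a full cache line); it also shows how a PAD dummy header is resolved back to the real element header, as described for the "pad" field:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct malloc_heap;                     /* opaque in this sketch */

enum elem_state { ELEM_FREE, ELEM_BUSY, ELEM_PAD };

/* Simplified malloc_elem sketch (illustrative field layout only). */
struct malloc_elem {
    struct malloc_heap *heap;   /* heap the block was allocated from */
    struct malloc_elem *prev;   /* previous element in memory */
    struct malloc_elem *next;   /* next element in memory */
    enum elem_state state;      /* FREE, BUSY or PAD */
    uint32_t pad;               /* length of start-of-block padding */
    size_t size;                /* block size, including this header */
};

/* Given the data pointer returned to the application, recover the real
 * element header; a PAD dummy header redirects us back by 'pad' bytes. */
static struct malloc_elem *
elem_from_data(void *data)
{
    struct malloc_elem *elem =
        (struct malloc_elem *)((char *)data - sizeof(*elem));
    if (elem->state == ELEM_PAD)
        elem = (struct malloc_elem *)((char *)elem - elem->pad);
    return elem;
}

/* Self-check: build a padded block in a local buffer and verify that
 * elem_from_data() walks back through the PAD header to the real one. */
static int
pad_resolution_works(void)
{
    static union {
        unsigned char bytes[256];
        struct malloc_elem align;       /* for alignment only */
    } blk;
    struct malloc_elem *real = (struct malloc_elem *)blk.bytes;
    struct malloc_elem *pad_hdr = (struct malloc_elem *)(blk.bytes + 64);

    real->state = ELEM_BUSY;
    pad_hdr->state = ELEM_PAD;
    pad_hdr->pad = 64;                  /* distance back to real header */

    return elem_from_data((char *)pad_hdr + sizeof(*pad_hdr)) == real;
}
```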
Memory Allocation
On EAL initialization, all preallocated memory segments are set up as part of the malloc heap. This
setup involves placing an element header with FREE state at the start of each virtually contiguous segment
of memory. The FREE element is then added to the free_list for the malloc heap.
This setup also happens whenever memory is allocated at runtime (if supported), in which case newly
allocated pages are also added to the heap, merging with any adjacent free segments if there are any.
When an application makes a call to a malloc-like function, the malloc function will first index the
lcore_config structure for the calling thread, and determine the NUMA node of that thread. The
NUMA node is used to index the array of malloc_heap structures which is passed as a parameter to
the heap_alloc() function, along with the requested size, type, alignment and boundary parameters.
The heap_alloc() function will scan the free_list of the heap, and attempt to find a free block
suitable for storing data of the requested size, with the requested alignment and boundary constraints.
When a suitable free element has been identified, the pointer to be returned to the user is calculated.
The cache-line of memory immediately preceding this pointer is filled with a struct malloc_elem header.
Because of alignment and boundary constraints, there could be free space at the start and/or end of the
element, resulting in the following behavior:
1. Check for trailing space. If the trailing space is big enough, i.e. > 128 bytes, then the free element
is split. If it is not, then we just ignore it (wasted space).
2. Check for space at the start of the element. If the space at the start is small, i.e. <=128 bytes,
then a pad header is used, and the remaining space is wasted. If, however, the remaining space is
greater, then the free element is split.
The advantage of allocating the memory from the end of the existing element is that no adjustment of
the free list needs to take place - the existing element on the free list just has its size value adjusted, and
the next/previous elements have their “prev”/”next” pointers redirected to the newly created element.
If there is not enough memory in the heap to satisfy an allocation request, EAL will attempt to allocate
more memory from the system (if supported) and, following successful allocation, will retry reserving
the memory. In a multiprocessing scenario, all primary and secondary processes will synchronize their
memory maps to ensure that any valid pointer to DPDK memory is guaranteed to be valid at all times in
all currently running processes.
Failure to synchronize memory maps in one of the processes will cause allocation to fail, even though
some of the processes may have allocated the memory successfully. The memory is not added to the
malloc heap unless the primary process has ensured that all other processes have mapped this memory
successfully.
Any successful allocation event will trigger a callback, for which user applications and other DPDK
subsystems can register. Additionally, validation callbacks will be triggered before allocation if the
newly allocated memory would exceed a threshold set by the user, giving the application a chance to
allow or deny the allocation.
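As a sketch of how such a callback could be registered (the callback name and the registration-name string are invented for the example; note that the memory event callback API is experimental in this release):

```c
#include <stdio.h>
#include <rte_memory.h>

/* Illustrative callback: log every heap expansion/contraction event. */
static void
mem_event_cb(enum rte_mem_event event_type, const void *addr,
             size_t len, void *arg)
{
    (void)arg;
    printf("memory %s: %p, len %zu\n",
           event_type == RTE_MEM_EVENT_ALLOC ? "allocated" : "freed",
           addr, len);
}

static void
register_mem_event_cb(void)
{
    /* Call after rte_eal_init(); the name string only identifies the
     * callback for later unregistration. */
    rte_mem_event_callback_register("example-mem-cb", mem_event_cb, NULL);
}
```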
Note: Any allocation of new pages has to go through the primary process. If the primary process is not
active, no memory will be allocated, even if it were theoretically possible to do so. This is because the
primary process's memory map acts as an authority on what should or should not be mapped, while each
secondary process has its own, local memory map. Secondary processes do not update the shared memory
map; they only copy its contents to their local memory map.
Freeing Memory
To free an area of memory, the pointer to the start of the data area is passed to the free function. The size
of the malloc_elem structure is subtracted from this pointer to get the element header for the block.
If this header is of type PAD then the pad length is further subtracted from the pointer to get the proper
element header for the entire block.
From this element header, we get pointers to the heap from which the block was allocated and to where it
must be freed, as well as the pointer to the previous and next elements. These next and previous elements
are then checked to see if they are also FREE and are immediately adjacent to the current one, and if so,
they are merged with the current element. This means that we can never have two FREE memory blocks
adjacent to one another, as they are always merged into a single block.
If deallocating pages at runtime is supported, and the free element encloses one or more pages, those
pages can be deallocated and be removed from the heap. If DPDK was started with command-line
parameters for preallocating memory (-m or --socket-mem), then those pages that were allocated at
startup will not be deallocated.
Any successful deallocation event will trigger a callback, for which user applications and other DPDK
subsystems can register.
CHAPTER
FOUR
SERVICE CORES
DPDK has a concept known as service cores, which enables a dynamic way of performing work on
DPDK lcores. Service core support is built into the EAL, and an API is provided to optionally allow
applications to control how the service cores are used at runtime.
The service cores concept is built up out of services (components of DPDK that require CPU cycles to
operate) and service cores (DPDK lcores, tasked with running services). The power of the service core
concept is that the mapping between service cores and services can be configured to abstract away the
difference between platforms and environments.
For example, the Eventdev has hardware and software PMDs. Of these the software PMD requires an
lcore to perform the scheduling operations, while the hardware PMD does not. With service cores, the
application would not directly notice that the scheduling is done in software.
For detailed information about the service core API, please refer to the docs.
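As a sketch of that runtime control (the service name and lcore number passed in would come from the application; the helper function itself is invented for this example):

```c
#include <rte_service.h>

/* Illustrative helper: turn an lcore into a service core, map one
 * service to it, and start both. Assumes rte_eal_init() has completed
 * and that the service has been registered under service_name. */
static int
setup_service_core(const char *service_name, uint32_t lcore_id)
{
    uint32_t service_id;

    if (rte_service_get_by_name(service_name, &service_id) != 0)
        return -1;

    if (rte_service_lcore_add(lcore_id) != 0)
        return -1;
    if (rte_service_map_lcore_set(service_id, lcore_id, 1) != 0)
        return -1;
    if (rte_service_runstate_set(service_id, 1) != 0)
        return -1;

    /* The service core now loops, running the mapped service. */
    return rte_service_lcore_start(lcore_id);
}
```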
CHAPTER
FIVE
RCU LIBRARY
Lockless data structures provide scalability and determinism. They enable use cases where locking may
not be allowed (for example real-time applications).
In the following sections, the term “memory” refers to memory allocated by typical APIs like malloc()
or anything that is representative of memory, for example an index of a free element array.
Since these data structures are lockless, the writers and readers are accessing the data structures concur-
rently. Hence, while removing an element from a data structure, the writers cannot return the memory
to the allocator, without knowing that the readers are not referencing that element/memory anymore.
Hence, it is required to separate the operation of removing an element into two steps:
1. Delete: in this step, the writer removes the reference to the element from the data structure but
does not return the associated memory to the allocator. This will ensure that new readers will not
get a reference to the removed element. Removing the reference is an atomic operation.
2. Free (Reclaim): in this step, the writer returns the memory to the memory allocator only after
knowing that all the readers have stopped referencing the deleted element.
This library helps the writer determine when it is safe to free the memory by making use of thread
Quiescent State (QS).
[Figure: timeline of reader threads T1-T3 running while(1) loops over data structures D1 and D2, alternating critical sections and quiescent states. The grace period runs from the deletion of entry1 from D1 until every reader has passed through a quiescent state; the "Free" marker is the point in time when the writer can free the deleted entry.]
The writer thread can trigger the reader threads to report their quiescent state by calling the API
rte_rcu_qsbr_start(). It is possible for multiple writer threads to query the quiescent state
status simultaneously. Hence, rte_rcu_qsbr_start() returns a token to each caller.
The writer thread must call the rte_rcu_qsbr_check() API with the token to get the current quiescent
state status. An option to block until all the reader threads enter the quiescent state is provided. If this
API indicates that all the reader threads have entered the quiescent state, the application can free the
deleted entry.
The APIs rte_rcu_qsbr_start() and rte_rcu_qsbr_check() are lock free. Hence, they
can be called concurrently from multiple writers even while running as worker threads.
The separation of triggering the reporting from querying the status provides the writer threads flexibility
to do useful work instead of blocking for the reader threads to enter the quiescent state or go offline. This
reduces the memory accesses due to continuous polling for the status. But, since the resource is freed at
a later time, the token and the reference to the deleted resource need to be stored for later queries.
The rte_rcu_qsbr_synchronize() API combines the functionality of
rte_rcu_qsbr_start() and blocking rte_rcu_qsbr_check() into a single API. This
API triggers the reader threads to report their quiescent state and polls till all the readers enter the
quiescent state or go offline. This API does not allow the writer to do useful work while waiting and
introduces additional memory accesses due to continuous polling. However, the application does not
have to store the token or the reference to the deleted resource. The resource can be freed immediately
after rte_rcu_qsbr_synchronize() API returns.
The reader thread must call rte_rcu_qsbr_thread_offline() and
rte_rcu_qsbr_thread_unregister() APIs to remove itself from reporting its quies-
cent state. The rte_rcu_qsbr_check() API will not wait for this reader thread to report the
quiescent state status anymore.
The reader threads should call the rte_rcu_qsbr_quiescent() API to indicate that they have entered
a quiescent state. This API checks if a writer has triggered a quiescent state query and updates the state
accordingly.
The rte_rcu_qsbr_lock() and rte_rcu_qsbr_unlock() are empty functions. How-
ever, when CONFIG_RTE_LIBRTE_RCU_DEBUG is enabled, these APIs aid in debugging issues.
One can mark the access to shared data structures on the reader side using these APIs. The
rte_rcu_qsbr_quiescent() will check if all the locks are unlocked.
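Pulling the above together, a minimal QSBR sketch might look like this (thread creation, reader registration via rte_rcu_qsbr_thread_register()/rte_rcu_qsbr_thread_online(), the protected data structure, and error handling are elided; the helper names are invented for the example):

```c
#include <stdbool.h>
#include <rte_common.h>
#include <rte_malloc.h>
#include <rte_rcu_qsbr.h>

/* Illustrative setup: allocate and initialize a QSBR variable sized
 * for max_threads reader threads. */
static struct rte_rcu_qsbr *
qsbr_alloc(uint32_t max_threads)
{
    size_t sz = rte_rcu_qsbr_get_memsize(max_threads);
    struct rte_rcu_qsbr *v = rte_zmalloc(NULL, sz, RTE_CACHE_LINE_SIZE);

    if (v != NULL && rte_rcu_qsbr_init(v, max_threads) != 0) {
        rte_free(v);
        v = NULL;
    }
    return v;
}

/* Reader side: report a quiescent state once per polling iteration. */
static void
reader_loop_iteration(struct rte_rcu_qsbr *v, unsigned int thread_id)
{
    /* ... read-side access to the shared data structure ... */
    rte_rcu_qsbr_quiescent(v, thread_id);
}

/* Writer side: after removing the element's reference, wait for the
 * grace period to end before reclaiming the memory. */
static void
writer_delete_then_free(struct rte_rcu_qsbr *v, void *removed_elem)
{
    uint64_t token = rte_rcu_qsbr_start(v);

    /* ... other useful work can be done here instead of blocking ... */

    rte_rcu_qsbr_check(v, token, true);  /* block until readers pass */
    rte_free(removed_elem);              /* now safe to reclaim */
}
```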
CHAPTER
SIX
RING LIBRARY
The ring allows the management of queues. Instead of having a linked list of infinite size, the rte_ring
has the following properties:
• FIFO
• Maximum size is fixed, the pointers are stored in a table
• Lockless implementation
• Multi-consumer or single-consumer dequeue
• Multi-producer or single-producer enqueue
• Bulk dequeue - Dequeues the specified count of objects if successful; otherwise fails
• Bulk enqueue - Enqueues the specified count of objects if successful; otherwise fails
• Burst dequeue - Dequeue the maximum available objects if the specified count cannot be fulfilled
• Burst enqueue - Enqueue the maximum available objects if the specified count cannot be fulfilled
The advantages of this data structure over a linked list queue are as follows:
• Faster; only requires a single Compare-And-Swap instruction of sizeof(void *) instead of several
double-Compare-And-Swap instructions.
• Simpler than a full lockless queue.
• Adapted to bulk enqueue/dequeue operations. As pointers are stored in a table, a dequeue of
several objects will not produce as many cache misses as in a linked queue. Also, a bulk dequeue
of many objects does not cost more than a dequeue of a simple object.
The disadvantages:
• Size is fixed
• Having many rings costs more in terms of memory than a linked list queue. An empty ring
contains at least N pointers.
A simplified representation of a ring, with consumer and producer head and tail pointers to objects stored
in the data structure, is shown below.
If there are not enough objects in the ring (this is detected by checking prod_tail), it returns an error.

Fig. 6.6: Dequeue second step

[The accompanying figures, showing the local copies of the cons_head/cons_tail and prod_head/prod_tail indexes versus the ring structure state at each enqueue and dequeue step, are not reproduced here.]
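The enqueue and dequeue operations described above are exposed through the public API. A hedged sketch of their use (the ring name, size and flags shown are illustrative choices, not requirements of the library):

```c
#include <rte_ring.h>

/* Illustrative single-producer/single-consumer ring usage. */
static void
ring_example(void)
{
    void *objs[32] = { 0 };    /* would hold application pointers */
    unsigned int n;

    /* Size must be a power of two. RING_F_SP_ENQ | RING_F_SC_DEQ
     * selects the single-producer/single-consumer variants. */
    struct rte_ring *r = rte_ring_create("example_ring", 1024,
                                         SOCKET_ID_ANY,
                                         RING_F_SP_ENQ | RING_F_SC_DEQ);
    if (r == NULL)
        return;

    /* Burst enqueue: returns how many objects were actually queued. */
    n = rte_ring_enqueue_burst(r, objs, 32, NULL);

    /* Burst dequeue: returns up to 32 objects, fewer if unavailable. */
    n = rte_ring_dequeue_burst(r, objs, 32, NULL);
    (void)n;

    rte_ring_free(r);
}
```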
Note: To simplify the explanation, operations with modulo 16-bit are used instead of modulo 32-bit. In
addition, the four indexes are defined as unsigned 16-bit integers, as opposed to unsigned 32-bit integers
in the more realistic case.
The following two examples show possible values of the ring indexes (prod_head, prod_tail, cons_head,
cons_tail - abbreviated ph, pt, ch, ct below) for a ring of size 16384 (mask = 16383):

Example 1: ph = pt = 14000 and ct = ch = 3000 (11000 used entries in the ring):
used_entries = (pt - ch) % 65536 = 11000
free_entries = (mask + ct - ph) % 65536 = 5383

Example 2, after the indexes have wrapped past 65536: ph = pt = 6000 and ct = ch = 59000:
used_entries = (pt - ch) % 65536 = 12536
free_entries = (mask + ct - ph) % 65536 = 3847
Note: For ease of understanding, we use modulo 65536 operations in the above examples. In real
execution, an explicit modulo operation would be redundant and inefficient; the wrap-around happens
automatically when the unsigned result overflows.
The code always maintains a distance between producer and consumer between 0 and size(ring)-1.
Thanks to this property, we can do subtractions between two index values in modulo-32-bit arithmetic:
that's why the overflow of the indexes is not a problem.
At any time, used_entries and free_entries are between 0 and size(ring)-1, even if only the first term of
the subtraction has overflowed.
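The wrap-around behaviour can be demonstrated with plain unsigned arithmetic. The helpers below are invented for this illustration and use the 16-bit indexes of the examples above (real rings use 32-bit indexes, but the math is identical modulo 2^32):

```c
#include <assert.h>
#include <stdint.h>

/* used_entries = (prod_tail - cons_head) mod 65536: the unsigned
 * subtraction wraps automatically when the indexes overflow. */
static uint16_t
used_entries(uint16_t prod_tail, uint16_t cons_head)
{
    return (uint16_t)(prod_tail - cons_head);
}

/* free_entries = (mask + cons_tail - prod_head) mod 65536. */
static uint16_t
free_entries(uint16_t mask, uint16_t cons_tail, uint16_t prod_head)
{
    return (uint16_t)(mask + cons_tail - prod_head);
}
```

With the values from the two examples above, these helpers reproduce 11000/5383 and, after the indexes wrap, 12536/3847.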
6.6 References
• bufring.h in FreeBSD (version 8)
• bufring.c in FreeBSD (version 8)
• Linux Lockless Ring Buffer Design
CHAPTER
SEVEN
STACK LIBRARY
DPDK’s stack library provides an API for configuration and use of a bounded stack of pointers.
The stack library provides the following basic operations:
• Create a uniquely named stack of a user-specified size and using a user-specified socket, with
either standard (lock-based) or lock-free behavior.
• Push and pop a burst of one or more stack objects (pointers). These functions are multi-thread
safe.
• Free a previously created stack.
• Lookup a pointer to a stack by its name.
• Query a stack’s current depth and number of free entries.
7.1 Implementation
The library supports two types of stacks: standard (lock-based) and lock-free. Both types use the same
set of interfaces, but their implementations differ.
In the lock-free implementation, the stack is built from a linked list of elements. The linked list elements
themselves are maintained in a lock-free LIFO, and are allocated before stack pushes and freed after stack
pops. Since the stack has a fixed maximum depth, these elements do not need to be dynamically created.
The lock-free behavior is selected by passing the RTE_STACK_F_LF flag to rte_stack_create().
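A hedged sketch of the basic operations listed above (the stack name and sizes are illustrative; pass RTE_STACK_F_LF as the last argument to rte_stack_create() to select the lock-free variant):

```c
#include <rte_stack.h>

/* Illustrative stack usage: create, push/pop a burst, free. */
static void
stack_example(void)
{
    void *objs[8] = { 0 };    /* would hold application pointers */

    struct rte_stack *s = rte_stack_create("example_stack", 1024,
                                           SOCKET_ID_ANY, 0);
    if (s == NULL)
        return;

    /* Both calls return the number of objects actually pushed/popped
     * (all-or-nothing per burst). */
    rte_stack_push(s, objs, 8);
    rte_stack_pop(s, objs, 8);

    rte_stack_free(s);
}
```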
CHAPTER
EIGHT
MEMPOOL LIBRARY
A memory pool is an allocator of fixed-size objects. In the DPDK, it is identified by name and uses
a mempool handler to store free objects. The default mempool handler is ring based. It provides some
other optional services such as a per-core object cache and an alignment helper to ensure that objects are
padded to spread them equally across all DRAM or DDR3 channels.
This library is used by the Mbuf Library.
8.1 Cookies
In debug mode (CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG is enabled), cookies are added at the
beginning and end of allocated blocks. The allocated objects then contain overwrite protection fields to
help debugging buffer overflows.
8.2 Stats
In debug mode (CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG is enabled), statistics about get
from/put in the pool are stored in the mempool structure. Statistics are per-lcore to avoid concurrent
access to statistics counters.
Note: The command line must always have the number of memory channels specified for the processor.
Examples of alignment for different DIMM architectures are shown in Fig. 8.1 and Fig. 8.2.
[Figs. 8.1 and 8.2: layout of objects across memory channel/rank indexes (0 through F), showing packet 1, padding, then packet 2, so that consecutive objects start on different channels.]
In this case, the assumption is that a packet is 16 blocks of 64 bytes, which is not true.
The Intel® 5520 chipset has three channels, so in most cases, no padding is required between objects
(except for objects whose size are n x 3 x 64 bytes blocks).
When creating a new pool, the user can specify to use this feature or not.
Alternatively to the internal default per-lcore local cache, an application can create and manage ex-
ternal caches through the rte_mempool_cache_create(), rte_mempool_cache_free()
and rte_mempool_cache_flush() calls. These user-owned caches can be explicitly
passed to rte_mempool_generic_put() and rte_mempool_generic_get(). The
rte_mempool_default_cache() call returns the default internal cache if any. In contrast to
the default caches, user-owned caches can be used by non-EAL threads too.
Note: When running a DPDK application with shared libraries, mempool handler shared
objects specified with the ‘-d’ EAL command-line parameter are dynamically loaded. When
running a multi-process application with shared libraries, the -d arguments for mempool
handlers must be specified in the same order for all processes to ensure correct operation.
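A hedged sketch of basic mempool usage (the pool name, element size and counts are illustrative; the NULL arguments skip the optional pool/object constructors, which raw buffers do not need):

```c
#include <rte_mempool.h>

/* Illustrative mempool usage: create a pool, get and put one object. */
static void
mempool_example(void)
{
    void *obj;

    /* 8191 objects of 256 bytes each, with a 32-object per-lcore
     * cache. */
    struct rte_mempool *mp = rte_mempool_create("example_pool", 8191,
            256, 32, 0, NULL, NULL, NULL, NULL, SOCKET_ID_ANY, 0);
    if (mp == NULL)
        return;

    if (rte_mempool_get(mp, &obj) == 0) {
        /* ... use obj ... */
        rte_mempool_put(mp, obj);
    }

    rte_mempool_free(mp);
}
```

A pool size of 2^n - 1 (here 8191) is the commonly recommended choice for the default ring-based handler.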
CHAPTER
NINE
MBUF LIBRARY
The mbuf library provides the ability to allocate and free buffers (mbufs) that may be used by the DPDK
application to store message buffers. The message buffers are stored in a mempool, using the Mempool
Library.
A rte_mbuf struct generally carries network packet buffers, but it can actually be any data (control data,
events, ...). The rte_mbuf header structure is kept as small as possible and currently uses just two cache
lines, with the most frequently used fields being on the first of the two cache lines.
[Figure: an mbuf with one segment: the struct rte_mbuf header, followed by headroom, the data area and tailroom. m->buf_addr is the start of the buffer (m->buf_iova is the corresponding physical address) and rte_pktmbuf_mtod(m) returns the start of the data.]

[Figure: a multi-segmented rte_mbuf chain m, mseg2, mseg3, where rte_pktmbuf_pktlen(m) = rte_pktmbuf_datalen(m) + rte_pktmbuf_datalen(mseg2) + rte_pktmbuf_datalen(mseg3).]
9.3 Constructors
Packet mbuf constructors are provided by the API. The rte_pktmbuf_init() function initializes some
fields in the mbuf structure that are not modified by the user once created (mbuf type, origin pool, buffer
start address, and so on). This function is given as a callback function to the rte_mempool_create()
function at pool creation time.
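For packet mbufs, the usual entry point is rte_pktmbuf_pool_create(), which wraps mempool creation and wires up those constructors. A hedged sketch (pool name and parameters are illustrative):

```c
#include <rte_mbuf.h>

/* Illustrative mbuf allocation from a dedicated packet mbuf pool. */
static struct rte_mbuf *
make_mbuf(void)
{
    /* 8191 mbufs, 256-object per-lcore cache, no application private
     * area, default data room (headroom + buffer). */
    struct rte_mempool *mp = rte_pktmbuf_pool_create("mbuf_pool", 8191,
            256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, SOCKET_ID_ANY);
    if (mp == NULL)
        return NULL;

    struct rte_mbuf *m = rte_pktmbuf_alloc(mp);
    if (m != NULL) {
        /* Reserve 64 bytes of data space from the tailroom. */
        if (rte_pktmbuf_append(m, 64) == NULL) {
            rte_pktmbuf_free(m);
            m = NULL;
        }
    }
    return m;
}
```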
On the RX side, for instance, the hardware can report information to the application through the mbuf:
this is the case for the IEEE 1588 packet timestamp mechanism, VLAN tagging and IP checksum
computation.
On TX side, it is also possible for an application to delegate some processing to the hardware if it
supports it. For instance, the PKT_TX_IP_CKSUM flag allows the application to offload the computation
of the IPv4 checksum.
The following examples explain how to configure different TX offloads on a vxlan-encapsulated tcp
packet: out_eth/out_ip/out_udp/vxlan/in_eth/in_ip/in_tcp/payload
• calculate checksum of out_ip:
mb->l2_len = len(out_eth)
mb->l3_len = len(out_ip)
mb->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM
set out_ip checksum to 0 in the packet
This is similar to case 1), but l2_len is different. It is supported on hardware advertising
DEV_TX_OFFLOAD_IPV4_CKSUM. Note that it can only work if outer L4 checksum is 0.
• calculate checksum of in_ip and in_tcp:
mb->l2_len = len(out_eth + out_ip + out_udp + vxlan + in_eth)
mb->l3_len = len(in_ip)
mb->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM
set in_ip checksum to 0 in the packet
set in_tcp checksum to pseudo header using rte_ipv4_phdr_cksum()
This is similar to case 2), but l2_len is different. It is supported on hardware advertising
DEV_TX_OFFLOAD_IPV4_CKSUM and DEV_TX_OFFLOAD_TCP_CKSUM. Note that it
can only work if outer L4 checksum is 0.
• segment inner TCP:
mb->l2_len = len(out_eth + out_ip + out_udp + vxlan + in_eth)
mb->l3_len = len(in_ip)
mb->l4_len = len(in_tcp)
mb->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM |
PKT_TX_TCP_SEG;
set in_ip checksum to 0 in the packet
set in_tcp checksum to pseudo header without including the IP
payload length using rte_ipv4_phdr_cksum()
9.8 Debug
In debug mode (CONFIG_RTE_MBUF_DEBUG is enabled), the functions of the mbuf library perform
sanity checks before any operation (such as, buffer corruption, bad type, and so on).
CHAPTER
TEN
POLL MODE DRIVER
The DPDK includes 1 Gigabit, 10 Gigabit and 40 Gigabit Poll Mode Drivers, as well as a paravirtualized
virtio Poll Mode Driver.
A Poll Mode Driver (PMD) consists of APIs, provided through the BSD driver running in user space,
to configure the devices and their respective queues. In addition, a PMD accesses the RX and TX de-
scriptors directly without any interrupts (with the exception of Link Status Change interrupts) to quickly
receive, process and deliver packets in the user’s application. This section describes the requirements of
the PMDs, their global design principles and proposes a high-level architecture and a generic external
API for the Ethernet PMDs.
To avoid any unnecessary interrupt processing overhead, the execution environment must not use any
asynchronous notification mechanisms. Whenever needed and appropriate, asynchronous communication
should instead be introduced through the use of rings.
Avoiding lock contention is a key issue in a multi-core environment. To address this issue, PMDs are
designed to work with per-core private resources as much as possible. For example, a PMD maintains
a separate transmit queue per-core, per-port, if the PMD is not DEV_TX_OFFLOAD_MT_LOCKFREE
capable. In the same way, every receive queue of a port is assigned to and polled by a single logical core
(lcore).
To comply with Non-Uniform Memory Access (NUMA), memory management is designed to assign to
each logical core a private buffer pool in local memory to minimize remote memory access. The con-
figuration of packet buffer pools should take into account the underlying physical memory architecture
in terms of DIMMS, channels and ranks. The application must ensure that appropriate parameters are
given at memory pool creation time. See Mempool Library.
Burst-oriented functions are also introduced via the API for services that are intensively used by the
PMD. This applies in particular to buffer allocators used to populate NIC rings, which provide func-
tions to allocate/free several buffers at a time. For example, an mbuf_multiple_alloc function returning
an array of pointers to rte_mbuf buffers which speeds up the receive poll function of the PMD when
replenishing multiple descriptors of the receive ring.
Note: It is the DPDK entity's responsibility to set the port owner before using it and to manage the port
usage synchronization between different threads or processes.

• The minimum transmit packets to free threshold (tx_free_thresh). A value of 0 can be passed during
the TX queue configuration to indicate that the default value should be used. The default value for
tx_free_thresh is 32. This ensures that the PMD does not search for completed descriptors until at
least 32 have been processed by the NIC for this queue.
• The minimum RS bit threshold. The minimum number of transmit descriptors to use before setting
the Report Status (RS) bit in the transmit descriptor. Note that this parameter may only be valid for
Intel 10 GbE network adapters. The RS bit is set on the last descriptor used to transmit a packet
if the number of descriptors used since the last RS bit setting, up to the first descriptor used to
transmit the packet, exceeds the transmit RS bit threshold (tx_rs_thresh). In short, this parameter
controls which transmit descriptors are written back to host memory by the network adapter. A
value of 0 can be passed during the TX queue configuration to indicate that the default value
should be used. The default value for tx_rs_thresh is 32. This ensures that at least 32 descriptors
are used before the network adapter writes back the most recently used descriptor. This saves
upstream PCIe* bandwidth resulting from TX descriptor write-backs. It is important to note that
the TX Write-back threshold (TX wthresh) should be set to 0 when tx_rs_thresh is greater than 1.
Refer to the Intel® 82599 10 Gigabit Ethernet Controller Datasheet for more details.
The following constraints must be satisfied for tx_free_thresh and tx_rs_thresh:
• tx_rs_thresh must be greater than 0.
• tx_rs_thresh must be less than the size of the ring minus 2.
• tx_rs_thresh must be less than or equal to tx_free_thresh.
• tx_free_thresh must be greater than 0.
• tx_free_thresh must be less than the size of the ring minus 3.
• For optimal performance, TX wthresh should be set to 0 when tx_rs_thresh is greater than 1.
One descriptor in the TX ring is used as a sentinel to avoid a hardware race condition, hence the maxi-
mum threshold constraints.
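As a hedged sketch of applying the thresholds discussed above (the descriptor count and helper name are illustrative; starting from the PMD's default_txconf is a common pattern, not a requirement):

```c
#include <rte_ethdev.h>

/* Illustrative TX queue setup with explicit thresholds; the values
 * shown are the defaults discussed above. */
static int
setup_tx_queue(uint16_t port_id, uint16_t queue_id, int socket_id)
{
    struct rte_eth_dev_info dev_info;
    struct rte_eth_txconf txconf;

    if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
        return -1;
    txconf = dev_info.default_txconf;   /* start from PMD defaults */

    txconf.tx_rs_thresh = 32;       /* set RS bit every 32 descriptors */
    txconf.tx_free_thresh = 32;     /* free completed mbufs in batches */
    txconf.tx_thresh.wthresh = 0;   /* required when tx_rs_thresh > 1 */

    return rte_eth_tx_queue_setup(port_id, queue_id, 512,
                                  socket_id, &txconf);
}
```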
Note: When configuring for DCB operation, at port initialization, both the number of transmit queues
and the number of receive queues must be set to 128.
• In some applications, mbufs are passed between interfaces, but a packet copy can be avoided. This
API is independent of whether the packet was transmitted or dropped; it only indicates that the
mbuf is no longer in use by the interface.
• Some applications are designed to make multiple runs, like a packet generator. For performance
reasons and consistency between runs, the application may want to reset back to an initial state
between each run, where all mbufs are returned to the mempool. In this case, it can call the
rte_eth_tx_done_cleanup() API for each destination interface it has been using to re-
quest it to release of all its used mbufs.
To determine if a driver supports this API, check for the Free Tx mbuf on demand feature in the Network
Interface Controller Drivers document.
Note: PMDs are not required to support the standard device arguments, and users should consult the
relevant PMD documentation to see which devargs are supported.
API Design
The xstats API uses the name, id, and value to allow performant lookup of specific statistics. Performant
lookup means two things:
• No string comparisons with the name of the statistic in fast-path
• Allow requesting of only the statistics of interest
The API ensures these requirements are met by mapping the name of the statistic to a unique id, which
is used as a key for lookup in the fast-path. The API allows applications to request an array of id values,
so that the PMD only performs the required calculations. Expected usage is that the application scans
the name of each statistic, and caches the id if it has an interest in that statistic. On the fast-path, the
integer can be used to retrieve the actual value of the statistic that the id represents.
API Functions
The API is built out of a small number of functions, which can be used to retrieve the number of statistics
and the names, IDs and values of those statistics.
• rte_eth_xstats_get_names_by_id(): returns the names of the statistics. When given
a NULL parameter the function returns the number of statistics that are available.
• rte_eth_xstats_get_id_by_name(): Searches for the statistic ID that matches
xstat_name. If found, the id integer is set.
• rte_eth_xstats_get_by_id(): Fills in an array of uint64_t values matching the
provided ids array. If the ids array is NULL, it returns all statistics that are available.
Application Usage
Imagine an application that wants to view the dropped packet count. If no packets are dropped, the appli-
cation does not read any other metrics for performance reasons. If packets are dropped, the application
has a particular set of statistics that it requests. This “set” of statistics allows the app to decide what next
steps to perform. The following code-snippets show how the xstats API can be used to achieve this goal.
First step is to get all statistics names and list them:
struct rte_eth_xstat_name *xstats_names;
uint64_t *values;
int len, i;

/* Get the number of statistics available */
len = rte_eth_xstats_get_names_by_id(port_id, NULL, NULL, 0);
if (len < 0) {
    printf("Cannot get xstats count\n");
    goto err;
}
xstats_names = malloc(sizeof(struct rte_eth_xstat_name) * len);

/* Retrieve xstats names, passing NULL for IDs to return all statistics */
if (len != rte_eth_xstats_get_names_by_id(port_id, xstats_names, NULL, len)) {
    printf("Cannot get xstat names\n");
    goto err;
}
The application has access to the names of all of the statistics that the PMD exposes. The application
can decide which statistics are of interest, and cache the ids of those statistics by looking up the name as
follows:
uint64_t id;
uint64_t value;
const char *xstat_name = "rx_errors";

if (rte_eth_xstats_get_id_by_name(port_id, xstat_name, &id) == 0) {
    /* Fast-path: retrieve the single statistic by its cached id */
    rte_eth_xstats_get_by_id(port_id, &id, &value, 1);
}
The API provides flexibility to the application so that it can look up multiple statistics using an array
containing multiple id numbers. This reduces the function call overhead of retrieving statistics, and
makes lookup of multiple statistics simpler for the application.
#define APP_NUM_STATS 4
/* application cached these ids previously; see above */
uint64_t ids_array[APP_NUM_STATS] = {3,4,7,21};
uint64_t value_array[APP_NUM_STATS];
uint32_t i;

if (rte_eth_xstats_get_by_id(port_id, ids_array, value_array,
                             APP_NUM_STATS) != APP_NUM_STATS) {
    printf("Failed to get xstats\n");
    goto err;
}

for (i = 0; i < APP_NUM_STATS; i++) {
    printf("%"PRIu64": %"PRIu64"\n", ids_array[i], value_array[i]);
}
This array lookup API for xstats allows the application to create multiple "groups" of statistics, and look
up the values of those IDs using a single API call. As an end result, the application is able to achieve its
goal of monitoring a single statistic ("rx_errors" in this case), and if that shows packets being dropped, it
can easily retrieve a "set" of statistics using the IDs array parameter to the rte_eth_xstats_get_by_id
function.
Sometimes a port has to be reset passively. For example, when a PF is reset, all its VFs should also be
reset by the application to make them consistent with the PF. A DPDK application can also call this
function to trigger a port reset. Normally, a DPDK application would invoke this function when an
RTE_ETH_EVENT_INTR_RESET event is detected.
It is the duty of the PMD to trigger RTE_ETH_EVENT_INTR_RESET events and the application should
register a callback function to handle these events. When a PMD needs to trigger a reset, it can trigger an
RTE_ETH_EVENT_INTR_RESET event. On receiving an RTE_ETH_EVENT_INTR_RESET event,
applications can handle it as follows: Stop working queues, stop calling Rx and Tx functions, and then
call rte_eth_dev_reset(). For thread safety all these operations should be called from the same thread.
For example, when a PF is reset, the PF sends a message to notify the VFs of this event and also triggers
an interrupt to the VFs. Then, in the interrupt service routine, each VF detects this notification message
and calls _rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_INTR_RESET, NULL). This means
that a PF reset triggers an RTE_ETH_EVENT_INTR_RESET event within the VFs. The function
_rte_eth_dev_callback_process() calls the registered callback function, which can then trigger the
application to handle all operations the VF reset requires, including stopping Rx/Tx queues and calling
rte_eth_dev_reset().
rte_eth_dev_reset() itself is a generic function which only performs some hardware reset operations by
calling dev_uninit() and dev_init(); it does not handle synchronization, which is left to the application.
The PMD itself should not call rte_eth_dev_reset(); it can only trigger the application to handle the reset
event. It is the duty of the application to handle all synchronization before it calls rte_eth_dev_reset().
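The sequence above can be sketched in application code as follows. This is an illustrative sketch only: the reset_requested flag and the helper function names are assumptions, and reconfiguration after the reset is abbreviated.

```c
#include <rte_ethdev.h>

/* Set by the event callback, polled by the datapath thread (illustrative). */
static volatile int reset_requested;

static int
reset_event_cb(uint16_t port_id, enum rte_eth_event_type event,
               void *cb_arg, void *ret_param)
{
    (void)port_id; (void)cb_arg; (void)ret_param;
    if (event == RTE_ETH_EVENT_INTR_RESET)
        reset_requested = 1;  /* picked up by the datapath thread */
    return 0;
}

/* Called from the datapath thread once it has stopped calling Rx/Tx,
 * so that all reset operations happen in the same thread. */
static void
handle_reset(uint16_t port_id)
{
    rte_eth_dev_stop(port_id);              /* stop working queues */
    if (rte_eth_dev_reset(port_id) == 0) {
        /* reconfigure with rte_eth_dev_configure(), set up queues,
         * then restart with rte_eth_dev_start() */
    }
    reset_requested = 0;
}

static void
register_reset_handler(uint16_t port_id)
{
    rte_eth_dev_callback_register(port_id, RTE_ETH_EVENT_INTR_RESET,
                                  reset_event_cb, NULL);
}
```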
Generic flow API (rte_flow)
11.1 Overview
This API provides a generic means to configure hardware to match specific ingress or egress traffic, alter
its fate and query related counters according to any number of user-defined rules.
It is named rte_flow after the prefix used for all its symbols, and is defined in rte_flow.h.
• Matching can be performed on packet data (protocol headers, payload) and properties (e.g. asso-
ciated physical port, virtual device function ID).
• Possible operations include dropping traffic, diverting it to specific queues, to virtual/physical
device functions or ports, performing tunnel offloads, adding marks and so on.
It is slightly higher-level than the legacy filtering framework which it encompasses and supersedes (in-
cluding all functions and filter types) in order to expose a single interface with an unambiguous behavior
that is common to all poll-mode drivers (PMDs).
Flow rules can also be grouped; flow rule priority is specific to the group they belong to. All flow
rules in a given group are thus processed within the context of that group. Groups are not linked by
default, so the logical hierarchy of groups must be explicitly defined by flow rules themselves in each
group using the JUMP action to define the next group to redirect to. Only flow rules defined in the
default group 0 are guaranteed to be matched against, which makes group 0 the origin of any group
hierarchy defined by an application.
Support for multiple actions per rule may be implemented internally on top of non-default hardware
priorities; as a result, both features may not be simultaneously available to applications.
Considering that allowed pattern/actions combinations cannot be known in advance and would result in
an impractically large number of capabilities to expose, a method is provided to validate a given rule
from the current device configuration state.
This enables applications to check whether the rule types they need are supported at initialization time,
before starting their data path. This method can be used at any time; its only requirement is that the
resources needed by a rule already exist (e.g. a target Rx queue should be configured first).
Each defined rule is associated with an opaque handle managed by the PMD, which the application is
responsible for keeping. These handles can be used for queries and rule management, such as retrieving
counters or other data and destroying the rules.
To avoid resource leaks on the PMD side, handles must be explicitly destroyed by the application before
releasing associated resources such as queues and ports.
The following sections cover:
• Attributes (represented by struct rte_flow_attr): properties of a flow rule such as its
direction (ingress or egress) and priority.
• Pattern item (represented by struct rte_flow_item): part of a matching pattern that ei-
ther matches specific packet data or traffic properties. It can also describe properties of the pattern
itself, such as inverted matching.
• Matching pattern: traffic properties to look for, a combination of any number of items.
• Actions (represented by struct rte_flow_action): operations to perform whenever a
packet is matched by a pattern.
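Putting these pieces together, a complete rule combines attributes, a pattern and an action list. The sketch below (not part of the original text; the rule and function name are illustrative) validates and then creates a simple rule dropping all ingress IPv4 traffic:

```c
#include <rte_flow.h>

/* Sketch: drop all ingress IPv4 traffic on port_id's default group.
 * Error handling is abbreviated; items without a spec match any packet
 * of that type. */
static struct rte_flow *
create_drop_ipv4_rule(uint16_t port_id, struct rte_flow_error *error)
{
    struct rte_flow_attr attr = { .ingress = 1 };
    struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH },
        { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
        { .type = RTE_FLOW_ITEM_TYPE_END },   /* pattern terminator */
    };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_DROP },
        { .type = RTE_FLOW_ACTION_TYPE_END }, /* action list terminator */
    };

    /* Check the rule against the current device configuration first. */
    if (rte_flow_validate(port_id, &attr, pattern, actions, error) != 0)
        return NULL;
    return rte_flow_create(port_id, &attr, pattern, actions, error);
}
```

The returned handle must be kept by the application and destroyed with rte_flow_destroy() before releasing the port.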
11.2.2 Attributes
Attribute: Group
Flow rules can be grouped by assigning them a common group number. Groups allow a logical hierarchy
of flow rule groups (tables) to be defined. These groups can be supported virtually in the PMD or in the
physical device. Group 0 is the default group and the only group in which flows are guaranteed to be
matched against; all subsequent groups can only be reached by way of the JUMP action from a matched
flow rule.
Although optional, applications are encouraged to group similar rules as much as possible to fully take
advantage of hardware capabilities (e.g. optimized matching) and work around limitations (e.g. a single
pattern type possibly allowed in a given group), while being aware that the group hierarchies must be
programmed explicitly.
Note that support for more than a single group is not guaranteed.
Attribute: Priority
A priority level can be assigned to a flow rule, lower values denote higher priority, with 0 as the maxi-
mum.
Priority levels are arbitrary and up to the application, they do not need to be contiguous nor start from 0,
however the maximum number varies between devices and may be affected by existing flow rules.
A flow which matches multiple rules in the same group will always be matched by the rule with the
highest priority in that group.
If a packet is matched by several rules of a given group for a given priority level, the outcome is unde-
fined. It can take any path, may be duplicated or even cause unrecoverable errors.
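The selection behavior described above can be illustrated with a small self-contained model. pick_matching_rule() is a hypothetical helper for illustration only, not part of rte_flow: among the rules that match, the lowest priority value (0 = highest priority) wins.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of rule selection within one group: among the rules whose
 * pattern matched (matched[i] != 0), return the index of the rule with the
 * lowest priority value, or -1 if no rule matched. Ties between rules of
 * equal priority are undefined in rte_flow; this model simply keeps the
 * first one found. */
int pick_matching_rule(const uint32_t *priorities, const int *matched, size_t n)
{
    int winner = -1;
    for (size_t i = 0; i < n; i++) {
        if (!matched[i])
            continue;
        if (winner < 0 || priorities[i] < priorities[(size_t)winner])
            winner = (int)i;
    }
    return winner;
}
```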
Note that support for more than a single priority level is not guaranteed.
Attribute: Transfer
Instead of simply matching the properties of traffic as it would appear on a given DPDK port ID, enabling
this attribute transfers a flow rule to the lowest possible level of any device endpoints found in the pattern.
When supported, this effectively enables an application to reroute traffic not necessarily intended for it
(e.g. coming from or addressed to different physical ports, VFs or applications) at the device level.
It complements the behavior of some pattern items such as Item: PHY_PORT and is meaningless without
them.
When transferring flow rules, ingress and egress attributes (Attribute: Traffic direction) keep their orig-
inal meaning, as if processing traffic emitted or received by the application.
Table 11.5: UDPv6 anywhere
Index Item
0 IPv6
1 UDP
2 END
If supported by the PMD, omitting one or several protocol layers at the bottom of the stack as in the
above example (missing an Ethernet specification) enables looking up anywhere in packets.
It is unspecified whether the payload of supported encapsulations (e.g. VXLAN payload) is matched by
such a pattern, which may apply to inner, outer or both packets.
Item: END
End marker for item lists. Prevents further processing of items, thereby ending the pattern.
• Its numeric value is 0 for convenience.
• PMD support is mandatory.
• spec, last and mask are ignored.
Item: VOID
Used as a placeholder for convenience. It is ignored and simply discarded by PMDs.
• PMD support is mandatory.
• spec, last and mask are ignored.
Item: INVERT
Inverted matching, i.e. process packets that do not match the pattern.
• spec, last and mask are ignored.
Table 11.10:
INVERT
Field Value
spec ignored
last ignored
mask ignored
Usage example, matching non-TCPv4 packets only:
Table 11.11:
Anything but TCPv4
Index Item
0 INVERT
1 Ethernet
2 IPv4
3 TCP
4 END
Item: PF
Matches traffic originating from (ingress) or going to (egress) the physical function of the current device.
If supported, should work even if the physical function is not managed by the application and thus not
associated with a DPDK port ID.
• Can be combined with any number of Item: VF to match both PF and VF traffic.
• spec, last and mask must not be set.
Table 11.12: PF
Field Value
spec unset
last unset
mask unset
Item: VF
Matches traffic originating from (ingress) or going to (egress) a given virtual function of the current
device.
If supported, should work even if the virtual function is not managed by the application and thus not
associated with a DPDK port ID.
Note this pattern item does not match VF representors traffic which, as separate entities, should be
addressed through their own DPDK port IDs.
• Can be specified multiple times to match traffic addressed to several VF IDs.
• Can be combined with a PF item to match both PF and VF traffic.
• Default mask matches any VF ID.
Table 11.13: VF
Field Subfield Value
spec id destination VF ID
last id upper range value
mask id zeroed to match any VF ID
Item: PHY_PORT
Matches traffic originating from (ingress) or going to (egress) a physical port of the underlying device.
The first PHY_PORT item overrides the physical port normally associated with the specified DPDK
input port (port_id). This item can be provided several times to match additional physical ports.
Note that physical ports are not necessarily tied to DPDK input ports (port_id) when those are not under
DPDK control. Possible values are specific to each device, they are not necessarily indexed from zero
and may not be contiguous.
As a device property, the list of allowed values as well as the value associated with a port_id should be
retrieved by other means.
• Default mask matches any port index.
Item: PORT_ID
Matches traffic originating from (ingress) or going to (egress) a given DPDK port ID.
Normally only supported if the port ID in question is known by the underlying PMD and related to the
device the flow rule is created against.
This must not be confused with Item: PHY_PORT which refers to the physical port of a device, whereas
Item: PORT_ID refers to a struct rte_eth_dev object on the application side (also known as
“port representor” depending on the kind of underlying device).
• Default mask matches the specified DPDK port ID.
Item: MARK
Matches an arbitrary integer value which was set using the MARK action in a previously matched rule.
This item can only be specified once as a match criterion, as the MARK action can only be specified once
in a flow rule's actions.
Note that the value of the MARK field is arbitrary and application-defined.
Depending on the underlying implementation the MARK item may be supported on the physical device,
with virtual groups in the PMD or not at all.
• Default mask matches any integer value.
Item: TAG
Matches a tag value set by other flow rules. Multiple tags are supported by specifying an index.
• Default mask matches the specified tag value and index.
Item: META
Matches a 32-bit metadata item.
On egress, metadata can be set either through the mbuf metadata field (with the
PKT_TX_DYNF_METADATA flag) or by the SET_META action. On ingress, the SET_META action
sets metadata for a packet and the metadata is reported via the metadata dynamic field of rte_mbuf with
the PKT_RX_DYNF_METADATA flag.
• Default mask matches the specified Rx metadata value.
Item: ANY
Matches any protocol in place of the current layer, a single ANY may also stand for several protocol
layers.
This is usually specified as the first pattern item when looking for a protocol anywhere in a packet.
• Default mask stands for any number of layers.
Item: RAW
Matches a byte string of a given length at a given offset.
Offset is either absolute (using the start of the packet) or relative to the end of the previous matched item
in the stack, in which case negative values are allowed.
If search is enabled, offset is used as the starting point. The search area can be delimited by setting limit
to a nonzero value, which is the maximum number of bytes after offset where the pattern may start.
Matching a zero-length pattern is allowed, doing so resets the relative offset for subsequent items.
• This type does not support ranges (last field).
• Default mask matches all fields exactly.
Note that matching subsequent pattern items would resume after “baz”, not “bar” since matching is
always performed after the previous item of the stack.
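The offset and search semantics can be modeled with a short self-contained function. raw_match() is illustrative only (it is not part of the rte_flow API) and ignores the relative-offset and mask aspects:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Toy model of RAW item matching: look for `pat` of length `pat_len` in
 * `pkt`. Without search, the pattern must start exactly at `offset`; with
 * search enabled, it may start anywhere in [offset, offset + limit], where
 * `limit` is the maximum number of bytes after `offset` where the pattern
 * may begin. Returns the offset where the pattern matched, or -1. */
int raw_match(const uint8_t *pkt, size_t pkt_len,
              const uint8_t *pat, size_t pat_len,
              size_t offset, int search, size_t limit)
{
    size_t last = search ? offset + limit : offset;
    for (size_t pos = offset; pos <= last; pos++) {
        if (pos + pat_len > pkt_len)
            break;
        if (memcmp(pkt + pos, pat, pat_len) == 0)
            return (int)pos;
    }
    return -1;
}
```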
Item: ETH
Matches an Ethernet header.
The type field either stands for “EtherType” or “TPID” when followed by so-called layer 2.5 pattern
items such as RTE_FLOW_ITEM_TYPE_VLAN. In the latter case, type refers to that of the outer
header, with the inner EtherType/TPID provided by the subsequent pattern item. This is the same order
as on the wire.
Item: VLAN
Matches an 802.1Q/ad VLAN tag.
The corresponding standard outer EtherType (TPID) values are RTE_ETHER_TYPE_VLAN or
RTE_ETHER_TYPE_QINQ. It can be overridden by the preceding pattern item.
• tci: tag control information.
• inner_type: inner EtherType or TPID.
• Default mask matches the VID part of TCI only (lower 12 bits).
Item: IPV4
Matches an IPv4 header.
Note: IPv4 options are handled by dedicated pattern items.
• hdr: IPv4 header definition (rte_ip.h).
• Default mask matches source and destination addresses only.
Item: IPV6
Matches an IPv6 header.
Note: IPv6 options are handled by dedicated pattern items, see Item: IPV6_EXT.
• hdr: IPv6 header definition (rte_ip.h).
• Default mask matches source and destination addresses only.
Item: ICMP
Matches an ICMP header.
• hdr: ICMP header definition (rte_icmp.h).
• Default mask matches ICMP type and code only.
Item: UDP
Matches a UDP header.
• hdr: UDP header definition (rte_udp.h).
• Default mask matches source and destination ports only.
Item: TCP
Matches a TCP header.
• hdr: TCP header definition (rte_tcp.h).
• Default mask matches source and destination ports only.
Item: SCTP
Matches a SCTP header.
• hdr: SCTP header definition (rte_sctp.h).
• Default mask matches source and destination ports only.
Item: VXLAN
Matches a VXLAN header (RFC 7348).
• flags: normally 0x08 (I flag).
• rsvd0: reserved, normally 0x000000.
• vni: VXLAN network identifier.
• rsvd1: reserved, normally 0x00.
• Default mask matches VNI only.
Item: E_TAG
Matches an IEEE 802.1BR E-Tag header.
The corresponding standard outer EtherType (TPID) value is RTE_ETHER_TYPE_ETAG. It can be
overridden by the preceding pattern item.
• epcp_edei_in_ecid_b: E-Tag control information (E-TCI), E-PCP (3b), E-DEI (1b),
ingress E-CID base (12b).
• rsvd_grp_ecid_b: reserved (2b), GRP (2b), E-CID base (12b).
• in_ecid_e: ingress E-CID ext.
• ecid_e: E-CID ext.
• inner_type: inner EtherType or TPID.
• Default mask simultaneously matches GRP and E-CID base.
Item: NVGRE
Matches an NVGRE header (RFC 7637).
• c_k_s_rsvd0_ver: checksum (1b), undefined (1b), key bit (1b), sequence number (1b), re-
served 0 (9b), version (3b). This field must have value 0x2000 according to RFC 7637.
• protocol: protocol type (0x6558).
• tni: virtual subnet ID.
• flow_id: flow ID.
• Default mask matches TNI only.
Item: MPLS
Matches a MPLS header.
• label_tc_s_ttl: label, TC, Bottom of Stack and TTL.
• Default mask matches label only.
Item: GRE
Matches a GRE header.
• c_rsvd0_ver: checksum, reserved 0 and version.
• protocol: protocol type.
• Default mask matches protocol only.
Item: GRE_KEY
Matches a GRE key field. This should be preceded by item GRE.
• Value to be matched is a big-endian 32 bit integer.
• When this item is present, the K bit in the default mask is implicitly matched as “1”.
Item: FUZZY
Fuzzy pattern match; expected to be faster than the default exact match.
This is for devices that support a fuzzy match option. Usually a fuzzy match is fast but the cost is
accuracy, i.e. a signature match only matches a pattern’s hash value, so it is possible for two different
patterns to have the same hash value.
Matching accuracy level can be configured by threshold. The driver can divide the range of threshold
and map it to the different accuracy levels that the device supports.
Threshold 0 means perfect match (no fuzziness), while threshold 0xffffffff means the fuzziest match.
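The accuracy trade-off can be demonstrated with a deliberately weak hash. These helpers are illustrative only and not part of any PMD; a real signature match uses a much stronger hash, but the collision principle is the same:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy "signature": an 8-bit additive checksum over the pattern bytes.
 * Deliberately weak so that collisions are easy to demonstrate. */
uint8_t signature8(const uint8_t *data, size_t len)
{
    uint8_t sig = 0;
    for (size_t i = 0; i < len; i++)
        sig = (uint8_t)(sig + data[i]);
    return sig;
}

/* Signature match: compares hashes instead of full patterns, trading
 * accuracy for speed. Two different patterns can compare equal. */
int signature_match(const uint8_t *a, const uint8_t *b, size_t len)
{
    return signature8(a, len) == signature8(b, len);
}
```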
Item: ESP
Matches an ESP header.
• hdr: ESP header definition (rte_esp.h).
• Default mask matches SPI only.
Item: GENEVE
Matches a GENEVE header.
• ver_opt_len_o_c_rsvd0: version (2b), length of the options fields (6b), OAM packet (1b),
critical options present (1b), reserved 0 (6b).
• protocol: protocol type.
• vni: virtual network identifier.
• rsvd1: reserved, normally 0x00.
• Default mask matches VNI only.
Item: VXLAN-GPE
Matches a VXLAN-GPE header (draft-ietf-nvo3-vxlan-gpe-05).
• flags: normally 0x0C (I and P flags).
• rsvd0: reserved, normally 0x0000.
• protocol: protocol type.
• vni: VXLAN network identifier.
• rsvd1: reserved, normally 0x00.
• Default mask matches VNI only.
Item: ARP_ETH_IPV4
Matches an ARP header for Ethernet/IPv4.
• hdr: hardware type, normally 1.
• pro: protocol type, normally 0x0800.
Item: IPV6_EXT
Matches the presence of any IPv6 extension header.
• next_hdr: next header.
• Default mask matches next_hdr.
Normally preceded by any of:
• Item: IPV6
• Item: IPV6_EXT
Item: ICMP6
Matches any ICMPv6 header.
• type: ICMPv6 type.
• code: ICMPv6 code.
• checksum: ICMPv6 checksum.
• Default mask matches type and code.
Item: ICMP6_ND_NS
Matches an ICMPv6 neighbor discovery solicitation.
• type: ICMPv6 type, normally 135.
• code: ICMPv6 code, normally 0.
• checksum: ICMPv6 checksum.
• reserved: reserved, normally 0.
• target_addr: target address.
• Default mask matches target address only.
Item: ICMP6_ND_NA
Matches an ICMPv6 neighbor discovery advertisement.
• type: ICMPv6 type, normally 136.
• code: ICMPv6 code, normally 0.
Item: ICMP6_ND_OPT
Matches the presence of any ICMPv6 neighbor discovery option.
• type: ND option type.
• length: ND option length.
• Default mask matches type only.
Normally preceded by any of:
• Item: ICMP6_ND_NA
• Item: ICMP6_ND_NS
• Item: ICMP6_ND_OPT
Item: ICMP6_ND_OPT_SLA_ETH
Matches an ICMPv6 neighbor discovery source Ethernet link-layer address option.
• type: ND option type, normally 1.
• length: ND option length, normally 1.
• sla: source Ethernet LLA.
• Default mask matches source link-layer address only.
Normally preceded by any of:
• Item: ICMP6_ND_NA
• Item: ICMP6_ND_OPT
Item: ICMP6_ND_OPT_TLA_ETH
Matches an ICMPv6 neighbor discovery target Ethernet link-layer address option.
• type: ND option type, normally 2.
• length: ND option length, normally 1.
• tla: target Ethernet LLA.
• Default mask matches target link-layer address only.
Normally preceded by any of:
• Item: ICMP6_ND_NS
• Item: ICMP6_ND_OPT
Item: META
Matches an application specific 32 bit metadata item.
• Default mask matches the specified metadata value.
Item: GTP_PSC
Matches a GTP PDU extension header with type 0x85.
• pdu_type: PDU type.
• qfi: QoS flow identifier.
• Default mask matches QFI only.
Item: PPPOE_PROTO_ID
Matches a PPPoE session protocol identifier.
• proto_id: PPP protocol identifier.
• Default mask matches proto_id only.
Item: NSH
Matches a network service header (RFC 8300).
• version: normally 0x0 (2 bits).
• oam_pkt: indicates an OAM packet (1 bit).
• reserved: reserved bit (1 bit).
• ttl: maximum SFF hops (6 bits).
• length: total length in 4-byte words (6 bits).
• reserved1: reserved bits (4 bits).
• mdtype: indicates the format of the NSH header (4 bits).
• next_proto: indicates the protocol type of the encapsulated data (8 bits).
• spi: service path identifier (3 bytes).
• sindex: service index (1 byte).
• Default mask matches mdtype, next_proto, spi and sindex.
Item: IGMP
Matches an Internet Group Management Protocol header (RFC 2236).
• type: IGMP message type (Query/Report).
• max_resp_time: maximum time allowed before sending a report.
• checksum: checksum, 1’s complement of the whole IGMP message.
• group_addr: group address; for a Query the value will be 0.
• Default mask matches group_addr.
Item: AH
Matches an IP Authentication Header (RFC 4302).
• next_hdr: next payload after AH.
• payload_len: total length of AH in 4-byte words.
• reserved: reserved bits.
• spi: security parameters index.
• seq_num: counter value increased by 1 on each packet sent.
• Default mask matches spi.
Item: HIGIG2
Matches a HIGIG2 header field. It is a layer 2.5 protocol used in Broadcom switches.
• Default mask matches classification and vlan.
11.2.7 Actions
Each possible action is represented by a type. An action can have an associated configuration object.
Several actions combined in a list can be assigned to a flow rule and are performed in order.
They fall into three categories:
• Actions that modify the fate of matching traffic, for instance by dropping or assigning it a specific
destination.
• Actions that modify matching traffic contents or its properties. This includes adding/removing
encapsulation, encryption, compression and marks.
• Actions related to the flow rule itself, such as updating counters or making it non-terminating.
Flow rules are terminating by default; not specifying any action of the fate kind results in undefined
behavior. This applies to both ingress and egress.
PASSTHRU, when supported, makes a flow rule non-terminating.
Like matching patterns, action lists are terminated by END items.
Example of action that redirects packets to queue index 10:
Table 11.25:
Queue action
Field Value
index 10
Actions are performed in list order:
Action: END
End marker for action lists. Prevents further processing of actions, thereby ending the list.
• Its numeric value is 0 for convenience.
• PMD support is mandatory.
• No configurable properties.
Table 11.30:
END
Field
no properties
Action: VOID
Used as a placeholder for convenience. It is ignored and simply discarded by PMDs.
• PMD support is mandatory.
• No configurable properties.
Table 11.31:
VOID
Field
no properties
Action: PASSTHRU
Leaves traffic up for additional processing by subsequent flow rules; makes a flow rule non-terminating.
• No configurable properties.
Table 11.32:
PASSTHRU
Field
no properties
Example to copy a packet to a queue and continue processing by subsequent flow rules:
Action: JUMP
Redirects packets to a group on the current device.
In a hierarchy of groups, which can be used to represent physical or logical flow group/tables on the
device, this action redirects the matched flow to the specified group on that device.
If a matched flow is redirected to a table which doesn’t contain a matching rule for that flow, the
behavior is undefined and up to the specific device. Best practice when using groups is to define a
default flow rule for each group which defines the default actions in that group, so that consistent
behavior is defined.
Defining an action for matched flow in a group to jump to a group which is higher in the group hierarchy
may not be supported by physical devices, depending on how groups are mapped to the physical devices.
In the definitions of jump actions, applications should be aware that it may be possible to define flow
rules which trigger an undefined behavior causing flows to loop between groups.
Action: MARK
Attaches an integer value to packets and sets PKT_RX_FDIR and PKT_RX_FDIR_ID mbuf flags.
This value is arbitrary and application-defined. Maximum allowed value depends on the underlying
implementation. It is returned in the hash.fdir.hi mbuf field.
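On the receive side, the attached value might be read back as sketched below; get_flow_mark() is a hypothetical helper, not a DPDK API:

```c
#include <rte_mbuf.h>

/* Sketch: retrieve the value attached by a MARK action to a received
 * packet. A valid mark is signalled by the PKT_RX_FDIR_ID flag. */
static int
get_flow_mark(const struct rte_mbuf *m, uint32_t *mark)
{
    if ((m->ol_flags & PKT_RX_FDIR_ID) == 0)
        return -1;              /* no mark attached to this packet */
    *mark = m->hash.fdir.hi;    /* the value set by the MARK action */
    return 0;
}
```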
Action: FLAG
Flags packets. Similar to Action: MARK without a specific value; only sets the PKT_RX_FDIR mbuf
flag.
• No configurable properties.
Table 11.36:
FLAG
Field
no properties
Action: QUEUE
Assigns packets to a given queue index.
Action: DROP
Drop packets.
• No configurable properties.
Table 11.38:
DROP
Field
no properties
Action: COUNT
Adds a counter action to a matched flow.
If more than one count action is specified in a single flow rule, then each action must specify a unique
id.
Counters can be retrieved and reset through rte_flow_query(), see struct
rte_flow_query_count.
The shared flag indicates whether the counter is unique to the flow rule the action is specified with, or
whether it is a shared counter.
For a count action with the shared flag set, a global device namespace is assumed for the counter id, so
that any matched flow rules using a count action with the same counter id on the same port will
contribute to that counter.
For ports within the same switch domain, the counter id namespace extends to all ports within that
switch domain.
Action: RSS
Similar to QUEUE, except RSS is additionally performed on packets to spread them among several
queues according to the provided parameters.
Unlike global RSS settings used by other DPDK APIs, unsetting the types field does not disable RSS
in a flow rule. Doing so instead requests safe unspecified “best-effort” settings from the underlying
PMD, which depending on the flow rule, may result in anything ranging from empty (single queue) to
all-inclusive RSS.
Note: RSS hash result is stored in the hash.rss mbuf field which overlaps hash.fdir.lo. Since
Action: MARK sets the hash.fdir.hi field only, both can be requested simultaneously.
Also, regarding packet encapsulation level:
• 0 requests the default behavior. Depending on the packet type, it can mean outermost, innermost,
anything in between or even no RSS.
It basically stands for the innermost encapsulation level RSS can be performed on according to
PMD and device capabilities.
• 1 requests RSS to be performed on the outermost packet encapsulation level.
• 2 and subsequent values request RSS to be performed on the specified inner packet encapsu-
lation level, from outermost to innermost (lower to higher values).
Values other than 0 are not necessarily supported.
Requesting a specific RSS level on unrecognized traffic results in undefined behavior. For predictable
results, it is recommended to make the flow rule pattern match packet headers up to the requested en-
capsulation level so that only matching traffic goes through.
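As an illustrative configuration sketch (the queue indices and RSS types chosen here are assumptions, not requirements), an RSS action spreading matched traffic over four queues on the outermost encapsulation level could be set up as:

```c
#include <rte_ethdev.h>
#include <rte_flow.h>

/* Sketch: RSS action configuration spreading matched traffic over queues
 * 0-3 on the outermost encapsulation level. The queues must already be
 * configured on the port. */
static const uint16_t rss_queues[] = { 0, 1, 2, 3 };
static const struct rte_flow_action_rss rss_conf = {
    .func = RTE_ETH_HASH_FUNCTION_DEFAULT,
    .level = 1,                           /* outermost headers */
    .types = ETH_RSS_IP | ETH_RSS_TCP,    /* fields fed to the hash */
    .queue_num = 4,
    .queue = rss_queues,
};
static const struct rte_flow_action rss_actions[] = {
    { .type = RTE_FLOW_ACTION_TYPE_RSS, .conf = &rss_conf },
    { .type = RTE_FLOW_ACTION_TYPE_END },
};
```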
Action: PF
Directs matching traffic to the physical function (PF) of the current device.
Table 11.42:
PF
Field
no properties
Action: VF
Directs matching traffic to a given virtual function of the current device.
Packets matched by a VF pattern item can be redirected to their original VF ID instead of the specified
one. This parameter may not be available and is not guaranteed to work properly if the VF part is
matched by a prior flow rule or if packets are not addressed to a VF in the first place.
See Item: VF.
Table 11.43: VF
Field Value
original use original VF ID if possible
id VF ID
Action: PHY_PORT
Directs matching traffic to a given physical port index of the underlying device.
See Item: PHY_PORT.
Action: PORT_ID
Directs matching traffic to a given DPDK port ID.
See Item: PORT_ID.
Action: METER
Applies a stage of metering and policing.
The metering and policing (MTR) object has to be first created using the rte_mtr_create() API function.
The ID of the MTR object is specified as action parameter. More than one flow can use the same MTR
object through the meter action. The MTR object can be further updated or queried using the rte_mtr*
API.
Action: SECURITY
Perform the security action on flows matched by the pattern items according to the configuration of the
security session.
This action modifies the payload of matched flows. For INLINE_CRYPTO, the security protocol headers
and IV are fully provided by the application as specified in the flow pattern. The payload of matching
packets is encrypted on egress, and decrypted and authenticated on ingress. For INLINE_PROTOCOL,
the security protocol is fully offloaded to HW, providing full encapsulation and decapsulation of packets
in security protocols. The flow pattern specifies both the outer security header fields and the inner packet
fields. The security session specified in the action must match the pattern parameters.
The security session specified in the action must be created on the same port as the flow action that is
being specified.
The ingress/egress flow attribute should match that specified in the security session if the security session
supports the definition of the direction.
Multiple flows can be configured to use the same security session.
Action: OF_SET_MPLS_TTL
Implements OFPAT_SET_MPLS_TTL (“MPLS TTL”) as defined by the OpenFlow Switch Specifica-
tion.
Table 11.50:
OF_SET_MPLS_TTL
Field Value
mpls_ttl MPLS TTL
Action: OF_DEC_MPLS_TTL
Implements OFPAT_DEC_MPLS_TTL (“decrement MPLS TTL”) as defined by the OpenFlow Switch
Specification.
Table 11.51:
OF_DEC_MPLS_TTL
Field
no properties
Action: OF_SET_NW_TTL
Implements OFPAT_SET_NW_TTL (“IP TTL”) as defined by the OpenFlow Switch Specification.
Table 11.52:
OF_SET_NW_TTL
Field Value
nw_ttl IP TTL
Action: OF_DEC_NW_TTL
Implements OFPAT_DEC_NW_TTL (“decrement IP TTL”) as defined by the OpenFlow Switch Specifi-
cation.
Table 11.53:
OF_DEC_NW_TTL
Field
no properties
Action: OF_COPY_TTL_OUT
Implements OFPAT_COPY_TTL_OUT (“copy TTL “outwards” – from next-to-outermost to outer-
most”) as defined by the OpenFlow Switch Specification.
Table 11.54:
OF_COPY_TTL_OUT
Field
no properties
Action: OF_COPY_TTL_IN
Implements OFPAT_COPY_TTL_IN (“copy TTL “inwards” – from outermost to next-to-outermost”)
as defined by the OpenFlow Switch Specification.
Table 11.55:
OF_COPY_TTL_IN
Field
no properties
Action: OF_POP_VLAN
Implements OFPAT_POP_VLAN (“pop the outer VLAN tag”) as defined by the OpenFlow Switch Spec-
ification.
Table 11.56:
OF_POP_VLAN
Field
no properties
Action: OF_PUSH_VLAN
Implements OFPAT_PUSH_VLAN (“push a new VLAN tag”) as defined by the OpenFlow Switch Spec-
ification.
Table 11.57:
OF_PUSH_VLAN
Field Value
ethertype EtherType
Action: OF_SET_VLAN_VID
Implements OFPAT_SET_VLAN_VID (“set the 802.1q VLAN id”) as defined by the OpenFlow Switch
Specification.
Table 11.58:
OF_SET_VLAN_VID
Field Value
vlan_vid VLAN id
Action: OF_SET_VLAN_PCP
Implements OFPAT_SET_VLAN_PCP (“set the 802.1q priority”) as defined by the OpenFlow Switch
Specification.
Table 11.59:
OF_SET_VLAN_PCP
Field Value
vlan_pcp VLAN priority
Action: OF_POP_MPLS
Implements OFPAT_POP_MPLS (“pop the outer MPLS tag”) as defined by the OpenFlow Switch Spec-
ification.
Table 11.60:
OF_POP_MPLS
Field Value
ethertype EtherType
Action: OF_PUSH_MPLS
Implements OFPAT_PUSH_MPLS (“push a new MPLS tag”) as defined by the OpenFlow Switch Spec-
ification.
Table 11.61:
OF_PUSH_MPLS
Field Value
ethertype EtherType
Action: VXLAN_ENCAP
Performs a VXLAN encapsulation action by encapsulating the matched flow in a VXLAN tunnel as
defined in the rte_flow_action_vxlan_encap flow item definition.
This action modifies the payload of matched flows. The flow definition specified in the
rte_flow_action_tunnel_encap action structure must define a valid VXLAN network over-
lay which conforms with RFC 7348 (Virtual eXtensible Local Area Network (VXLAN): A Framework
for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks). The pattern must be terminated
with the RTE_FLOW_ITEM_TYPE_END item type.
Action: VXLAN_DECAP
Performs a decapsulation action by stripping all headers of the VXLAN tunnel network overlay from
the matched flow.
The flow items pattern defined for the flow rule with which a VXLAN_DECAP action is specified, must
define a valid VXLAN tunnel as per RFC7348. If the flow pattern does not specify a valid VXLAN
tunnel then a RTE_FLOW_ERROR_TYPE_ACTION error should be returned.
This action modifies the payload of matched flows.
Action: NVGRE_ENCAP
Performs an NVGRE encapsulation action by encapsulating the matched flow in an NVGRE tunnel as
defined in the rte_flow_action_tunnel_encap flow item definition.
This action modifies the payload of matched flows. The flow definition specified in the
rte_flow_action_tunnel_encap action structure must define a valid NVGRE network over-
lay which conforms with RFC 7637 (NVGRE: Network Virtualization Using Generic Routing Encap-
sulation). The pattern must be terminated with the RTE_FLOW_ITEM_TYPE_END item type.
Action: NVGRE_DECAP
Performs a decapsulation action by stripping all headers of the NVGRE tunnel network overlay from the
matched flow.
The flow items pattern defined for the flow rule with which a NVGRE_DECAP action is specified must
define a valid NVGRE tunnel as per RFC 7637. If the flow pattern does not specify a valid NVGRE
tunnel, then a RTE_FLOW_ERROR_TYPE_ACTION error should be returned.
Action: RAW_ENCAP
Adds an outer header whose template is provided in its data buffer, as defined in the
rte_flow_action_raw_encap definition.
This action modifies the payload of matched flows. The data supplied must be a valid header, either
holding layer 2 data in case layer 2 is added after decapsulating a layer 3 tunnel (for example
MPLSoGRE), or a complete tunnel definition starting from layer 2 and moving to the tunnel item itself.
When applied to the original packet, the resulting packet must be a valid packet.
Action: RAW_DECAP
Removes the outer header whose template is provided in its data buffer, as defined in the
rte_flow_action_raw_decap definition.
This action modifies the payload of matched flows. The data supplied must be a valid header, either
holding layer 2 data in case layer 2 is removed before encapsulating a layer 3 tunnel (for example
MPLSoGRE), or a complete tunnel definition starting from layer 2 and moving to the tunnel item itself.
When applied to the original packet, the resulting packet must be a valid packet.
Action: SET_IPV4_SRC
Set a new IPv4 source address in the outermost IPv4 header.
It must be used with a valid RTE_FLOW_ITEM_TYPE_IPV4 flow pattern item. Otherwise,
RTE_FLOW_ERROR_TYPE_ACTION error will be returned.
Action: SET_IPV4_DST
Set a new IPv4 destination address in the outermost IPv4 header.
It must be used with a valid RTE_FLOW_ITEM_TYPE_IPV4 flow pattern item. Otherwise,
RTE_FLOW_ERROR_TYPE_ACTION error will be returned.
Action: SET_IPV6_SRC
Set a new IPv6 source address in the outermost IPv6 header.
It must be used with a valid RTE_FLOW_ITEM_TYPE_IPV6 flow pattern item. Otherwise,
RTE_FLOW_ERROR_TYPE_ACTION error will be returned.
Action: SET_IPV6_DST
Set a new IPv6 destination address in the outermost IPv6 header.
It must be used with a valid RTE_FLOW_ITEM_TYPE_IPV6 flow pattern item. Otherwise,
RTE_FLOW_ERROR_TYPE_ACTION error will be returned.
Action: SET_TP_SRC
Set a new source port number in the outermost TCP/UDP header.
It must be used with a valid RTE_FLOW_ITEM_TYPE_TCP or RTE_FLOW_ITEM_TYPE_UDP flow
pattern item. Otherwise, RTE_FLOW_ERROR_TYPE_ACTION error will be returned.
Action: SET_TP_DST
Set a new destination port number in the outermost TCP/UDP header.
It must be used with a valid RTE_FLOW_ITEM_TYPE_TCP or RTE_FLOW_ITEM_TYPE_UDP flow
pattern item. Otherwise, RTE_FLOW_ERROR_TYPE_ACTION error will be returned.
Action: MAC_SWAP
Swap the source and destination MAC addresses in the outermost Ethernet header.
It must be used with a valid RTE_FLOW_ITEM_TYPE_ETH flow pattern item. Otherwise,
RTE_FLOW_ERROR_TYPE_ACTION error will be returned.
Table 11.74: MAC_SWAP

Field
no properties
Action: DEC_TTL
Decrease TTL value.
If there is no valid RTE_FLOW_ITEM_TYPE_IPV4 or RTE_FLOW_ITEM_TYPE_IPV6 item in the
pattern, some PMDs will reject the rule because the behavior would be undefined.
Table 11.75: DEC_TTL

Field
no properties
Action: SET_TTL
Assigns a new TTL value.
If there is no valid RTE_FLOW_ITEM_TYPE_IPV4 or RTE_FLOW_ITEM_TYPE_IPV6 item in the
pattern, some PMDs will reject the rule because the behavior would be undefined.
Action: SET_MAC_SRC
Set source MAC address.
It must be used with a valid RTE_FLOW_ITEM_TYPE_ETH flow pattern item. Otherwise,
RTE_FLOW_ERROR_TYPE_ACTION error will be returned.
Action: SET_MAC_DST
Set destination MAC address.
It must be used with a valid RTE_FLOW_ITEM_TYPE_ETH flow pattern item. Otherwise,
RTE_FLOW_ERROR_TYPE_ACTION error will be returned.
Action: INC_TCP_SEQ
Increase the sequence number in the outermost TCP header. The value to increase the TCP sequence
number by is a big-endian 32-bit integer.
Using this action on non-matching traffic will result in undefined behavior.
Action: DEC_TCP_SEQ
Decrease the sequence number in the outermost TCP header. The value to decrease the TCP sequence
number by is a big-endian 32-bit integer.
Using this action on non-matching traffic will result in undefined behavior.
Action: INC_TCP_ACK
Increase the acknowledgment number in the outermost TCP header. The value to increase the TCP
acknowledgment number by is a big-endian 32-bit integer.
Using this action on non-matching traffic will result in undefined behavior.
Action: DEC_TCP_ACK
Decrease the acknowledgment number in the outermost TCP header. The value to decrease the TCP
acknowledgment number by is a big-endian 32-bit integer.
Using this action on non-matching traffic will result in undefined behavior.
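As an illustration, the big-endian 32-bit value these actions take could be prepared as follows. This is a self-contained sketch that uses the standard htonl() as a stand-in for DPDK's rte_cpu_to_be_32(); it is not part of the rte_flow API itself.

```c
#include <stdint.h>
#include <arpa/inet.h>

/* The value passed to INC_TCP_SEQ/DEC_TCP_SEQ (and the ACK variants)
 * is a big-endian 32-bit integer. htonl() converts a host-order value
 * to network (big-endian) byte order, mirroring rte_cpu_to_be_32(). */
static uint32_t make_be32(uint32_t host_value)
{
    return htonl(host_value);
}
```

Regardless of host endianness, the resulting in-memory byte layout is most-significant byte first, which is what the action expects.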
Action: SET_TAG
Set tag.
A tag is transient data used during flow matching; it is not delivered to the application. Multiple tags
are supported by specifying an index.
Action: SET_META
Set metadata. The META item matches metadata.
Metadata set in the mbuf metadata field with the PKT_TX_DYNF_METADATA flag on egress will be
overridden by this action. On ingress, the metadata will be carried in the metadata dynamic field of
rte_mbuf, which can be accessed with RTE_FLOW_DYNF_METADATA(). The
PKT_RX_DYNF_METADATA flag will be set along with the data.
11.3.1 Validation
Given that expressing a definite set of device capabilities is not practical, a dedicated function is provided
to check if a flow rule is supported and can be created.
int
rte_flow_validate(uint16_t port_id,
                  const struct rte_flow_attr *attr,
                  const struct rte_flow_item pattern[],
                  const struct rte_flow_action actions[],
                  struct rte_flow_error *error);
The flow rule is validated for correctness and whether it could be accepted by the device given sufficient
resources. The rule is checked against the current device mode and queue configuration. The flow rule
may also optionally be validated against existing flow rules and device resources. This function has no
effect on the target device.
The returned value is guaranteed to remain valid only as long as no successful calls to
rte_flow_create() or rte_flow_destroy() are made in the meantime and no device parameters
affecting flow rules in any way are modified, due to possible collisions or resource limitations
(although in such cases EINVAL should not be returned).
Arguments:
• port_id: port identifier of Ethernet device.
• attr: flow rule attributes.
• pattern: pattern specification (list terminated by the END pattern item).
• actions: associated actions (list terminated by the END action).
• error: perform verbose error reporting if not NULL. PMDs initialize this structure in case of
error only.
Return values:
• 0 if flow rule is valid and can be created. A negative errno value otherwise (rte_errno is also
set), the following errors are defined.
• -ENOSYS: underlying device does not support this functionality.
• -EINVAL: unknown or invalid rule specification.
• -ENOTSUP: valid but unsupported rule specification (e.g. partial bit-masks are unsupported).
• -EEXIST: collision with an existing rule. Only returned if the device supports flow rule collision
checking and there was a flow rule collision. Not receiving this return code is no guarantee that
creating the rule will not fail due to a collision.
• -ENOMEM: not enough memory to execute the function, or, if the device supports resource
validation, resource limitation on the device.
• -EBUSY: action cannot be performed due to busy device resources, may succeed if the affected
queues or even the entire port are in a stopped state (see rte_eth_dev_rx_queue_stop()
and rte_eth_dev_stop()).
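An application-side helper decoding these return values could look like the following self-contained sketch. The function is illustrative only and is not part of the DPDK API; the descriptions paraphrase the list above.

```c
#include <errno.h>

/* Map the negative errno values documented for rte_flow_validate()
 * to short human-readable descriptions. */
static const char *flow_validate_strerror(int ret)
{
    switch (ret) {
    case 0:        return "rule is valid";
    case -ENOSYS:  return "flow API not supported by device";
    case -EINVAL:  return "unknown or invalid rule specification";
    case -ENOTSUP: return "valid but unsupported rule specification";
    case -EEXIST:  return "collision with an existing rule";
    case -ENOMEM:  return "not enough memory or device resources";
    case -EBUSY:   return "device resources busy; retry with stopped queues or port";
    default:       return "other error";
    }
}
```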
11.3.2 Creation
Creating a flow rule is similar to validating one, except the rule is actually created and a handle returned.
struct rte_flow *
rte_flow_create(uint16_t port_id,
                const struct rte_flow_attr *attr,
                const struct rte_flow_item pattern[],
                const struct rte_flow_action actions[],
                struct rte_flow_error *error);
Arguments:
• port_id: port identifier of Ethernet device.
• attr: flow rule attributes.
• pattern: pattern specification (list terminated by the END pattern item).
• actions: associated actions (list terminated by the END action).
• error: perform verbose error reporting if not NULL. PMDs initialize this structure in case of
error only.
Return values:
A valid handle in case of success, NULL otherwise and rte_errno is set to the positive version of
one of the error codes defined for rte_flow_validate().
11.3.3 Destruction
Flow rule destruction is not automatic, and a queue or a port should not be released if any flow rules
are still attached to them. Applications must take care of performing this step before releasing
resources.
int
rte_flow_destroy(uint16_t port_id,
                 struct rte_flow *flow,
                 struct rte_flow_error *error);
Failure to destroy a flow rule handle may occur when other flow rules depend on it, and destroying it
would result in an inconsistent state.
This function is only guaranteed to succeed if handles are destroyed in reverse order of their creation.
Arguments:
• port_id: port identifier of Ethernet device.
• flow: flow rule handle to destroy.
• error: perform verbose error reporting if not NULL. PMDs initialize this structure in case of
error only.
Return values:
• 0 on success, a negative errno value otherwise and rte_errno is set.
11.3.4 Flush
Convenience function to destroy all flow rule handles associated with a port. They are released as with
successive calls to rte_flow_destroy().
int
rte_flow_flush(uint16_t port_id,
               struct rte_flow_error *error);
In the unlikely event of failure, handles are still considered destroyed and no longer valid but the port
must be assumed to be in an inconsistent state.
Arguments:
• port_id: port identifier of Ethernet device.
• error: perform verbose error reporting if not NULL. PMDs initialize this structure in case of
error only.
Return values:
• 0 on success, a negative errno value otherwise and rte_errno is set.
11.3.5 Query
Query an existing flow rule.
This function allows retrieving flow-specific data such as counters. Data is gathered by special actions
which must be present in the flow rule definition.
int
rte_flow_query(uint16_t port_id,
               struct rte_flow *flow,
               const struct rte_flow_action *action,
               void *data,
               struct rte_flow_error *error);
Arguments:
• port_id: port identifier of Ethernet device.
• flow: flow rule handle to query.
• action: action to query; this must match the prototype from the flow rule.
• data: pointer to storage for the associated query data type.
• error: perform verbose error reporting if not NULL. PMDs initialize this structure in case of
error only.
Return values:
• 0 on success, a negative errno value otherwise and rte_errno is set.
Applications relying on this mode are therefore encouraged to toggle it as soon as possible after device
initialization, ideally before the first call to rte_eth_dev_configure() to avoid possible failures
due to conflicting settings.
Once effective, the following functionality has no effect on the underlying port and may return errors
such as ENOTSUP (“not supported”):
• Toggling promiscuous mode.
• Toggling allmulticast mode.
• Configuring MAC addresses.
• Configuring multicast addresses.
• Configuring VLAN filters.
• Configuring Rx filters through the legacy API (e.g. FDIR).
• Configuring global RSS settings.
int
rte_flow_isolate(uint16_t port_id, int set, struct rte_flow_error *error);
Arguments:
• port_id: port identifier of Ethernet device.
• set: nonzero to enter isolated mode, attempt to leave it otherwise.
• error: perform verbose error reporting if not NULL. PMDs initialize this structure in case of
error only.
Return values:
• 0 on success, a negative errno value otherwise and rte_errno is set.
struct rte_flow_error {
    enum rte_flow_error_type type; /**< Cause field and error types. */
    const void *cause;             /**< Object responsible for the error. */
    const char *message;           /**< Human-readable error message. */
};
Error type RTE_FLOW_ERROR_TYPE_NONE stands for no error, in which case the remaining fields
can be ignored. Other error types describe the type of the object pointed to by cause.
If non-NULL, cause points to the object responsible for the error. For a flow rule, this may be a pattern
item or an individual action.
If non-NULL, message provides a human-readable error message.
This object is normally allocated by applications and set by PMDs in case of error. The message
points to a constant string which does not need to be freed by the application; however, its pointer can
be considered valid only as long as its associated DPDK port remains configured. Closing the
underlying device or unloading the PMD invalidates it.
11.6 Helpers
11.6.1 Error initializer
static inline int
rte_flow_error_set(struct rte_flow_error *error,
                   int code,
                   enum rte_flow_error_type type,
                   const void *cause,
                   const char *message);
This function initializes error (if non-NULL) with the provided parameters and sets rte_errno to
code. A negative error code is then returned.
11.7 Caveats
• DPDK does not keep track of flow rules definitions or flow rule objects automatically. Applica-
tions may keep track of the former and must keep track of the latter. PMDs may also do it for
internal needs, however this must not be relied on by applications.
• Flow rules are not maintained between successive port initializations. An application exiting
without releasing them and restarting must re-create them from scratch.
• API operations are synchronous and blocking (EAGAIN cannot be returned).
• There is no provision for re-entrancy/multi-thread safety, although nothing should prevent differ-
ent devices from being configured at the same time. PMDs may protect their control path functions
accordingly.
• Stopping the data path (TX/RX) should not be necessary when managing flow rules. If this can-
not be achieved naturally or with workarounds (such as temporarily replacing the burst function
pointers), an appropriate error code must be returned (EBUSY).
• PMDs, not applications, are responsible for maintaining flow rules configuration when stopping
and restarting a port or performing other actions which may affect them. They can only be de-
stroyed explicitly by applications.
For devices exposing multiple ports sharing global settings affected by flow rules:
• All ports under DPDK control must behave consistently; PMDs are responsible for making sure
that existing flow rules on a port are not affected by other ports.
• Ports not under DPDK control (unaffected or handled by other applications) are the user's
responsibility. They may affect existing flow rules and cause undefined behavior. PMDs aware of
this may prevent flow rule creation altogether in such cases.
• A method to optimize rte_flow rules with specific pattern items and action types generated on the
fly by PMDs. DPDK should assign negative numbers to these in order to not collide with the
existing types. See Negative types.
• Adding specific egress pattern items and actions as described in Attribute: Traffic direction.
• Optional software fallback when PMDs are unable to handle requested flow rules so applications
do not have to implement their own.
CHAPTER TWELVE: SWITCH REPRESENTATION WITHIN DPDK APPLICATIONS
• Introduction
• Port Representors
• Basic SR-IOV
• Controlled SR-IOV
– Initialization
– VF Representors
– Traffic Steering
• Flow API (rte_flow)
– Extensions
– Traffic Direction
– Transferring Traffic
Programmer’s Guide, Release 19.11.10
• Switching Examples
– Associating VF 1 with Physical Port 0
– Sharing Broadcasts
– Encapsulating VF 2 Traffic in VXLAN
12.1 Introduction
Network adapters with multiple physical ports and/or SR-IOV capabilities usually support the offload of
traffic steering rules between their virtual functions (VFs), physical functions (PFs) and ports.
As with standard Ethernet switches, this involves a combination of automatic MAC learning and
manual configuration. For most purposes it is managed by the host system and fully transparent to
users and applications.
On the other hand, applications typically found on hypervisors that process layer 2 (L2) traffic (such
as OVS) need to steer traffic themselves according to their own criteria.
Without a standard software interface to manage traffic steering rules between VFs, PFs and the var-
ious physical ports of a given device, applications cannot take advantage of these offloads; software
processing is mandatory even for traffic which ends up re-injected into the device it originates from.
This document describes how such steering rules can be configured through the DPDK flow API
(rte_flow), with emphasis on the SR-IOV use case (PF/VF steering) using a single physical port for
clarity, however the same logic applies to any number of ports without necessarily involving SR-IOV.
• As virtual devices, they may be more limited than their physical counterparts, for instance by
exposing only a subset of device configuration callbacks and/or by not necessarily having Rx/Tx
capability.
• Among other things, they can be used to assign MAC addresses to the resource they represent.
[1] Ethernet switch device driver model (switchdev)
• Applications can tell port representors apart from other physical or virtual ports
by checking the dev_flags field within their device information structure for the
RTE_ETH_DEV_REPRESENTOR bit-field.
struct rte_eth_dev_info {
    ...
    uint32_t dev_flags; /**< Device flags */
    ...
};
• The device or group relationship of ports can be discovered using the switch domain_id field
within the device's switch information structure. By default the switch domain_id of a port will
be RTE_ETH_DEV_SWITCH_DOMAIN_ID_INVALID to indicate that the port doesn't support
the concept of a switch domain, but ports which do support the concept will be allocated a unique
switch domain_id; ports within the same switch domain share the same domain_id. The
switch port_id is used to specify the port_id in terms of the switch, so in the case of SR-IOV
devices the switch port_id would represent the virtual function identifier of the port.
/**
 * Ethernet device associated switch information
 */
struct rte_eth_switch_info {
    const char *name;   /**< switch name */
    uint16_t domain_id; /**< switch domain id */
    uint16_t port_id;   /**< switch port id */
};
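For illustration, deciding whether two ports belong to the same switch domain could be sketched as follows. The structure and the INVALID value are local stand-ins mirroring the definitions above, so the example is self-contained.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors RTE_ETH_DEV_SWITCH_DOMAIN_ID_INVALID for this sketch. */
#define SWITCH_DOMAIN_ID_INVALID UINT16_MAX

struct eth_switch_info {
    const char *name;
    uint16_t domain_id;
    uint16_t port_id;
};

/* Two ports are in the same switch domain only if both support the
 * switch-domain concept and report the same domain_id. */
static bool same_switch_domain(const struct eth_switch_info *a,
                               const struct eth_switch_info *b)
{
    if (a->domain_id == SWITCH_DOMAIN_ID_INVALID ||
        b->domain_id == SWITCH_DOMAIN_ID_INVALID)
        return false;
    return a->domain_id == b->domain_id;
}
```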
• A DPDK application running on the hypervisor owns the PF device, which is arbitrarily assigned
port index 3.
• Both VFs are assigned to VMs and used by unknown applications; they may be DPDK-based or
anything else.
• Interconnection is not necessarily done through a true Ethernet switch and may not even exist as a
separate entity. The role of this block is to show that something brings PF, VFs and physical ports
together and enables communication between them, with a number of built-in restrictions.
Subsequent sections in this document describe means for DPDK applications running on the hypervi-
sor to freely assign specific flows between PF, VFs and physical ports based on traffic properties, by
managing this interconnection.
In this mode, interconnection must be configured by the application to enable VF communication, for
instance by explicitly directing traffic with a given destination MAC address to VF 1 and allowing that
with the same source MAC address to come out of it.
For this to work, hypervisor applications need a way to refer to either VF 1 or VF 2 in addition to the
PF. This is addressed by VF representors.
12.4.2 VF Representors
VF representors are virtual but standard DPDK network devices (albeit with limited capabilities) created
by PMDs when managing a PF device.
Since they represent VF instances used by other applications, configuring them (e.g. assigning a MAC
address or setting up promiscuous mode) affects interconnection accordingly. If supported, they may
also be used as two-way communication ports with VFs (assuming a switchdev topology).
• VF representors are assigned arbitrary port indices 4 and 5 in the hypervisor application and are
respectively associated with VF 1 and VF 2.
• They can’t be dissociated; even if VF 1 and VF 2 were not connected, representors could still be
used for configuration.
• In this context, port index 3 can be thought of as a representor for physical port 0.
As previously described, the “interconnection” block represents a logical concept. Interconnection oc-
curs when hardware configuration enables traffic flows from one place to another (e.g. physical port 0
to VF 1) according to some criteria.
This is discussed in more detail in traffic steering.
| | | | |
| | .---------' | |
`-----. | | .-----------------' |
| | | | .---------------------'
| | | | |
.--+-------+---+---+---+--.
| managed interconnection |
`------------+------------'
|
.---(F)----.
| physical |
| port 0 |
`----------'
• A: PF device.
• B: port representor for VF 1.
• C: port representor for VF 2.
• D: VF 1 proper.
• E: VF 2 proper.
• F: physical port.
Although uncommon, some devices do not enforce a one to one mapping between PF and physical ports.
For instance, by default all ports of mlx4 adapters are available to all their PF/VF instances, in which
case additional ports appear next to F in the above diagram.
Assuming no interconnection is provided by default in this mode, setting up a basic SR-IOV configura-
tion involving physical port 0 could be broken down as:
PF:
• A to F: let everything through.
• F to A: PF MAC as destination.
VF 1:
• A to D, E to D and F to D: VF 1 MAC as destination.
• D to A: VF 1 MAC as source and PF MAC as destination.
• D to E: VF 1 MAC as source and VF 2 MAC as destination.
• D to F: VF 1 MAC as source.
VF 2:
• A to E, D to E and F to E: VF 2 MAC as destination.
• E to A: VF 2 MAC as source and PF MAC as destination.
• E to D: VF 2 MAC as source and VF 1 MAC as destination.
• E to F: VF 2 MAC as source.
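As an illustrative sketch only (not an example from this guide), the “F to D: VF 1 MAC as destination” rule above could be expressed in testpmd syntax similar to the examples later in this section, with {VF 1 MAC} a placeholder for the actual address:

```
testpmd> flow create 3 ingress pattern eth dst is {VF 1 MAC} / end
         actions vf id 1 / end
```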
Devices may additionally support advanced matching criteria such as IPv4/IPv6 addresses or TCP/UDP
ports.
The combination of matching criteria with target endpoints fits well with rte_flow [6], which expresses
flow rules as combinations of patterns and actions.

[6] Generic flow API (rte_flow)
Enhancing rte_flow with the ability to make flow rules match and target these endpoints provides a
standard interface to manage their interconnection without introducing new concepts and whole new
API to implement them. This is described in flow API (rte_flow).
| managed interconnection |
`------------+------------'
^ |
ingress | |
egress | |
v |
.---(F)----.
| physical |
| port 0 |
`----------'
Ingress and egress are defined as relative to the application creating the flow rule.
For instance, matching traffic sent by VM 2 would be done through an ingress flow rule on VF 2 (E).
Likewise for incoming traffic on physical port (F). This also applies to C and A respectively.
With “ingress” only, traffic is matched on A and thus still goes to physical port F by default:
testpmd> flow create 3 ingress pattern vf id is 1 / end
actions queue index 6 / end
With “ingress + transfer”, traffic is matched on D and is therefore successfully assigned to queue 6 on A:
testpmd> flow create 3 ingress transfer pattern vf id is 1 / end
actions queue index 6 / end
PORT Action
Directs matching traffic to a given physical port index.
• Targets F in traffic steering.
PORT_ID Action
Directs matching traffic to a given DPDK port ID.
Same restrictions as PORT_ID pattern item.
• Targets A, B or C in traffic steering.
PF Pattern Item
Matches traffic originating from (ingress) or going to (egress) the physical function of the current device.
If supported, should work even if the physical function is not managed by the application and thus not
associated with a DPDK port ID. Its behavior is otherwise similar to PORT_ID pattern item using PF
port ID.
• Matches A in traffic steering.
PF Action
Directs matching traffic to the physical function of the current device.
Same restrictions as PF pattern item.
• Targets A in traffic steering.
VF Pattern Item
Matches traffic originating from (ingress) or going to (egress) a given virtual function of the current
device.
If supported, should work even if the virtual function is not managed by the application and thus not
associated with a DPDK port ID. Its behavior is otherwise similar to PORT_ID pattern item using VF
port ID.
Note this pattern item does not match VF representor traffic which, as separate entities, should be
addressed through their own port IDs.
• Matches D or E in traffic steering.
VF Action
Directs matching traffic to a given virtual function of the current device.
Same restrictions as VF pattern item.
• Targets D or E in traffic steering.
*_ENCAP actions
These actions are named according to the protocol they encapsulate traffic with (e.g. VXLAN_ENCAP)
and using specific parameters (e.g. VNI for VXLAN).
While they modify traffic and can be used multiple times (order matters), unlike PORT_ID action and
friends, they have no impact on steering.
As described in actions order and repetition, this means they are useless if used alone in an action list;
the resulting traffic gets dropped unless combined with either PASSTHRU or other endpoint-targeting
actions.
*_DECAP actions
They perform the reverse of *_ENCAP actions by popping protocol headers from traffic instead of
pushing them. They can be used multiple times as well.
Note that using these actions on non-matching traffic results in undefined behavior. It is recommended to
match the protocol headers to decapsulate on the pattern side of a flow rule in order to use these actions
or otherwise make sure only matching traffic goes through.
By default, PF (A) can communicate with the physical port it is associated with (F), while VF 1 (D)
and VF 2 (E) are isolated and restricted to communicate with the hypervisor application through their
respective representors (B and C) if supported.
Examples in subsequent sections apply to hypervisor applications only and are based on port represen-
tors A, B and C.
Note that port_id id 3 is necessary; otherwise only VFs would receive matching traffic.
From PF to outside and VFs
flow create 3 egress
pattern eth dst is ff:ff:ff:ff:ff:ff / end
actions port / port_id id 4 / port_id id 5 / end
Similar 33:33:* rules based on known MAC addresses should be added for IPv6 traffic.
Here passthru is needed since, as described in actions order and repetition, flow rules are otherwise
terminating; if supported, a rule without a target endpoint will drop traffic.
Without pass-through support, ingress encapsulation on the destination endpoint might not be supported
and the action list must provide one:
flow create 5 ingress
pattern eth src is {VF 2 MAC} / end
actions vxlan_encap vni 42 / port_id id 3 / end
CHAPTER THIRTEEN: TRAFFIC METERING AND POLICING API
13.1 Overview
This is the generic API for the Quality of Service (QoS) Traffic Metering and Policing (MTR) of Ethernet
devices. This API is agnostic of the underlying HW, SW or mixed HW-SW implementation.
The main features are:
• Part of DPDK rte_ethdev API
• Capability query API
• Metering algorithms: RFC 2697 Single Rate Three Color Marker (srTCM), RFC 2698 and RFC
4115 Two Rate Three Color Marker (trTCM)
• Policer actions (per meter output color): recolor, drop
• Statistics (per policer output color)
which case the input packet already has an initial color (the input color), or in color blind mode,
which is equivalent to considering all input packets initially colored as green.
• Policing: There is a separate policer action configured for each meter output color, which can:
– Drop the packet.
– Keep the same packet color: the policer output color matches the meter output color (essen-
tially a no-op action).
– Recolor the packet: the policer output color is set to a different color than the meter output
color. The policer output color is the output color of the packet, which is set in the packet
meta-data (i.e. struct rte_mbuf::sched::color).
• Statistics: The set of counters maintained for each MTR object is configurable and subject to the
implementation support. This set includes the number of packets and bytes dropped or passed for
each output color.
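The per-color policing behavior described above can be sketched as a small decision function. This is an illustrative model only, not the rte_mtr API; the enums and the drop convention are local assumptions.

```c
/* Meter/policer colors and the per-color policer actions described
 * above: keep the meter color, recolor, or drop. */
enum color { GREEN, YELLOW, RED };
enum policer_action { KEEP, RECOLOR_GREEN, RECOLOR_YELLOW, RECOLOR_RED, DROP };

/* Returns the policer output color for the packet, or -1 if dropped.
 * actions[] holds one configured action per meter output color. */
static int police(enum color meter_color,
                  const enum policer_action actions[3])
{
    switch (actions[meter_color]) {
    case KEEP:           return (int)meter_color;
    case RECOLOR_GREEN:  return GREEN;
    case RECOLOR_YELLOW: return YELLOW;
    case RECOLOR_RED:    return RED;
    case DROP:           return -1;
    }
    return -1;
}
```

In the real API, a kept or recolored packet carries its final color in the packet meta-data (struct rte_mbuf::sched::color).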
CHAPTER FOURTEEN: TRAFFIC MANAGEMENT API
14.1 Overview
This is the generic API for the Quality of Service (QoS) Traffic Management of Ethernet devices, which
includes the following main features: hierarchical scheduling, traffic shaping, congestion management,
packet marking. This API is agnostic of the underlying HW, SW or mixed HW-SW implementation.
Main features:
• Part of DPDK rte_ethdev API
• Capability query API per port, per hierarchy level and per hierarchy node
• Scheduling algorithms: Strict Priority (SP), Weighted Fair Queuing (WFQ)
• Traffic shaping: single/dual rate, private (per node) and shared (by multiple nodes) shapers
• Congestion management for hierarchy leaf nodes: algorithms of tail drop, head drop, WRED,
private (per node) and shared (by multiple nodes) WRED contexts
• Packet marking: IEEE 802.1q (VLAN DEI), IETF RFC 3168 (IPv4/IPv6 ECN for TCP and
SCTP), IETF RFC 2597 (IPv4 / IPv6 DSCP)
The configuration of WRED private and shared contexts is done through the definition of WRED pro-
files. Any WRED profile can be used by one or several WRED contexts (either private or shared).
CHAPTER FIFTEEN: WIRELESS BASEBAND DEVICE LIBRARY
The Wireless Baseband library provides a common programming framework that abstracts HW ac-
celerators based on FPGA and/or Fixed Function Accelerators that assist with 3GPP Physical Layer
processing. Furthermore, it decouples the application from the compute-intensive wireless functions by
abstracting their optimized libraries to appear as virtual bbdev devices.
The functional scope of the BBDEV library is those functions in relation to 3GPP Layer 1 signal
processing (channel coding, modulation, ...).
The framework currently only supports the Turbo Code FEC function.
• A unique device index used to designate the bbdev device in all functions exported by the bbdev
API.
• A device name used to designate the bbdev device in console messages, for administration or
debugging purposes. For ease of use, the port name includes the port index.
• num_queues argument identifies the total number of queues to setup for this device.
• socket_id specifies which socket will be used to allocate the memory.
The rte_bbdev_intr_enable API is used to enable interrupts for a bbdev device, if supported by
the driver. It should be called before starting the device.
int rte_bbdev_intr_enable(uint16_t dev_id);
By default, all queues are started when the device is started, but they can be stopped individually.
int rte_bbdev_queue_start(uint16_t dev_id, uint16_t queue_id)
int rte_bbdev_queue_stop(uint16_t dev_id, uint16_t queue_id)
A device reports its capabilities when registering itself in the bbdev framework. With the aid of this
capabilities mechanism, an application can query devices to discover which operations within the 3GPP
physical layer they are capable of performing. Below is an example of the capabilities a PMD reports
in relation to Turbo encoding and decoding operations.
static const struct rte_bbdev_op_cap bbdev_capabilities[] = {
    {
        .type = RTE_BBDEV_OP_TURBO_DEC,
        .cap.turbo_dec = {
            .capability_flags =
                RTE_BBDEV_TURBO_SUBBLOCK_DEINTERLEAVE |
                RTE_BBDEV_TURBO_POS_LLR_1_BIT_IN |
                RTE_BBDEV_TURBO_NEG_LLR_1_BIT_IN |
                RTE_BBDEV_TURBO_CRC_TYPE_24B |
                RTE_BBDEV_TURBO_DEC_TB_CRC_24B_KEEP |
                RTE_BBDEV_TURBO_EARLY_TERMINATION,
            .max_llr_modulus = 16,
            .num_buffers_src = RTE_BBDEV_TURBO_MAX_CODE_BLOCKS,
            .num_buffers_hard_out = RTE_BBDEV_TURBO_MAX_CODE_BLOCKS,
            .num_buffers_soft_out = 0,
        }
    },
    {
        .type = RTE_BBDEV_OP_TURBO_ENC,
        .cap.turbo_enc = {
            .capability_flags =
                RTE_BBDEV_TURBO_CRC_24B_ATTACH |
                RTE_BBDEV_TURBO_CRC_24A_ATTACH |
                RTE_BBDEV_TURBO_RATE_MATCH |
                RTE_BBDEV_TURBO_RV_INDEX_BYPASS,
            .num_buffers_src = RTE_BBDEV_TURBO_MAX_CODE_BLOCKS,
            .num_buffers_dst = RTE_BBDEV_TURBO_MAX_CODE_BLOCKS,
        }
    },
    RTE_BBDEV_END_OF_CAPABILITIES_LIST()
};
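An application-side capability check over such a list could be sketched as follows. The types are simplified local stand-ins for the rte_bbdev_op_cap structures, used only so the example is self-contained; the real definitions live in the bbdev headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the bbdev operation types and capability
 * entries; the list is terminated by an OP_NONE entry, mirroring
 * RTE_BBDEV_END_OF_CAPABILITIES_LIST(). */
enum op_type { OP_NONE = 0, OP_TURBO_DEC, OP_TURBO_ENC };

struct op_cap {
    enum op_type type;
    uint32_t capability_flags;
};

/* Walk the capabilities list and check that the device supports the
 * given operation type with all required flags set. */
static bool device_supports(const struct op_cap *caps,
                            enum op_type type, uint32_t required_flags)
{
    for (; caps->type != OP_NONE; caps++)
        if (caps->type == type)
            return (caps->capability_flags & required_flags) == required_flags;
    return false;
}
```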
This allows the user to query a specific bbdev PMD and get all the device capabilities. The
rte_bbdev_info structure provides two levels of information:
• Device-relevant information, such as the name and the related rte_bus.
• Driver-specific information, as defined by the struct rte_bbdev_driver_info structure;
this is where the capabilities reside, along with other specifics such as maximum queue sizes and
priority level.
struct rte_bbdev_info {
int socket_id;
const char *dev_name;
const struct rte_device *device;
uint16_t num_queues;
bool started;
struct rte_bbdev_driver_info drv;
};
The dequeue API uses the same format as the enqueue API, but the num_ops and ops parameters
now specify the maximum number of processed operations the user wishes to retrieve and the location
in which to store them. The call returns the actual number of processed operations returned; this can
never be larger than num_ops.
uint16_t rte_bbdev_dequeue_enc_ops(uint16_t dev_id, uint16_t queue_id,
struct rte_bbdev_enc_op **ops, uint16_t num_ops)
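The burst contract above (a dequeue returns at most num_ops operations, possibly fewer) can be illustrated with a plain software ring. This is a hedged stand-in for intuition only, not the bbdev queue implementation:

```c
#include <stdint.h>
#include <stddef.h>

#define QUEUE_CAP 8 /* illustrative queue depth */

struct op_queue {
    void *slots[QUEUE_CAP];
    unsigned int head, tail, count;
};

/* Enqueue up to nb_ops pointers; returns how many were accepted. */
static uint16_t q_enqueue(struct op_queue *q, void **ops, uint16_t nb_ops)
{
    uint16_t n = 0;
    while (n < nb_ops && q->count < QUEUE_CAP) {
        q->slots[q->tail] = ops[n++];
        q->tail = (q->tail + 1) % QUEUE_CAP;
        q->count++;
    }
    return n;
}

/* Dequeue up to nb_ops pointers; the return value is never larger
 * than nb_ops, and may be smaller if fewer operations are ready. */
static uint16_t q_dequeue(struct op_queue *q, void **ops, uint16_t nb_ops)
{
    uint16_t n = 0;
    while (n < nb_ops && q->count > 0) {
        ops[n++] = q->slots[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
    }
    return n;
}
```

An application polling a real bbdev queue follows the same pattern: request a full burst and act on however many operations actually come back.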
struct rte_bbdev_dec_op {
    int status;
    struct rte_mempool *mempool;
    void *opaque_data;
    union {
        struct rte_bbdev_op_turbo_dec turbo_dec;
        struct rte_bbdev_op_ldpc_dec ldpc_dec;
    };
};
The operation structure itself defines the operation type. It includes the operation status and a reference
to the operation-specific data, which can vary in size and content depending on the operation being
provisioned. It also records the source mempool for the operation, if it was allocated from a mempool.
If bbdev operations are allocated from a bbdev operation mempool (see next section), there is also the
ability to allocate private memory with the operation for application purposes.
Application software is responsible for specifying all the operation specific fields in the
rte_bbdev_*_op structure which are then used by the bbdev PMD to process the requested op-
eration.
struct rte_bbdev_op_data {
    struct rte_mbuf *data;
    uint32_t offset;
    uint32_t length;
};

struct rte_bbdev_op_turbo_enc {
    struct rte_bbdev_op_data input;
    struct rte_bbdev_op_data output;
    uint32_t op_flags;
    uint8_t rv_index;
    uint8_t code_block_mode;
    union {
        struct rte_bbdev_op_enc_cb_params cb_params;
        struct rte_bbdev_op_enc_tb_params tb_params;
    };
};
The Turbo encode structure includes the input and output mbuf data pointers. The provided input
mbuf must be large enough to accommodate any appended CRC trailers.
The length is the total size of the CBs, inclusive of any CRC24A and CRC24B appended by the
application.
When a single CB belonging to a TB is enqueued individually to BBDEV, it is treated as a special case
of a partial TB whose number of CBs is 1, and must therefore be processed in TB-mode.
The figure below visualizes the encoding of CBs using BBDEV interface in TB-mode. CB-mode is a
reduced version, where only one CB exists:
[Figure: Turbo encoding of CBs using the BBDEV interface in TB-mode. The raw TB is given as a
contiguous buffer (possibly split across mbuf segments) delimited by offset and length, holding
CBs CB1..CBc of size k_neg or k_pos. Three input variants are shown: all per-CB CRC24B values
and the CRC24A pre-calculated by the application; only CRC24A pre-calculated, with
RTE_BBDEV_TURBO_CRC_24B_ATTACH set in op_flags; and CRC24A pre-calculated with
RTE_BBDEV_TURBO_CRC_24B_ATTACH set, the TB spanning two mbuf segments. Each CB is encoded
to a rate-matched output of size ea or eb.]
struct rte_bbdev_op_turbo_dec {
    struct rte_bbdev_op_data input;
    struct rte_bbdev_op_data hard_output;
    struct rte_bbdev_op_data soft_output;
    uint32_t op_flags;
    uint8_t rv_index;
    uint8_t iter_min:4;
    uint8_t iter_max:4;
    uint8_t iter_count;
    uint8_t ext_scale;
    uint8_t num_maps;
    uint8_t code_block_mode;
    union {
        struct rte_bbdev_op_dec_cb_params cb_params;
        struct rte_bbdev_op_dec_tb_params tb_params;
    };
};
The Turbo decode structure includes the input, hard_output and optionally the soft_output
mbuf data pointers.
The first CB Virtual Circular Buffer (VCB) index is given by r but the number of the remaining CB
VCBs is calculated automatically by BBDEV before passing down to the driver.
The number of remaining CB VCBs should not be confused with c. c is the total number of CBs that
composes the whole TB (this maps to C as described in 3GPP TS 36.212 section 5.1.2).
The length is the total size of the CBs, inclusive of any CRC24A and CRC24B appended by the
application.
When a single CB belonging to a TB is enqueued individually to BBDEV, it is treated as a special case
of a partial TB whose number of CBs is 1, and must therefore be processed in TB-mode.
The output mbuf data structure is expected to be allocated by the application with enough room for the
output data.
The figure below visualizes the decoding of CBs using BBDEV interface in TB-mode. CB-mode is a
reduced version, where only one CB exists:
[Figure: Turbo decoding of CBs using the BBDEV interface in TB-mode. The input LLR buffer,
delimited by offset and length (possibly split across two mbuf segments), holds the CBs of size
k_neg or k_pos. The result is decoded back into the given output mbuf as one contiguous buffer,
containing the CBs with their per-CB CRC24B values and the trailing CRC24A.]
NOTE: The actual operation flags that may be used with a specific BBDEV PMD are
dependent on the driver capabilities as reported via rte_bbdev_info_get(), and may
be a subset of those below.
Description of LDPC encode capability flags
RTE_BBDEV_LDPC_INTERLEAVER_BYPASS Set to bypass bit-level interleaver on output
stream
The structure passed for each LDPC encode operation is given below, with the operation flags forming
a bitmask in the op_flags field.
struct rte_bbdev_op_ldpc_enc {
uint32_t op_flags;
uint8_t rv_index;
uint8_t basegraph;
uint16_t z_c;
uint16_t n_cb;
uint8_t q_m;
uint16_t n_filler;
uint8_t code_block_mode;
union {
struct rte_bbdev_op_enc_ldpc_cb_params cb_params;
struct rte_bbdev_op_enc_ldpc_tb_params tb_params;
};
};
The LDPC encode parameters are set out in the table below.
Parameter Description
input input CB or TB data
output rate matched CB or TB output buffer
op_flags bitmask of all active operation capabilities
rv_index redundancy version index [0..3]
basegraph Basegraph 1 or 2
z_c Zc, LDPC lifting size
n_cb Ncb, length of the circular buffer in bits.
q_m Qm, modulation order {2,4,6,8,10}
n_filler number of filler bits
code_block_mode code block or transport block mode
cb_params code block specific parameters (code block mode only)
e E, length of the rate matched output sequence in bits
tb_params transport block specific parameters (transport block mode only)
c number of CBs in the TB or partial TB
r index of the first CB in the inbound mbuf data
c_ab number of CBs that use Ea before switching to Eb
ea Ea, length of the RM output sequence in bits, for CBs with index r < c_ab
eb Eb, length of the RM output sequence in bits, for CBs with index r >= c_ab
The input mbuf is mandatory for all BBDEV PMDs and carries the incoming code block or transport
block data.
The output mbuf is mandatory and carries the encoded CB(s). In CB-mode it contains the encoded
CB of size e (E in 3GPP TS 38.212 section 6.2.5). In TB-mode it contains multiple contiguous encoded
CBs of size ea or eb. The output buffer is allocated by the application with enough room for the
output data.
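As a worked example of the ea/eb split above, the total contiguous output size for an enqueued CB group can be computed as follows. The function name and the assumption that the group spans CB indices r through c-1 are illustrative, not part of the bbdev API:

```c
#include <stdint.h>

/* Sum of rate-matched CB sizes for CBs r..c-1 of a TB: CBs with
 * index below cab use Ea bits, the remainder use Eb bits (this
 * mirrors the ea/eb/c_ab parameters in the table above). */
static uint32_t tb_enc_output_bits(uint8_t c, uint8_t r, uint8_t cab,
                                   uint32_t ea, uint32_t eb)
{
    uint32_t total = 0;
    for (uint8_t i = r; i < c; i++)
        total += (i < cab) ? ea : eb;
    return total;
}
```

For instance, a 4-CB TB with cab = 2, Ea = 100 and Eb = 120 needs 2*100 + 2*120 = 440 bits of contiguous output when enqueued from r = 0.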
The encode interface works on both a code block (CB) and a transport block (TB) basis.
NOTE: All enqueued ops in one rte_bbdev_enqueue_enc_ops() call belong to
one mode, either CB-mode or TB-mode.
The valid modes of operation are:
• CB-mode: one CB (attach CRC24B if required)
• CB-mode: one CB making up one TB (attach CRC24A if required)
• TB-mode: one or more CB of a partial TB (attach CRC24B(s) if required)
• TB-mode: one or more CB of a complete TB (attach CRC24B(s) and CRC24A if required)
In CB-mode if RTE_BBDEV_LDPC_CRC_24A_ATTACH is set then CRC24A is appended to the CB.
If RTE_BBDEV_LDPC_CRC_24A_ATTACH is not set the application is responsible for calculating and
appending CRC24A before calling BBDEV. The input data mbuf length is inclusive of CRC24A/B
where present and is equal to the code block size K.
In TB-mode, CRC24A is assumed to be pre-calculated and appended to the inbound TB data buffer,
unless the RTE_BBDEV_LDPC_CRC_24A_ATTACH flag is set, in which case it is the responsibility of BBDEV.
The input data mbuf length is total size of the CBs inclusive of any CRC24A and CRC24B in the case
they were appended by the application.
Not all BBDEV PMDs may be capable of CRC24A/B calculation. The flags
RTE_BBDEV_LDPC_CRC_24A_ATTACH and RTE_BBDEV_LDPC_CRC_24B_ATTACH inform the
application of the relevant capability. These flags can be set in the op_flags parameter to instruct
BBDEV to calculate and append the CRC to the CB before proceeding with LDPC encoding.
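When the ATTACH capability is absent or the flag is not set, the application must compute the CRC itself. A bitwise sketch of CRC24A using the 3GPP generator polynomial 0x1864CFB (initial value 0, MSB-first, no final XOR) is shown below; a real application would normally use a table-driven or hardware implementation instead:

```c
#include <stdint.h>
#include <stddef.h>

/* CRC24A per 3GPP TS 36.212/38.212: polynomial 0x1864CFB,
 * initial value 0, MSB-first, no final XOR. */
static uint32_t crc24a(const uint8_t *data, size_t len)
{
    uint32_t crc = 0;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)data[i] << 16;
        for (int bit = 0; bit < 8; bit++) {
            crc <<= 1;
            if (crc & 0x1000000)
                crc ^= 0x1864CFB;
        }
    }
    return crc & 0xFFFFFF;
}
```

Appending the three CRC bytes (most significant first) to the message makes the CRC of the extended message zero, which is a convenient self-check.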
The difference between the partial and full-size TB is that BBDEV needs the index of the first CB in this
group and the number of CBs in the group. The first CB index is given by r but the number of the CBs
is calculated by BBDEV before signalling to the driver.
The number of CBs in the group should not be confused with c, the total number of CBs in the full TB
(C as per 3GPP TS 38.212 section 5.2.2)
Figure Fig. 15.1 above showing the Turbo encoding of CBs using BBDEV interface in TB-mode is also
valid for LDPC encode.
The structure passed for each LDPC decode operation is given below, with the operation flags forming
a bitmask in the op_flags field.
struct rte_bbdev_op_ldpc_dec {
uint32_t op_flags;
uint8_t rv_index;
uint8_t basegraph;
uint16_t z_c;
uint16_t n_cb;
uint8_t q_m;
uint16_t n_filler;
uint8_t iter_max;
uint8_t iter_count;
uint8_t code_block_mode;
union {
struct rte_bbdev_op_dec_ldpc_cb_params cb_params;
struct rte_bbdev_op_dec_ldpc_tb_params tb_params;
};
};
The LDPC decode parameters are set out in the table below.
Parameter Description
input input CB or TB data
hard_output hard decisions buffer, decoded output
soft_output soft LLR output buffer (optional)
harq_comb_input HARQ combined input buffer (optional)
harq_comb_output HARQ combined output buffer (optional)
op_flags bitmask of all active operation capabilities
rv_index redundancy version index [0..3]
basegraph Basegraph 1 or 2
z_c Zc, LDPC lifting size
n_cb Ncb, length of the circular buffer in bits.
q_m Qm, modulation order {1,2,4,6,8} from pi/2-BPSK to 256QAM
n_filler number of filler bits
iter_max maximum number of iterations to perform when decoding all CBs
iter_count number of iterations performed when decoding all CBs
code_block_mode code block or transport block mode
cb_params code block specific parameters (code block mode only)
e E, length of the rate matched output sequence in bits
tb_params transport block specific parameters (transport block mode only)
c number of CBs in the TB or partial TB
r index of the first CB in the inbound mbuf data
c_ab number of CBs that use Ea before switching to Eb
ea Ea, length of the RM output sequence in bits, for CBs with index r < c_ab
eb Eb, length of the RM output sequence in bits, for CBs with index r >= c_ab
The input mbuf, carrying the encoded CB data, is mandatory for all BBDEV PMDs and is the Virtual
Circular Buffer data stream with null padding. Each byte in the input circular buffer is the LLR value of each bit
/* EAL Init */
ret = rte_eal_init(argc, argv);
if (ret < 0)
rte_exit(EXIT_FAILURE, "Invalid EAL arguments\n");
while (!global_exit_flag) {
/* set op */
ops_burst[j]->turbo_enc.input.offset =
sizeof(struct rte_ether_hdr);
ops_burst[j]->turbo_enc.input.length =
rte_pktmbuf_pkt_len(bbdev_pkts[j]);
ops_burst[j]->turbo_enc.input.data =
input_pkts_burst[j];
ops_burst[j]->turbo_enc.output.offset =
sizeof(struct rte_ether_hdr);
ops_burst[j]->turbo_enc.output.data =
output_pkts_burst[j];
}
SIXTEEN
CRYPTOGRAPHY DEVICE LIBRARY
The cryptodev library provides a Crypto device framework for management and provisioning of hard-
ware and software Crypto poll mode drivers, defining generic APIs which support a number of dif-
ferent Crypto operations. The framework currently only supports cipher, authentication, chained ci-
pher/authentication and AEAD symmetric and asymmetric Crypto operations.
Note:
• If a DPDK application requires multiple software crypto PMD devices then the required number of
--vdev arguments, with the appropriate libraries, must be added.
• An application with multiple crypto PMD instances sharing the same library must specify a unique ID for each instance.
Example: --vdev 'crypto_aesni_mb0' --vdev 'crypto_aesni_mb1'
Programmer’s Guide, Release 19.11.10
The rte_cryptodev_config structure is used to pass the configuration parameters for socket se-
lection and number of queue pairs.
struct rte_cryptodev_config {
int socket_id;
/**< Socket to allocate resources on */
uint16_t nb_queue_pairs;
/**< Number of queue pairs to configure on device */
};
struct rte_cryptodev_qp_conf {
uint32_t nb_descriptors; /**< Number of descriptors per queue pair */
struct rte_mempool *mp_session;
/**< The mempool for creating session in sessionless mode */
struct rte_mempool *mp_session_private;
/**< The mempool for creating sess private data in sessionless mode */
};
The fields mp_session and mp_session_private are used for creating temporary sessions to
process crypto operations in session-less mode. They can be the same mempool or different mempools.
Please note that not all Cryptodev PMDs support session-less mode.
struct rte_cryptodev_capabilities;
Each Crypto poll mode driver defines its own private array of capabilities for the operations it sup-
ports. Below is an example of the capabilities for a PMD which supports the authentication algorithm
SHA1_HMAC and the cipher algorithm AES_CBC.
static const struct rte_cryptodev_capabilities pmd_capabilities[] = {
{ /* SHA1 HMAC */
.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
.sym = {
.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
.auth = {
.algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
.block_size = 64,
.key_size = {
.min = 64,
.max = 64,
.increment = 0
},
.digest_size = {
.min = 12,
.max = 12,
.increment = 0
},
.aad_size = { 0 },
.iv_size = { 0 }
}
}
},
{ /* AES CBC */
.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
.sym = {
.xform_type = RTE_CRYPTO_SYM_XFORM_CIPHER,
.cipher = {
.algo = RTE_CRYPTO_CIPHER_AES_CBC,
.block_size = 16,
.key_size = {
.min = 16,
.max = 32,
.increment = 8
},
.iv_size = {
.min = 16,
.max = 16,
.increment = 0
}
}
}
}
};
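The {min, max, increment} triples above define the set of sizes a PMD accepts for keys, digests, IVs and AAD. A small helper checking a size against such a range is sketched below; it is an illustrative function, not a cryptodev API:

```c
#include <stdbool.h>
#include <stdint.h>

/* Check a parameter size against a {min, max, increment} capability
 * range like the key_size/digest_size fields above. increment == 0
 * means only `min` is valid; otherwise valid sizes step from min
 * up to max in units of increment. */
static bool param_size_supported(uint16_t size, uint16_t min,
                                 uint16_t max, uint16_t increment)
{
    if (size < min || size > max)
        return false;
    if (increment == 0)
        return size == min;
    return (size - min) % increment == 0;
}
```

With the AES_CBC range above ({16, 32, 8}) the valid key sizes are 16, 24 and 32 bytes, while the SHA1_HMAC range ({64, 64, 0}) accepts 64 bytes only.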
This allows the user to query a specific Crypto PMD and get all the device features and capabilities. The
rte_cryptodev_info structure contains all the relevant information for the device.
struct rte_cryptodev_info {
uint64_t feature_flags;
unsigned max_nb_queue_pairs;
struct {
unsigned max_nb_sessions;
} sym;
};
void * rte_cryptodev_sym_session_get_user_data(
struct rte_cryptodev_sym_session *sess);
Please note the size passed to the set API cannot be bigger than the predefined user_data_sz given
when creating the session header mempool; otherwise the function will return an error. Also, when
user_data_sz was defined as 0 when creating the session header mempool, the get API will always
return NULL.
For session-less mode, the private user data information can be placed along with the struct
rte_crypto_op. The rte_crypto_op::private_data_offset indicates the start of pri-
vate data information. The offset is counted from the start of the rte_crypto_op including other crypto
information such as the IVs (since there can be an IV also for authentication).
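The offset arithmetic just described can be sketched with stand-in sizes. The mock struct sizes below are assumptions purely for illustration, not the real rte_crypto_op/rte_crypto_sym_op layouts:

```c
#include <stdint.h>

/* Stand-in sizes only; the real layouts come from rte_crypto.h. */
struct mock_crypto_op { uint8_t bytes[24]; };
struct mock_sym_op    { uint8_t bytes[40]; };

/* The IV is conventionally placed right after the symmetric op... */
static uint16_t iv_offset(void)
{
    return (uint16_t)(sizeof(struct mock_crypto_op) +
                      sizeof(struct mock_sym_op));
}

/* ...and private_data_offset then points past the IV area,
 * counted from the start of the crypto operation. */
static uint16_t private_data_offset(uint16_t iv_length)
{
    return (uint16_t)(iv_offset() + iv_length);
}
```

The point is simply that the private data offset must account for everything laid out after the op header, including any IVs.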
returns the number of operations it actually enqueued for processing; a return value equal to nb_ops
means that all packets have been enqueued.
uint16_t rte_cryptodev_enqueue_burst(uint8_t dev_id, uint16_t qp_id,
struct rte_crypto_op **ops, uint16_t nb_ops)
The dequeue API uses the same format as the enqueue API, but the nb_ops and ops parameters
now specify the maximum number of processed operations the user wishes to retrieve and the location
in which to store them. The call returns the actual number of processed operations returned; this can
never be larger than nb_ops.
uint16_t rte_cryptodev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
struct rte_crypto_op **ops, uint16_t nb_ops)
[Figure: crypto operation structure with its trailing private data area.]
The operation structure includes the operation type, the operation status and the session type (session-
based/less), plus a reference to the operation-specific data, which can vary in size and content depending
on the operation being provisioned. It also records the source mempool for the operation, if it was
allocated from a mempool.
If Crypto operations are allocated from a Crypto operation mempool (see next section), there is also the
ability to allocate private memory with the operation for application purposes.
Application software is responsible for specifying all the operation specific fields in the
rte_crypto_op structure which are then used by the Crypto PMD to process the requested oper-
ation.
[Figure: rte_cryptodev_sym_session internal layout — a header holding nb_drivers,
user_data_sz and opaque_data, followed by a per-driver session_data[] array of
{void *data; uint16_t refcnt;} entries, each pointing to a crypto driver's private session
data, and finally the user_data area.]
metric transform chain is used to specify the operation and its parameters. See the section below for
details on transforms.
When a session is no longer used, the user must call rte_cryptodev_sym_session_clear() for
each of the crypto devices that are using the session, to free all driver private session data. Once this is
done, the session should be freed using rte_cryptodev_sym_session_free(), which returns it to
its mempool.
};
};
The API does not place a limit on the number of transforms that can be chained together but this will be
limited by the underlying Crypto device poll mode driver which is processing the operation.
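A transform chain is just a singly linked list. The mock below mirrors the shape of rte_crypto_sym_xform (a type tag plus a next pointer) to show how a PMD would walk it; the struct here is a simplified stand-in, not the real definition:

```c
#include <stddef.h>

enum xform_type { XFORM_AUTH, XFORM_CIPHER, XFORM_AEAD };

/* Simplified stand-in for rte_crypto_sym_xform: a next pointer
 * forming the chain (NULL terminates it) and a type tag. */
struct sym_xform {
    struct sym_xform *next;
    enum xform_type type;
};

/* Walk the chain, as a PMD does when validating a session. */
static int chain_length(const struct sym_xform *x)
{
    int n = 0;
    for (; x != NULL; x = x->next)
        n++;
    return n;
}
```

A chained cipher/authentication operation is then simply two linked transforms, and a PMD rejects chains it cannot process.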
[Figure: symmetric transform (struct rte_crypto_sym_xform) chaining — each element
carries a transform type (enum rte_crypto_sym_xform_type), a next pointer
(struct rte_crypto_sym_xform *) to the following transform, and the transform parameters
(struct rte_crypto_auth_xform, struct rte_crypto_cipher_xform or
struct rte_crypto_aead_xform).]
union {
struct rte_cryptodev_sym_session *session;
/**< Handle for the initialised session context */
struct rte_crypto_sym_xform *xform;
/**< Session-less API Crypto operation parameters */
};
union {
struct {
struct {
uint32_t offset;
uint32_t length;
} data; /**< Data offsets and length for AEAD */
struct {
uint8_t *data;
rte_iova_t phys_addr;
} digest; /**< Digest parameters */
struct {
uint8_t *data;
rte_iova_t phys_addr;
} aad;
/**< Additional authentication parameters */
} aead;
struct {
struct {
struct {
uint32_t offset;
uint32_t length;
} data; /**< Data offsets and length for ciphering */
} cipher;
struct {
struct {
uint32_t offset;
uint32_t length;
} data;
/**< Data offsets and length for authentication */
struct {
uint8_t *data;
rte_iova_t phys_addr;
} digest; /**< Digest parameters */
} auth;
};
};
};
/* Initialize EAL. */
ret = rte_eal_init(argc, argv);
if (ret < 0)
rte_exit(EXIT_FAILURE, "Invalid EAL arguments\n");
/*
* The IV is always placed after the crypto operation,
* so some private data is required to be reserved.
*/
unsigned int crypto_op_private_data = AES_CBC_IV_LENGTH;
#ifdef USE_TWO_MEMPOOLS
/* Create session mempool for the session header. */
session_pool = rte_cryptodev_sym_session_pool_create("session_pool",
MAX_SESSIONS,
0,
POOL_CACHE_SIZE,
0,
socket_id);
#else
/* Use of the same mempool for session header and private data */
session_pool = rte_cryptodev_sym_session_pool_create("session_pool",
MAX_SESSIONS * 2,
session_size,
POOL_CACHE_SIZE,
0,
socket_id);
session_priv_pool = session_pool;
#endif
if (rte_cryptodev_start(cdev_id) < 0)
rte_exit(EXIT_FAILURE, "Failed to start device\n");
if (rte_cryptodev_sym_session_init(cdev_id, session,
&cipher_xform, session_priv_pool) < 0)
rte_exit(EXIT_FAILURE, "Session could not be initialized "
"for the crypto device\n");
generate_random_bytes(iv_ptr, AES_CBC_IV_LENGTH);
op->sym->cipher.data.offset = 0;
op->sym->cipher.data.length = BUFFER_SIZE;
/*
* Dequeue the crypto operations until all the operations
* are processed in the crypto device.
*/
uint16_t num_dequeued_ops, total_num_dequeued_ops = 0;
do {
struct rte_crypto_op *dequeued_ops[BURST_SIZE];
num_dequeued_ops = rte_cryptodev_dequeue_burst(cdev_id, 0,
dequeued_ops, BURST_SIZE);
total_num_dequeued_ops += num_dequeued_ops;
/* Initialize EAL. */
ret = rte_eal_init(argc, argv);
if (ret < 0)
/*
* Create session mempool, with two objects per session,
* one for the session header and another one for the
* private asym session data for the crypto device.
*/
asym_session_pool = rte_mempool_create("asym_session_pool",
MAX_ASYM_SESSIONS * 2,
asym_session_size,
0,
0, NULL, NULL, NULL,
NULL, socket_id,
0);
if (rte_cryptodev_queue_pair_setup(cdev_id, 0, &qp_conf,
socket_id, asym_session_pool) < 0)
rte_exit(EXIT_FAILURE, "Failed to setup queue pair\n");
if (rte_cryptodev_start(cdev_id) < 0)
rte_exit(EXIT_FAILURE, "Failed to start device\n");
.xform_type = RTE_CRYPTO_ASYM_XFORM_MODEX,
.modex = {
.modulus = {
.data =
(uint8_t *)
("\xb3\xa1\xaf\xb7\x13\x08\x00\x0a\x35\xdc\x2b\x20\x8d"
"\xa1\xb5\xce\x47\x8a\xc3\x80\xf4\x7d\x4a\xa2\x62\xfd\x61\x7f"
"\xb5\xa8\xde\x0a\x17\x97\xa0\xbf\xdf\x56\x5a\x3d\x51\x56\x4f"
"\x70\x70\x3f\x63\x6a\x44\x5b\xad\x84\x0d\x3f\x27\x6e\x3b\x34"
"\x91\x60\x14\xb9\xaa\x72\xfd\xa3\x64\xd2\x03\xa7\x53\x87\x9e"
"\x88\x0b\xc1\x14\x93\x1a\x62\xff\xb1\x5d\x74\xcd\x59\x63\x18"
"\x11\x3d\x4f\xba\x75\xd4\x33\x4e\x23\x6b\x7b\x57\x44\xe1\xd3"
"\x03\x13\xa6\xf0\x8b\x60\xb0\x9e\xee\x75\x08\x9d\x71\x63\x13"
"\xcb\xa6\x81\x92\x14\x03\x22\x2d\xde\x55"),
.length = 128
},
.exponent = {
.data = (uint8_t *)("\x01\x00\x01"),
.length = 3
}
}
};
/* Create asym crypto session and initialize it for the crypto device. */
struct rte_cryptodev_asym_session *asym_session;
asym_session = rte_cryptodev_asym_session_create(asym_session_pool);
if (asym_session == NULL)
rte_exit(EXIT_FAILURE, "Session could not be created\n");
if (rte_cryptodev_asym_session_init(cdev_id, asym_session,
&modex_xform, asym_session_pool) < 0)
rte_exit(EXIT_FAILURE, "Session could not be initialized "
"for the crypto device\n");
/*
* Dequeue the crypto operations until all the operations
* are processed in the crypto device.
*/
uint16_t num_dequeued_ops, total_num_dequeued_ops = 0;
do {
struct rte_crypto_op *dequeued_ops[1];
num_dequeued_ops = rte_cryptodev_dequeue_burst(cdev_id, 0,
dequeued_ops, 1);
total_num_dequeued_ops += num_dequeued_ops;
SEVENTEEN
COMPRESSION DEVICE LIBRARY
The compression framework provides a generic set of APIs to perform compression services, as well as
to query and configure compression devices, both physical (hardware) and virtual (software), to perform
those services. The framework currently only supports lossless compression schemes: Deflate and LZS.
Note:
• If a DPDK application requires multiple software compression PMD devices then the required number
of --vdev arguments, with the appropriate libraries, must be added.
• An application with multiple compression device instances exposed by the same PMD must spec-
ify a unique name for each device.
Example: --vdev 'pmd0' --vdev 'pmd1'
• A device name used to designate the compression device in console messages, for administration
or debugging purposes.
17.2.1 Capabilities
Each PMD has a list of capabilities, including the algorithms listed in enum rte_comp_algorithm,
their associated feature flags, and the sliding window range expressed as a log base 2 value. The sliding
window gives the minimum and maximum size of the lookup window the algorithm uses to find duplicates.
See DPDK API Reference for details.
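Since the window range is reported in log base 2, converting a capability value to a byte count is a single shift; for example, a reported value of 15 corresponds to the 32 KB history window used by Deflate:

```c
#include <stdint.h>

/* Convert a sliding-window capability, expressed in log base 2
 * as reported by compressdev, into a window size in bytes. */
static uint32_t window_bytes(uint8_t window_log2)
{
    return (uint32_t)1 << window_log2;
}
```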
Each Compression poll mode driver defines its array of capabilities for each algorithm it supports. See
PMD implementation for capability initialization.
multiple sequential enqueue_burst() calls for each of them processing them statefully. See Compression
API Stateful Operation for stateful processing of ops.
17.4 Transforms
Compression transforms (rte_comp_xform) are the mechanism to specify the details of the com-
pression operation such as algorithm, window size and checksum.
[Figure: stateless ops — multiple in-flight ops may share a single priv_xform handle, or each op
may reference its own priv_xform.]
if (rte_compressdev_queue_pair_setup(cdev_id, 0, NUM_MAX_INFLIGHT_OPS,
socket_id()) < 0)
rte_exit(EXIT_FAILURE, "Failed to setup queue pair\n");
if (rte_compressdev_start(cdev_id) < 0)
rte_exit(EXIT_FAILURE, "Failed to start device\n");
op->src.offset = 0;
op->dst.offset = 0;
op->src.length = OP_LEN;
op->input_chksum = 0;
setup op->m_src and op->m_dst;
}
num_enqd = rte_compressdev_enqueue_burst(cdev_id, 0, comp_ops, NUM_OPS);
/* wait for this to complete before enqueuing next */
do {
num_deqd = rte_compressdev_dequeue_burst(cdev_id, 0, &processed_ops, NUM_OPS);
} while (num_deqd < num_enqd);
RTE_COMP_FLUSH_FULL/FINAL.
In the case of either one or all of the above conditions, the PMD initiates stateful processing and releases
the acquired resources once the operation with flush value = RTE_COMP_FLUSH_FULL/FINAL has
been processed. Unlike stateless processing, the application can enqueue only one stateful op from a
particular stream at a time and must attach the stream handle to each op.
[Figure: stateful ops — ops belonging to the same stream reference the stream handle and are
processed one at a time.]
if (rte_compressdev_queue_pair_setup(cdev_id, 0, NUM_MAX_INFLIGHT_OPS,
socket_id()) < 0)
rte_exit(EXIT_FAILURE, "Failed to setup queue pair\n");
if (rte_compressdev_start(cdev_id) < 0)
rte_exit(EXIT_FAILURE, "Failed to start device\n");
/* create stream */
void *stream;
rte_compressdev_stream_create(cdev_id, &compress_xform, &stream);
function returns the number of operations it actually enqueued for processing; a return value equal to
nb_ops means that all packets have been enqueued.
The dequeue API uses the same format as the enqueue API, but the nb_ops and ops parameters are
now used to specify the maximum number of processed operations the user wishes to retrieve and the
location in which to store them. The call returns the actual number of processed operations returned;
this can never be larger than nb_ops.
EIGHTEEN
SECURITY LIBRARY
The security library provides a framework for management and provisioning of security protocol oper-
ations offloaded to hardware based devices. The library defines generic APIs to create and free security
sessions which can support full protocol offload as well as inline crypto operation with NIC or crypto
devices. The framework currently only supports the IPsec and PDCP protocols and their associated
operations; other protocols will be added in future releases.
Note: Currently, the security library does not support the case of multi-process. It will be updated in
the future releases.
Note: The underlying device may not support crypto processing for all ingress packets matching a
particular flow (e.g. fragmented packets); such packets will be passed up as encrypted packets. It is the
responsibility of the application to process such encrypted packets using another crypto driver instance.
Egress Data path - The software prepares the egress packet by adding the relevant security protocol
headers; the data itself, however, is not encrypted by the software. The driver configures the tx
descriptors accordingly, and the hardware device encrypts the data before sending the packet out.
Note: The underlying device in this case is stateful. It is expected that the device supports crypto
processing for all kinds of packets matching a given flow, including fragmented packets (post
reassembly). E.g. in the case of IPsec the device may internally manage anti-replay etc. It will provide a
configuration option for anti-replay behavior, i.e. to drop the packets or pass them to the driver with error
flags set in the descriptor.
Egress Data path - The software will send the plain packet without any security protocol headers added
to the packet. The driver will configure the security index and other requirements in the tx descriptors. The
hardware device will do security processing on the packet that includes adding the relevant protocol
headers and encrypting the data before sending the packet out. The software should make sure that the
buffer has required head room and tail room for any protocol header addition. The software may also do
early fragmentation if the resultant packet is expected to cross the MTU size.
Note: The underlying device will manage state information required for egress processing. E.g. in case
of IPsec, the seq number will be added to the packet, however the device shall provide indication when
the sequence number is about to overflow. The underlying device may support post encryption TSO.
in case of IPsec, IPsec tunnel headers (if any), ESP/AH headers will be removed from the packet and
the decrypted packet may contain plain data only.
Note: In case of IPsec the device may internally manage anti-replay etc. It will provide a configuration
option for anti-replay behavior i.e. to drop the packets or pass them to driver with error flags set in
descriptor.
Encryption: The software will submit the packet to cryptodev as usual for encryption, the hardware
device in this case will also add the relevant security protocol header along with encrypting the packet.
The software should make sure that the buffer has required head room and tail room for any protocol
header addition.
Note: In the case of IPsec, the seq number will be added to the packet; the device shall provide an
indication when the sequence number is about to overflow.
| |
+---------|----------+ +-----------|----------+
| Header Compression*| | Header Decompression*|
| (Data-Plane only) | | (Data Plane only) |
+---------|----------+ +-----------|----------+
| |
+---------|-----------+ +-----------|----------+
| Integrity Protection| |Integrity Verification|
| (Control Plane only)| | (Control Plane only) |
+---------|-----------+ +-----------|----------+
+---------|-----------+ +----------|----------+
| Ciphering | | Deciphering |
+---------|-----------+ +----------|----------+
+---------|-----------+ +----------|----------+
| Add PDCP header | | Remove PDCP Header |
+---------|-----------+ +----------|----------+
| |
+----------------->>----------------+
Note:
• Header compression and decompression are not currently supported.
Just like IPsec, in the case of PDCP, header addition/deletion, ciphering/deciphering and integrity
protection/verification are done based on the action type chosen.
Each driver (crypto or ethernet) defines its own private array of capabilities for the operations it supports.
Below is an example of the capabilities for a PMD which supports the IPsec and PDCP protocol.
static const struct rte_security_capability pmd_security_capabilities[] = {
{ /* IPsec Lookaside Protocol offload ESP Tunnel Egress */
.action = RTE_SECURITY_ACTION_TYPE_LOOKASIDE_PROTOCOL,
.protocol = RTE_SECURITY_PROTOCOL_IPSEC,
.ipsec = {
.proto = RTE_SECURITY_IPSEC_SA_PROTO_ESP,
.mode = RTE_SECURITY_IPSEC_SA_MODE_TUNNEL,
.direction = RTE_SECURITY_IPSEC_SA_DIR_EGRESS,
.options = { 0 }
},
.crypto_capabilities = pmd_capabilities
},
{ /* IPsec Lookaside Protocol offload ESP Tunnel Ingress */
.action = RTE_SECURITY_ACTION_TYPE_LOOKASIDE_PROTOCOL,
.protocol = RTE_SECURITY_PROTOCOL_IPSEC,
.ipsec = {
.proto = RTE_SECURITY_IPSEC_SA_PROTO_ESP,
.mode = RTE_SECURITY_IPSEC_SA_MODE_TUNNEL,
.direction = RTE_SECURITY_IPSEC_SA_DIR_INGRESS,
.options = { 0 }
},
.crypto_capabilities = pmd_capabilities
},
{ /* PDCP Lookaside Protocol offload Data Plane */
.action = RTE_SECURITY_ACTION_TYPE_LOOKASIDE_PROTOCOL,
.protocol = RTE_SECURITY_PROTOCOL_PDCP,
.pdcp = {
.domain = RTE_SECURITY_PDCP_MODE_DATA,
.capa_flags = 0
},
.crypto_capabilities = pmd_capabilities
},
{ /* PDCP Lookaside Protocol offload Control */
.action = RTE_SECURITY_ACTION_TYPE_LOOKASIDE_PROTOCOL,
.protocol = RTE_SECURITY_PROTOCOL_PDCP,
.pdcp = {
.domain = RTE_SECURITY_PDCP_MODE_CONTROL,
.capa_flags = 0
},
.crypto_capabilities = pmd_capabilities
},
{
.action = RTE_SECURITY_ACTION_TYPE_NONE
}
};
static const struct rte_cryptodev_capabilities pmd_capabilities[] = {
{ /* SHA1 HMAC */
.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
.sym = {
.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
.auth = {
.algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
.block_size = 64,
.key_size = {
.min = 64,
.max = 64,
.increment = 0
},
.digest_size = {
.min = 12,
.max = 12,
.increment = 0
},
.aad_size = { 0 },
.iv_size = { 0 }
}
}
},
{ /* AES CBC */
.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
.sym = {
.xform_type = RTE_CRYPTO_SYM_XFORM_CIPHER,
.cipher = {
.algo = RTE_CRYPTO_CIPHER_AES_CBC,
.block_size = 16,
.key_size = {
.min = 16,
.max = 32,
.increment = 8
},
.iv_size = {
.min = 16,
.max = 16,
.increment = 0
}
}
}
}
};
This allows the user to query a specific driver and get all of the device's security capabilities. It returns an array of rte_security_capability structures, which contains all the capabilities for that device.
Note: In the case of inline processed packets, the rte_mbuf.udata64 field is used by the driver to relay information on the security processing associated with the packet. On ingress, the driver sets this field in the Rx path, while on egress rte_security_set_pkt_metadata() performs a similar operation. The application must not modify the field while it carries relevant information. On ingress, this device-specific 64-bit value is required to derive other information (such as the userdata) needed to identify the security processing done on the packet.
The configuration structure reuses the rte_crypto_sym_xform struct for crypto related configuration. The rte_security_session_action_type enum is used to specify whether the session is configured for Lookaside Protocol offload, Inline Crypto, or Inline Protocol offload.
enum rte_security_session_action_type {
        RTE_SECURITY_ACTION_TYPE_NONE,
        /**< No security actions */
        RTE_SECURITY_ACTION_TYPE_INLINE_CRYPTO,
        /**< Crypto processing for security protocol is processed inline
         * during transmission */
        RTE_SECURITY_ACTION_TYPE_INLINE_PROTOCOL,
        /**< All security protocol processing is performed inline during
         * transmission */
        RTE_SECURITY_ACTION_TYPE_LOOKASIDE_PROTOCOL
        /**< All security protocol processing including crypto is performed
         * on a lookaside accelerator */
};
Currently the library defines configuration parameters for IPsec and PDCP only. For other protocols such as MACsec, structures and enums are defined as placeholders which will be updated in the future.
IPsec related configuration parameters are defined in rte_security_ipsec_xform:
struct rte_security_ipsec_xform {
        uint32_t spi;
        /**< SA security parameter index */
        uint32_t salt;
        /**< SA salt */
                 |
        +--------V--------+
        |    Flow API     |
        +--------|--------+
                 |
        +--------V--------+
        |                 |
        |     NIC PMD     |    <------ Add/Remove SA to/from hw context
        |                 |
        +--------|--------+
                 |
        +--------|--------+
        |  HW ACCELERATED |
        |       NIC       |
        |                 |
        +--------|--------+
• Add/Delete SA flow: To add a new inline SA, construct a rte_flow_item for Ethernet + IP + ESP using the SA selectors, with the rte_crypto_ipsec_xform as the rte_flow_action.
Note that any rte_flow_items may be empty, which means they are not checked.
In its most basic form, IPsec flow specification is as follows:
+-------+ +----------+ +--------+ +-----+
| Eth | -> | IP4/6 | -> | ESP | -> | END |
+-------+ +----------+ +--------+ +-----+
However, the API can represent IPsec crypto offload with any encapsulation:
+-------+ +--------+ +-----+
| Eth | -> ... -> | ESP | -> | END |
+-------+ +--------+ +-----+
NINETEEN
RAWDEVICE LIBRARY
19.1 Introduction
In terms of device flavor (type) support, DPDK currently has ethernet (lib_ether), cryptodev (lib_cryptodev), eventdev (lib_eventdev) and vdev (virtual device) support.
For a new type of device, for example an accelerator, there are not many options except to: 1. create another lib/librte_MySpecialDev and driver/MySpecialDrv and use it through the Bus/PMD model; or 2. create a vdev and implement the necessary custom APIs, which are directly exposed from the driver layer. However, this may still require changes to bus code in DPDK.
The DPDK Rawdev library is an abstraction that provides the DPDK framework with a way to manage such devices in a generic manner, without requiring changes to the library or EAL for each device type. This library provides a generic set of operations and APIs for the framework and applications, respectively, to use for interfacing with such devices.
19.2 Design
Key factors guiding design of the Rawdevice library:
1. The following are some generic operations which can be treated as applicable to a large subset of device types. None of the operations are mandatory for a driver to implement. Applications should also be designed to properly handle unsupported APIs.
• Device Start/Stop - In some cases, ‘reset’ might also be required which has different semantics
than a start-stop-start cycle.
• Configuration - Device, Queue or any other sub-system configuration
• I/O - Sending a series of buffers which can enclose any arbitrary data
• Statistics - Fetch arbitrary device statistics
• Firmware Management - Firmware load/unload/status
2. Application API should be able to pass along arbitrary state information to/from device driver.
This can be achieved by maintaining context information through opaque data or pointers.
Figure below outlines the layout of the rawdevice library and device vis-a-vis other well known device
types like eth and crypto:
+-----------------------------------------------------------+
|                      Application(s)                       |
+------------------------------.----------------------------+
                               |
                               |
+------------------------------'----------------------------+
|                   DPDK Framework (APIs)                   |
+--------------|----|-----------------|---------------------+
              /      \                 \
      (crypto ops)  (eth ops)      (rawdev ops)        +----+
            /          \                 \             |DrvA|
   +-----'---+    +----`----+      +---'-----+         +----+
   | crypto  |    | ethdev  |      |   raw   |
   +--/------+    +---/-----+      +----/----+         +----+
     /\             __/\               /  ..........   |DrvB|
    /  \           /    \             /  ../           +----+
+====+ +====+  +====+ +====+      +==/=+         ```Bus Probe
|DevA| |DevB|  |DevC| |DevD|      |DevF|
+====+ +====+  +====+ +====+      +====+
   |      |       |      |           |
 ``|``````|```````|``````|```````````|``````````````Bus Scan
 (PCI)  (PCI)   (PCI)  (PCI)       (BusA)

* It is assumed above that DrvB is a PCI type driver which registers itself
  with the PCI Bus.
* Thereafter, when the PCI scan is done, during probe DrvB would match the
  rawdev DevF ID and take control of the device.
* Applications can then continue using the device through the rawdev API
  interfaces.
TWENTY
LINK BONDING POLL MODE DRIVER LIBRARY
In addition to Poll Mode Drivers (PMDs) for physical and virtual hardware, DPDK also includes a
pure-software library that allows physical PMDs to be bonded together to create a single logical PMD.
[Figure: the bonded ethdev sits inside DPDK between the user application and the slave devices]
The Link Bonding PMD library (librte_pmd_bond) supports bonding of groups of rte_eth_dev ports of the same speed and duplex to provide capabilities similar to those found in the Linux bonding driver, allowing the aggregation of multiple (slave) NICs into a single logical interface between a server and a switch. The new bonded PMD then processes these interfaces based on the specified mode of operation to provide support for features such as redundant links, fault tolerance and/or load balancing.
The librte_pmd_bond library exports a C API for the creation of bonded devices as well as the configuration and management of the bonded device and its slave devices.
Note: The Link Bonding PMD Library is enabled by default in the build configuration files; the library can be disabled by setting CONFIG_RTE_LIBRTE_PMD_BOND=n and recompiling the DPDK.
[Figures: packet distribution from the user application through the bonded ethdev to the slave ports under the round-robin, active-backup and balance bonding modes]
Note: The coloring differences of the packets are used to identify the different flow classifications calculated by the selected transmit policy.
[Figures: packet distribution from the user application through the bonded ethdev to the slave ports under the broadcast, link aggregation (802.3AD) and transmit load balancing bonding modes]
The bonding device stores its own version of the RSS settings, i.e. RETA, RSS hash function and RSS key, used to set up its slaves. This lets the RSS configuration of the bonding device be defined as the desired configuration of the whole bond (as one unit), without referring to any slave inside it. This is required to ensure consistency and to make the configuration more error-proof.
The RSS hash function set for the bonding device is the maximal set of RSS hash functions supported by all bonded slaves. The RETA size is the GCD of all the slaves' RETA sizes, so it can be easily used as a pattern providing the expected behavior even if the slave RETA sizes differ. If the RSS key is not set for the bonded device, it is not changed on the slaves and each device's default key is used.
As with the RSS configuration, flow consistency is kept across the bonded slaves for the following rte_flow operations:
Validate:
• Validate the flow for each slave; a failure for at least one slave causes the bond validation to fail.
Create:
• Create the flow in all slaves.
• Save all the slaves created flows objects in bonding internal flow structure.
• A failure in flow creation for an existing slave rejects the flow.
• A failure in flow creation for a new slave at slave-add time rejects the slave.
Destroy:
• Destroy the flow in all slaves and release the bond internal flow memory.
Flush:
• Destroy all the bonding PMD flows in all the slaves.
Note: Do not call the slaves' flush directly; it destroys all the slave flows, which may include external flows or the bond's internal LACP flow.
Query:
• Summarize flow counters from all the slaves, relevant only for
RTE_FLOW_ACTION_TYPE_COUNT.
Isolate:
• Call to flow isolate for all slaves.
• A failure in flow isolation for an existing slave rejects the isolate mode.
• A failure in flow isolation for a new slave at slave-add time rejects the slave.
All settings are managed through the bonding port API and always are propagated in one direction (from
bonding to slaves).
callback notification when a single slave changes state and the previous conditions are not met. If a user wishes to monitor individual slaves then they must register callbacks with that slave directly.
The link bonding library also supports devices which do not implement link status change interrupts. This is achieved by polling the device's link status at a defined period, which is set using the rte_eth_bond_link_monitoring_set API; the default polling interval is 10ms. When a device is added as a slave to a bonding device, the RTE_PCI_DRV_INTR_LSC flag is used to determine whether the device supports interrupts or whether its link status should be monitored by polling.
20.2.3 Configuration
Link bonding devices are created using the rte_eth_bond_create API which requires a unique
device name, the bonding mode, and the socket Id to allocate the bonding device’s resources on. The
other configurable parameters for a bonded device are its slave devices, its primary slave, a user defined
MAC address and transmission policy to use if the device is in balance XOR mode.
Slave Devices
Bonding devices support up to a maximum of RTE_MAX_ETHPORTS slave devices of the same speed
and duplex. Ethernet devices can be added as a slave to a maximum of one bonded device. Slave devices
are reconfigured with the configuration of the bonded device on being added to a bonded device.
The bonded device also guarantees to return the MAC address of a slave device to its original value upon removal of the slave from the bonded device.
Primary Slave
The primary slave is used to define the default port to use when a bonded device is in active backup mode. A different port will be used only if the current primary port goes down. If the user does not specify a primary port, it will default to being the first port added to the bonded device.
MAC Address
The bonded device can be configured with a user specified MAC address; this address will be inherited by some/all of the slave devices depending on the operating mode. If the device is in active backup mode then only the primary device will have the user specified MAC; all other slaves will retain their original MAC address. In modes 0, 2, 3 and 4 all slave devices are configured with the bonded device's MAC address.
If a user defined MAC address is not specified then the bonded device will default to using the primary slave's MAC address.
20.3.2 Using Link Bonding Devices from the EAL Command Line
Link bonding devices can be created at application startup time using the --vdev EAL command line
option. The device name must start with the net_bonding prefix followed by numbers or letters. The name must be unique for each device. Each device can have multiple options arranged in a comma separated list. Multiple device definitions can be supplied by passing the --vdev option multiple times.
Device names and bonding options must be separated by commas as shown below:
$RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,bond_opt0=..,bond_opt1=..' --vdev 'net_
• slave: Defines the PMD device which will be added as slave to the bonded device. This option
can be selected multiple times, for each device to be added as a slave. Physical devices should be
specified using their PCI address, in the format domain:bus:devid.function
slave=0000:0a:00.0,slave=0000:0a:00.1
• primary: Optional parameter which defines the primary slave port; it is used in active backup mode to select the primary slave for data TX/RX if it is available. The primary port is also used to select the MAC address to use when one is not defined by the user. This defaults to the first slave added to the device if none is specified. The primary device must be a slave of the bonded device.
primary=0000:0a:00.0
• socket_id: Optional parameter used to select which socket on a NUMA device the bonded devices
resources will be allocated on.
socket_id=0
• mac: Optional parameter to select a MAC address for the link bonding device; this overrides the value of the primary slave device.
mac=00:1e:67:1d:fd:1d
• xmit_policy: Optional parameter which defines the transmission policy when the bonded device is in balance mode. If not specified by the user, this defaults to l2 (layer 2) forwarding; the other transmission policies available are l23 (layer 2+3) and l34 (layer 3+4).
xmit_policy=l23
• up_delay: Optional parameter which adds a delay in milliseconds to the propagation of a device's link status changing to up; by default this parameter is zero.
up_delay=10
Examples of Usage
Create a bonded device in round robin mode with two slaves specified by their PCI address:
$RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=0,slave=0000:0a:00.01,slave=0000:
Create a bonded device in round robin mode with two slaves specified by their PCI address and an
overriding MAC address:
$RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=0,slave=0000:0a:00.01,slave=0000:
Create a bonded device in active backup mode with two slaves specified, and a primary slave specified
by their PCI addresses:
$RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=1,slave=0000:0a:00.01,slave=0000:
Create a bonded device in balance mode with two slaves specified by their PCI addresses, and a trans-
mission policy of layer 3 + 4 forwarding:
$RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=2,slave=0000:0a:00.01,slave=0000:
TWENTYONE
TIMER LIBRARY
The Timer library provides a timer service to DPDK execution units to enable the execution of callback functions asynchronously. Features of the library are:
• Timers can be periodic (multi-shot) or single (one-shot).
• Timers can be loaded from one core and executed on another. This has to be specified in the call to rte_timer_reset().
• Timers provide high precision (which depends on how frequently rte_timer_manage(), the function that checks timer expiration for the local core, is called).
• If not required in the application, timers can be disabled at compilation time by not calling rte_timer_manage(), to increase performance.
The timer library uses the rte_get_timer_cycles() function, which uses the High Precision Event Timer (HPET) or the CPU's Time Stamp Counter (TSC), to provide a reliable time reference.
This library provides an interface to add, delete and restart a timer. The API is based on BSD callout() with a few differences. Refer to the callout manual.
Inside the rte_timer_manage() function, the skiplist is used as a regular list by iterating along the level 0
list, which contains all timer entries, until an entry which has not yet expired has been encountered. To
improve performance in the case where there are entries in the timer list but none of those timers have
yet expired, the expiry time of the first list entry is maintained within the per-core timer list structure
itself. On 64-bit platforms, this value can be checked without the need to take a lock on the overall
structure. (Since expiry times are maintained as 64-bit values, a check on the value cannot be done on
32-bit platforms without using either a compare-and-swap (CAS) instruction or using a lock, so this
additional check is skipped in favor of checking as normal once the lock has been taken.) On both 64-bit
and 32-bit platforms, a call to rte_timer_manage() returns without taking a lock in the case where the
timer list for the calling core is empty.
21.3 References
• callout manual - The callout facility that provides timers with a mechanism to execute a function
at a given time.
• HPET - Information about the High Precision Event Timer (HPET).
TWENTYTWO
HASH LIBRARY
The DPDK provides a Hash Library for creating hash tables used for fast lookup. The hash table is a data structure optimized for searching through a set of entries that are each identified by a unique key. For increased performance, the DPDK hash requires that all keys have the same number of bytes, which is set at hash creation time.
• Add / lookup entry with key and data: Data is provided as input for add. Add allows the user to store not only the key, but also data which may be either an 8-byte integer or a pointer to external data (if the data size is more than 8 bytes).
• Combination of the two options above: User can provide key, precomputed hash, and data.
• Ability to not free the position of the entry in the hash table upon calling delete. This is useful for
multi-threaded scenarios where readers continue to use the position even after the entry is deleted.
Also, the API contains a method to allow the user to look up entries in batches, achieving higher performance than looking up individual entries, as the function prefetches the next entries while it operates on the current ones, which significantly reduces the performance overhead of the necessary memory accesses.
The actual data associated with each key can be either managed by the user using a separate table
that mirrors the hash in terms of number of entries and position of each entry, as shown in the Flow
Classification use case described in the following sections, or stored in the hash table itself.
The example hash tables in the L2/L3 Forwarding sample applications define which port to forward a
packet to based on a packet flow identified by the five-tuple lookup. However, this table could also
be used for more sophisticated features and provide many other functions and actions that could be
performed on the packets and flows.
(e.g., current ARM based platforms) that do not support transactional memory, it is advised to set this flag to achieve greater scalability in performance. If this flag is set, the 'do not free on delete' (RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL) flag is set by default.
• If the ‘do not free on delete’ (RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL) flag is set,
the position of the entry in the hash table is not freed upon calling delete(). This flag is enabled
by default when the lock free read/write concurrency flag is set. The application should free the
position after all the readers have stopped referencing the position. Where required, the applica-
tion can make use of RCU mechanisms to determine when the readers have stopped referencing
the position.
bucket is looked up, where the same procedure is carried out. If there is no match there either, the key is not in the table and a negative value will be returned.
Example of addition:
Like lookup, the primary and secondary buckets are identified. If there is an empty entry in the primary bucket, a signature is stored in that entry, the key and data (if any) are added to the second table, and the index in the second table is stored in the entry of the first table. If there is no space in the primary bucket, one of the entries in that bucket is pushed to its alternative location, and the key to be added is inserted in its position. To know where the alternative bucket of the evicted entry is, a mechanism called partial-key hashing [partial-key] is used. If there is room in the alternative bucket, the evicted entry is stored in it. If not, the same process is repeated (one of the entries gets pushed) until an empty entry is found. Notice that despite all the entry movement in the first table, the second table is not touched, which would otherwise greatly impact performance.
In the very unlikely event that an empty entry cannot be found after a certain number of displacements, the key is considered unable to be added (unless the extendable bucket flag is set, in which case the bucket is extended to insert the key, as will be explained later). With random keys, this method allows the user to reach more than 90% table utilization without having to drop any stored entry (e.g. using an LRU replacement policy) or allocate more memory (extendable buckets or rehashing).
Example of deletion:
Similar to lookup, the key is searched in its primary and secondary buckets. If the key is found, the entry
is marked as empty. If the hash table was configured with ‘no free on delete’ or ‘lock free read/write
concurrency’, the position of the key is not freed. It is the responsibility of the user to free the position
after readers are not referencing the position anymore.
See the tables below showing example entry distribution as table utilization increases.
Table 22.1: Entry distribution measured with an example table with 1024
random entries using jhash algorithm
% Table used % In Primary location % In Secondary location
25 100 0
50 96.1 3.9
75 88.2 11.8
80 86.3 13.7
85 83.1 16.9
90 77.3 22.7
95.8 64.5 35.5
Note: The last values in the table above are the average maximum table utilization with random keys and using the Jenkins hash function.
The flow table operations on the application side are described below:
• Add flow: Add the flow key to hash. If the returned position is valid, use it to access the flow entry
in the flow table for adding a new flow or updating the information associated with an existing
flow. Otherwise, the flow addition failed, for example due to lack of free entries for storing new
flows.
• Delete flow: Delete the flow key from the hash. If the returned position is valid, use it to access
the flow entry in the flow table to invalidate the information associated with the flow.
• Free flow: Free the flow key position. If the 'no free on delete' or 'lock-free read/write concurrency' flags are set, wait until readers are no longer referencing the position returned during add/delete flow, and then free the position. RCU mechanisms can be used to find out when readers are no longer referencing the position.
• Lookup flow: Lookup for the flow key in the hash. If the returned position is valid (flow lookup
hit), use the returned position to access the flow entry in the flow table. Otherwise (flow lookup
miss) there is no flow registered for the current packet.
22.9 References
• Donald E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching (2nd
Edition), 1998, Addison-Wesley Professional
• [partial-key] Bin Fan, David G. Andersen, and Michael Kaminsky, MemC3: compact and concur-
rent MemCache with dumber caching and smarter hashing, 2013, NSDI
TWENTYTHREE
ELASTIC FLOW DISTRIBUTOR LIBRARY
23.1 Introduction
In data centers today, clustering and scheduling of distributed workloads is a very common task. Many workloads require a deterministic partitioning of a flat key space among a cluster of machines. When a packet enters the cluster, the ingress node will direct the packet to its handling node. For example, data centers with disaggregated storage use storage metadata tables to forward I/O requests to the correct back end storage cluster, stateful packet inspection matches incoming flows against signatures in flow tables to send incoming packets to their intended deep packet inspection (DPI) devices, and so on.
EFD is a distributor library that uses perfect hashing to determine a target/value for a given incoming flow key. It has the following advantages: first, because it uses perfect hashing, it does not store the key itself and hence lookup performance is not dependent on the key size. Second, the target/value can be any arbitrary value, hence the system designer and/or operator can better optimize service rates and inter-cluster network traffic placement. Third, since the storage requirement is much smaller than that of a hash-based flow table (i.e. a better fit for the CPU cache), EFD can scale to millions of flow keys. Finally, with the current optimized library implementation, performance is fully scalable with any number of CPU cores.
[Figure: a load balancer (LB) hashing incoming flow keys to a target value selecting one of N targets]
to minimize inter-server traffic or to optimize for network traffic conditions, target load, etc.) is simply
not possible.
[Figure: a hash-based flow table storing N keys with their associated actions/targets]
As shown in Fig. 23.3, when doing a lookup, the flow table is indexed with the hash of the flow key, and the key(s) stored at this index (more than one is possible, because of hash collisions) and the corresponding values are retrieved. The retrieved key(s) are matched against the input flow key, and if there is a match the value (target id) is returned.
The drawback of using a hash table for flow distribution/load balancing is the storage requirement, since the flow table needs to store keys, signatures and target values. This doesn't allow the scheme to scale to millions of flow keys. Large tables will usually not fit in the CPU cache, and hence the lookup performance is degraded by the latency of accessing main memory.
explained below), and it supports any arbitrary value for any given key.
[Figure: searching a family of hash functions H1(x)..Hm(x) for one that maps each key to its correct target value]
Key      H1(x)  H2(x)  ...  Hm(x)
Key 1      0      0     1     0
Key 2      1      0     1     1
...
Key 28     0      0     0     0
The basic idea of EFD is that when a given key is to be inserted, a family of hash functions is searched until the correct hash function that maps the input key to the correct value is found, as shown in Fig. 23.4. However, rather than explicitly storing all keys and their associated values, EFD stores only the indices of the hash functions that map keys to values, and thereby consumes much less space than conventional flow-based tables. The lookup operation is very simple, similar to a computational-based scheme: given an input key, the lookup operation is reduced to hashing that key with the correct hash function.
[Figure: the entire input key set divided into many small key groups]
Intuitively, finding a hash function that maps each of a large number (millions) of input keys to the correct output value is effectively impossible. As a result, EFD, as shown in Fig. 23.5, breaks the problem into smaller pieces (divide and conquer). EFD divides the entire input key set into many small groups. Each group consists of approximately 20-28 keys (a configurable parameter for the library); then, for each small group, a brute force search is performed to find a hash function that produces the correct outputs for each key in the group.
It should be mentioned that, since the online lookup table for EFD doesn't store the key itself, the size of the EFD table is independent of the key size, and hence EFD lookup performance is almost constant irrespective of the length of the key, which is a highly desirable feature especially for longer keys.
In summary, EFD is a set separation data structure that supports millions of keys. It is used to distribute a given key to an intended target. By itself EFD is not a FIB data structure with an exact match of the input flow key.
are received at a front end server before being forwarded to the target back end server for processing.
The system designer would deterministically co-locate flows together in order to minimize cross-server
interaction. (For example, flows requesting certain webpage objects are co-located together, to minimize
forwarding of common objects across servers).
[Figure: a frontend server or load balancer with an EFD table directing flows to X backend servers; each backend server keeps its own exact-match flow table supporting N flows, so the system as a whole supports X*N flows]
As shown in Fig. 23.6, the front end server will have an EFD table that stores for each group what is
the perfect hash index that satisfies the correct output. Because the table size is small and fits in cache
(since keys are not stored), it sustains a large number of flows (N*X, where N is the maximum number
of flows served by each back end server of the X possible targets).
With an input flow key, the group id is computed (for example, using the last few bits of a CRC hash) and then the EFD table is indexed with the group id to retrieve the corresponding hash index to use. Once the index is retrieved, the key is hashed using this hash function and the result will be the intended correct target where this flow is supposed to be processed.
It should be noted that, since EFD does not match the exact key but rather distributes flows to a target back end node based on the perfect hash index, a key that has not been inserted before will still be distributed to a valid target. Hence, a local table which stores the flows served at each node is used and is exact-matched against the input key to rule out new, never-seen-before flows.
Note: This function is not multi-thread safe and should only be called from one thread.
Note: This function is multi-thread safe, but there should not be other threads writing in the EFD table,
unless locks are used.
Note: This function is not multi-thread safe and should only be called from one thread.
[Figure: group assignment (simplified) - keys are separated into groups based on some bits of their hash; each group contains a small number of keys (<28), and a per-group count of the keys assigned so far is kept]
Fig. 23.7 depicts the group assignment for 7 flow keys as an example. Given a flow key, a hash function
(in our implementation CRC hash) is used to get the group id. As shown in the figure, the groups can be
unbalanced. (We highlight group rebalancing further below).
Fig. 23.8: Perfect Hash Search - Assigned Keys & Target Value
Focusing on one group that has four keys, Fig. 23.8 depicts the search algorithm to find the perfect hash
function. Assuming that the target value bit for the keys is as shown in the figure, then the online EFD
table will store a 16 bit hash index and 16 bit lookup table per group per value bit.
[Figure: searching for a valid hash_index - each key is hashed with CRC32 (32-bit output) and its bit position in the 16-bit lookup_table is computed as (hash(key, seed1) + hash_index * hash(key, seed2)) % 16; the goal is to find a hash_index giving a valid position for every key in the group (e.g. Key1: value 0, Key3: value 1, Key4: value 0, Key7: value 1)]
For example, since both key3 and key7 have a target bit value of 1, it is okay if the hash function maps both keys to the same bit in the lookup table. A conflict occurs if a hash index is used that maps both Key4 and Key7 to the same index in the lookup_table, as shown in Fig. 23.10, since their target value bits are not the same. Once a hash index is found that produces a lookup_table with no contradictions, this index is stored for this group. This procedure is repeated for each bit of the target value.
[Figures: example lookup - for a key with group id 0x0102 and stored hash_index = 38123, the equation (hash(key, seed1) + 38123 * hash(key, seed2)) % 16 gives the bit position (here 6) in lookup_table = 0110 1100 0101 1101, from which the value bit (here 0) is retrieved]
[Figure: group rebalancing on key insert - the key's hash is split into a chunk id and an 8-bit bin id; a chunk contains 64 groups and 256 bins, each bin mapping via a 2-bit choice to one of 4 candidate groups, and a bin can be moved between its candidate groups (e.g. from group 1 to group 4) to balance the per-group key counts]
Fig. 23.12 depicts the high level idea of group rebalancing: given an input key, the hash result is split into two parts, a chunk id and an 8-bit bin id. A chunk contains 64 different groups and 256 bins (i.e. any given bin can map to one of 4 distinct groups). When a key is inserted, the bin id is computed, for example in Fig. 23.12 bin_id=2, and since each bin can be mapped to one of four different groups (2 bits of storage), the four possible mappings are evaluated and the one that results in the most balanced key distribution across these four groups is selected; the mapping result is stored in these two bits.
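The chunk/bin/group split described above can be sketched as follows. The constants come from the text (64 groups and 256 bins per chunk, a 2-bit choice per bin); the bin-to-group mapping function and all names are illustrative assumptions, not DPDK's actual layout.

```c
#include <stdint.h>

#define EFD_GROUPS_PER_CHUNK 64   /* from the text: a chunk holds 64 groups */
#define EFD_BINS_PER_CHUNK   256  /* and 256 bins */

/* A bin's 2-bit choice selects one of its 4 candidate groups in the chunk.
 * This particular mapping is an assumption made for illustration. */
static unsigned bin_to_group(unsigned bin_id, unsigned choice)
{
    return (bin_id / 4 + choice * (EFD_GROUPS_PER_CHUNK / 4))
           % EFD_GROUPS_PER_CHUNK;
}

/* Split a key's 32-bit hash into a chunk id and an 8-bit bin id, then pick
 * the least-loaded of the bin's 4 candidate groups; the 2-bit choice is
 * what gets stored for the bin. */
static unsigned choose_group(uint32_t hash, unsigned num_chunks,
                             const unsigned *group_load, uint8_t *choice_out)
{
    unsigned bin_id = hash & 0xff;                /* low 8 bits: bin id */
    unsigned chunk_id = (hash >> 8) % num_chunks; /* num_chunks: power of 2 */
    unsigned best = 0;
    for (unsigned c = 1; c < 4; c++) {
        if (group_load[chunk_id * EFD_GROUPS_PER_CHUNK + bin_to_group(bin_id, c)] <
            group_load[chunk_id * EFD_GROUPS_PER_CHUNK + bin_to_group(bin_id, best)])
            best = c;
    }
    *choice_out = (uint8_t)best;
    return chunk_id * EFD_GROUPS_PER_CHUNK + bin_to_group(bin_id, best);
}
```

The 2-bit per-bin storage is what makes rebalancing cheap: moving a bin between its four candidate groups only rewrites those two bits (plus re-deriving the affected groups' hash indices).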
23.6 References
1. EFD is based on collaborative research work between Intel and Carnegie Mellon University (CMU). Interested readers can refer to the paper "Scaling Up Clustered Network Appliances with ScaleBricks", Dong Zhou et al., SIGCOMM 2015 (http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p241.pdf) for more information.
TWENTYFOUR
MEMBERSHIP LIBRARY
24.1 Introduction
The DPDK Membership Library provides an API for DPDK applications to insert a new member, delete
an existing member, or query the existence of a member in a given set, or a group of sets. For the case
of a group of sets, the library will return not only whether the element has been inserted before in one
of the sets but also which set it belongs to. The Membership Library is an extension and generalization
of a traditional filter structure (for example Bloom Filter [Member-bloom]) that has multiple usages in
a wide variety of workloads and applications. In general, the Membership Library is a data structure
that provides a “set-summary” on whether a member belongs to a set, and as discussed in detail later,
there are two advantages of using such a set-summary rather than operating on a “full-blown” complete
list of elements: first, it has a much smaller storage requirement than storing the whole list of elements
themselves, and secondly checking an element membership (or other operations) in this set-summary is
much faster than checking it for the original full-blown complete list of elements.
We use the term “Set-Summary” in this guide to refer to the space-efficient, probabilistic membership
data structure that is provided by the library. A membership test for an element will return the set
this element belongs to or that the element is “not-found” with very high probability of accuracy. Set-
summary is a fundamental data aggregation component that can be used in many network (and other)
applications. It is a crucial structure to address performance and scalability issues of diverse network
applications including overlay networks, data-centric networks, flow table summaries, network statistics
and traffic monitoring. A set-summary is useful for applications that need to keep track of a set of elements when a complete list would require too much space and/or too much processing cost. In these situations, the set-summary works as a lossy hash-based representation of a set of members. It can dramatically reduce the space requirement and significantly improve the performance of set membership queries, at the cost of introducing a very small membership test error probability.
There are various usages for a Membership Library in a very large set of applications and workloads.
Interested readers can refer to [Member-survey] for a survey of possible networking usages. The above figure provides a small set of examples of using the Membership Library:
• Sub-figure (a) depicts a distributed web cache architecture where a collection of proxies attempt to
share their web caches (cached from a set of back-end web servers) to provide faster responses to
clients, and the proxies use the Membership Library to share summaries of what web pages/objects
they are caching. With the Membership Library, a proxy receiving an http request will inquire the
set-summary to find its location and quickly determine whether to retrieve the requested web page
from a nearby proxy or from a back-end web server.
• Sub-figure (b) depicts another example for using the Membership Library to prevent routing loops
which is typically done using slow TTL countdown and dropping packets when TTL expires. As
shown in Sub-figure (b), an embedded set-summary in the packet header itself can be used to
summarize the set of nodes a packet has gone through, and each node upon receiving a packet can
check whether its id is a member of the set of visited nodes, and if it is, then a routing loop is
detected.
• Sub-Figure (c) presents another usage of the Membership Library to load-balance flows to worker
threads with in-order guarantee where a set-summary is used to query if a packet belongs to an
existing flow or a new flow. Packets belonging to a new flow are forwarded to the current least
loaded worker thread, while those belonging to an existing flow are forwarded to the pre-assigned
thread to guarantee in-order processing.
• Sub-figure (d) highlights yet another usage example in the database domain, where a set-summary is used to determine joins between sets. Instead of creating a join by comparing each element of one set against the elements of a different set, the join is done on the summaries, since they can efficiently encode the members of a given set.
Membership Library is a configurable library that is optimized to cover set membership functionality
for both a single set and multi-set scenarios. Two set-summary schemes are presented including (a)
vector of Bloom Filters and (b) Hash-Table based set-summary schemes with and without false negative
probability. This guide first briefly describes these different types of set-summaries, usage examples for
each, and then it highlights the Membership Library API.
are initially all set to 0. Then it chooses k independent hash functions h1, h2, ... hk with hash values
range from 0 to m-1 to perform hashing calculations on each element to be inserted. Every time an element X is inserted into the set, the bits at positions h1(X), h2(X), ... hk(X) in v are set to 1 (any particular bit might be set to 1 multiple times for multiple different inserted elements). Given a
query for any element Y, the bits at positions h1(Y), h2(Y), ... hk(Y) are checked. If any of them is
0, then Y is definitely not in the set. Otherwise there is a high probability that Y is a member of the set
with certain false positive probability. As shown in the next equation, the false positive probability can
be made arbitrarily small by changing the number of hash functions (k) and the vector length (m).
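The equation referred to above did not survive extraction; the standard Bloom filter estimate, for n inserted elements, a bit vector of length m, and k hash functions, is:

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad k_{\text{opt}} = \frac{m}{n}\,\ln 2
```

Increasing m, or tuning k toward its optimum for the given m and n, drives the false positive probability p arbitrarily low.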
Without BF, an accurate membership testing could involve a costly hash table lookup and full element
comparison. The advantage of using a BF is to simplify the membership test into a series of hash
calculations and memory accesses for a small bit-vector, which can be easily optimized. Hence the
lookup throughput (set membership test) can be significantly faster than a normal hash table lookup
with element comparison.
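The series of hash calculations and bit accesses can be sketched with a minimal Bloom filter. Sizes and the hash derivation here are illustrative assumptions; DPDK's library uses its own hashes and sizing.

```c
#include <stdint.h>
#include <string.h>

#define BF_BITS 1024  /* m: bit-vector length (illustrative size) */
#define BF_K    4     /* k: number of hash functions */

struct bloom { uint8_t v[BF_BITS / 8]; };

/* Derive the i-th hash position from one base mix (an assumption for
 * illustration; k truly independent hashes would normally be used). */
static uint32_t bf_pos(uint32_t x, int i)
{
    uint32_t h = x * 2654435761u + (uint32_t)i * 0x9e3779b9u;
    h ^= h >> 16;
    return h % BF_BITS;
}

static void bf_insert(struct bloom *b, uint32_t x)
{
    for (int i = 0; i < BF_K; i++) {
        uint32_t p = bf_pos(x, i);
        b->v[p / 8] |= (uint8_t)(1u << (p % 8)); /* set bit h_i(x) */
    }
}

/* 0 = definitely not in the set; 1 = probably in the set. */
static int bf_query(const struct bloom *b, uint32_t x)
{
    for (int i = 0; i < BF_K; i++) {
        uint32_t p = bf_pos(x, i);
        if (!(b->v[p / 8] & (1u << (p % 8))))
            return 0; /* one zero bit proves non-membership */
    }
    return 1;
}
```

Note the asymmetry: a negative answer is always exact, while a positive answer carries the small false positive probability discussed above.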
BF is used for applications that need only one set, and the membership of elements is checked against the BF. The routing-loop example discussed above is one potential application that uses only one set, capturing the node IDs that have been visited so far by the packet. Each node will then check
this embedded BF in the packet header for its own id, and if the BF indicates that the current node is
definitely not in the set then a loop-free route is guaranteed.
To support membership tests for both multiple sets and a single set, the library implements a Vector Bloom Filter (vBF) scheme. vBF basically composes multiple bloom filters into a vector of bloom filters. The membership test is conducted on all of the bloom filters concurrently to determine which set(s) an element belongs to, or none of them. The basic idea of vBF is shown in the above figure, where an element is used to address multiple bloom filters concurrently and the bloom filter index(es) with a hit is returned. Lookup/insertion is done in the series of BFs one by one, or can be optimized to be done in parallel.
As previously mentioned, there are many usages of such structures. vBF is used for applications that
need to check membership against multiple sets simultaneously. The example shown in the above figure
uses a set to capture all flows being assigned for processing at a given worker thread. Upon receiving a
packet the vBF is used to quickly figure out if this packet belongs to a new flow so as to be forwarded
to the current least loaded worker thread, or otherwise it should be queued for an existing thread to
guarantee in-order processing (i.e. the property of vBF to indicate right away that a given flow is a new
one or not is critical to minimize response time latency).
It should be noted that vBF can be implemented using a set of single bloom filters with sequential lookup
of each BF. However, being able to concurrently search all set-summaries is a big throughput advantage.
In the library, certain parallelism is realized by the implementation of checking all bloom filters together.
As shown in the above figure, attack signature matching, where each set represents a certain signature length in the payload (for correctness of this example, an attack signature should not be a subset of another one), is a good example of using HTSS with 0% false negatives (i.e., when an element returns not found, there is 100% certainty that it is not a member of any set). The packet inspection application benefits from knowing right away that the current payload does not match any attack signature in the database, establishing its legitimacy; otherwise a deep inspection of the packet is needed.
HTSS employs a similar but simpler data structure to a traditional hash table, and the major difference
is that HTSS stores only the signatures but not the full keys/elements which can significantly reduce the
footprint of the table. Along with the signature, HTSS also stores a value to indicate the target set. When
looking up an element, the element is hashed and the HTSS is addressed to retrieve the signature stored.
If the signature matches, then the value is retrieved, indicating the target set to which the element belongs. Because signatures can collide, HTSS can still have a false positive probability.
Furthermore, if elements are allowed to be overwritten or evicted when the hash table becomes full, it
will also have a false negative probability. We discuss this case in the next section.
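A minimal sketch of such a signature table follows. Field sizes, the hash, and the overwrite-on-insert behaviour are illustrative assumptions, not the actual rte_member HTSS layout; storing only a 16-bit signature instead of the full key is what shrinks the footprint.

```c
#include <stdint.h>
#include <string.h>

#define HT_ENTRIES 256

/* Each entry stores only a small signature plus the target set id,
 * never the full key. */
struct htss_entry {
    uint16_t sig;     /* key signature; 0 marks an empty slot */
    uint16_t set_id;  /* which set the element belongs to */
};

struct htss { struct htss_entry e[HT_ENTRIES]; };

static uint32_t htss_hash(uint32_t key)
{
    uint32_t h = key * 2654435761u;
    return h ^ (h >> 16);
}

static void htss_insert(struct htss *t, uint32_t key, uint16_t set_id)
{
    uint32_t h = htss_hash(key);
    struct htss_entry *slot = &t->e[h % HT_ENTRIES];
    slot->sig = (uint16_t)(h >> 16) | 1; /* force nonzero: 0 means empty */
    slot->set_id = set_id;               /* may overwrite: cache behaviour */
}

/* Returns the set id, or 0 for "not found". Signature collisions give
 * false positives; overwrites (cache mode) give false negatives. */
static uint16_t htss_lookup(const struct htss *t, uint32_t key)
{
    uint32_t h = htss_hash(key);
    const struct htss_entry *slot = &t->e[h % HT_ENTRIES];
    if (slot->sig == ((uint16_t)(h >> 16) | 1))
        return slot->set_id;
    return 0;
}
```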
Fig. 24.7: Using HTSS with False Negatives for Wild Card Classification
HTSS with false negatives (i.e. a cache) also has a wide set of applications. For example, wild card flow classification (e.g. ACL rules), highlighted in the above figure, is one such application. In that
case each target set represents a sub-table with rules defined by a certain flow mask. The flow masks
are non-overlapping, and for flows matching more than one rule only the highest priority one is inserted
in the corresponding sub-table (interested readers can refer to the Open vSwitch (OvS) design of Mega
Flow Cache (MFC) [Member-OvS] for further details). Typically the rules will have a large number of
distinct unique masks and hence, a large number of target sets each corresponding to one mask. Because
the active set of flows varies widely based on the network traffic, HTSS with false negative will act as a
cache for <flowid, target ACL sub-table> pair for the current active set of flows. When a miss occurs (as
shown in red in the above figure) the sub-tables will be searched sequentially one by one for a possible
match, and when found the flow key and target sub-table will be inserted into the set-summary (i.e.
cache insertion) so subsequent packets from the same flow don’t incur the overhead of the sequential
search of sub-tables.
or which have a slightly different meaning for different types of set-summary. For example, the num_keys parameter means the maximum number of entries for a hash table based set-summary. For a bloom filter, however, this value means the expected number of keys that could be inserted into the bloom filter(s). The value is used to calculate the size of each bloom filter.
We also pass two seeds: prim_hash_seed and sec_hash_seed for the primary and secondary
hash functions to calculate two independent hash values. socket_id parameter is the NUMA socket
ID for the memory used to create the set-summary. For HTSS, another parameter is_cache is used
to indicate if this set-summary is a cache (i.e. with false negative probability) or not. For vBF, extra
parameters are needed. For example, num_set is the number of sets needed to initialize the vector of bloom filters. This number is equal to the number of bloom filters that will be created. false_pos_rate
is the false positive rate. num_keys and false_pos_rate will be used to determine the number of hash
functions and the bloom filter size.
entry count per bucket. max_match_per_key should be equal to or smaller than the maximum number of possible matches.
The rte_membership_lookup_multi_bulk() function looks up a bulk of keys/elements in the
set-summary structure for multiple matches, each key lookup returns ALL the matches (possibly more
than one) found for this key when it is matched against all target sets (cache mode HTSS matches at most
one target set). The return value is the number of keys that find one or more matches in the set-summary
structure. The arguments of the function include keys, which is a pointer to a bulk of keys that are to be looked up; num_keys, the number of keys that will be looked up; max_match_per_key, the possible maximum number of matches for each key; match_count, the returned number of matches for each key; and set_ids, the returned target set ids for all matches found for each key. set_ids is a 2-D array containing a 1-D array for each key (the size of the 1-D array per key should be set by the user according to max_match_per_key). max_match_per_key should be equal to or smaller than the maximum number of possible matches, similar to rte_member_lookup_multi.
24.5 References
[Member-bloom] B H Bloom, “Space/Time Trade-offs in Hash Coding with Allowable Errors,” Com-
munications of the ACM, 1970.
[Member-survey] A Broder and M Mitzenmacher, “Network Applications of Bloom Filters: A Survey,”
in Internet Mathematics, 2005.
[Member-cfilter] B Fan, D G Andersen and M Kaminsky, “Cuckoo Filter: Practically Better Than
Bloom,” in Conference on emerging Networking Experiments and Technologies, 2014.
[Member-OvS] B Pfaff, “The Design and Implementation of Open vSwitch,” in NSDI, 2015.
1 Traditional bloom filters do not support proactive deletion. Supporting proactive deletion requires additional implementation and performance overhead.
TWENTYFIVE
LPM LIBRARY
The DPDK LPM library component implements the Longest Prefix Match (LPM) table search method
for 32-bit keys that is typically used to find the best route match in IP forwarding applications.
The first table, called tbl24, is indexed using the first 24 bits of the IP address to be looked up, while the
second table(s), called tbl8, is indexed using the last 8 bits of the IP address. This means that depending
on the outcome of trying to match the IP address of an incoming packet to the rule stored in the tbl24
we might need to continue the lookup process in the second level.
Since every entry of the tbl24 can potentially point to a tbl8, ideally, we would have 2^24 tbl8s, which
would be the same as having a single table with 2^32 entries. This is not feasible due to resource
restrictions. Instead, this approach takes advantage of the fact that rules longer than 24 bits are very rare.
By splitting the process in two different tables/levels and limiting the number of tbl8s, we can greatly
reduce memory consumption while maintaining a very good lookup speed (one memory access, most of
the times).
• valid
• valid group
• depth
Next hop and depth contain the same information as in the tbl24. The two flags show whether the entry
and the table are valid respectively.
The other main data structure is a table containing the main information about the rules (IP and next
hop). This is a higher level table, used for different things:
• Check whether a rule already exists or not, prior to addition or deletion, without having to actually
perform a lookup.
• When deleting, to check whether there is a rule containing the one that is to be deleted. This is
important, since the main data structure will have to be updated accordingly.
25.2.1 Addition
When adding a rule, there are different possibilities. If the rule’s depth is exactly 24 bits, then:
• Use the rule (IP address) as an index to the tbl24.
• If the entry is invalid (i.e. it doesn’t already contain a rule) then set its next hop to its value, the
valid flag to 1 (meaning this entry is in use), and the external entry flag to 0 (meaning the lookup
process ends at this point, since this is the longest prefix that matches).
If the rule’s depth is exactly 32 bits, then:
• Use the first 24 bits of the rule as an index to the tbl24.
• If the entry is invalid (i.e. it doesn’t already contain a rule) then look for a free tbl8, set the index
to the tbl8 to this value, the valid flag to 1 (meaning this entry is in use), and the external entry flag
to 1 (meaning the lookup process must continue since the rule hasn’t been explored completely).
If the rule’s depth is any other value, prefix expansion must be performed. This means the rule is copied
to all the entries (as long as they are not in use) which would also cause a match.
As a simple example, let’s assume the depth is 20 bits. This means that there are 2^(24 - 20) = 16
different combinations of the first 24 bits of an IP address that would cause a match. Hence, in this case,
we copy the exact same entry to every position indexed by one of these combinations.
By doing this we ensure that during the lookup process, if a rule matching the IP address exists, it is
found in either one or two memory accesses, depending on whether we need to move to the next table
or not. Prefix expansion is one of the keys of this algorithm, since it improves the speed dramatically by
adding redundancy.
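The expansion step above can be sketched as follows. The entry layout, names, and the overwrite rule (only replace entries holding an equal or shorter prefix, so longer prefixes keep precedence) are illustrative; DPDK's rte_lpm packs its entries differently.

```c
#include <stdint.h>

struct tbl24_entry {
    uint8_t valid;
    uint8_t depth;
    uint16_t next_hop; /* field layout is illustrative */
};

/*
 * Expand a rule of depth < 24 over every tbl24 slot it covers: a /depth
 * prefix maps to 2^(24 - depth) consecutive tbl24 indices. Returns the
 * number of entries written.
 */
static unsigned expand_prefix(struct tbl24_entry *tbl, uint32_t ip,
                              uint8_t depth, uint16_t next_hop)
{
    uint32_t first = (ip & ~((1u << (32 - depth)) - 1)) >> 8; /* first slot */
    uint32_t count = 1u << (24 - depth);                      /* slots covered */
    unsigned written = 0;
    for (uint32_t i = first; i < first + count; i++) {
        if (!tbl[i].valid || tbl[i].depth <= depth) {
            tbl[i].valid = 1;
            tbl[i].depth = depth;
            tbl[i].next_hop = next_hop;
            written++;
        }
    }
    return written;
}
```

For the /20 example in the text, exactly 2^(24 - 20) = 16 consecutive tbl24 entries receive the copied rule.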
25.2.2 Lookup
The lookup process is much simpler and quicker. In this case:
• Use the first 24 bits of the IP address as an index to the tbl24. If the entry is not in use, then it
means we don’t have a rule matching this IP. If it is valid and the external entry flag is set to 0,
then the next hop is returned.
• If it is valid and the external entry flag is set to 1, then we use the tbl8 index to find out the tbl8
to be checked, and the last 8 bits of the IP address as an index to this table. Similarly, if the entry
is not in use, then we don’t have a rule matching this IP address. If it is valid then the next hop is
returned.
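The two-level lookup above can be sketched as follows. The entry struct and flag encoding are illustrative assumptions for clarity; rte_lpm packs valid/depth/next-hop bits into a compact word.

```c
#include <stdint.h>

#define LPM_EXT_ENTRY   1      /* external entry flag: continue in a tbl8 */
#define LPM_LOOKUP_MISS 0xFFFF

struct lpm_entry {
    uint8_t valid;
    uint8_t ext;       /* 0: next_hop is final; 1: next_hop is a tbl8 index */
    uint16_t next_hop; /* layout is illustrative, not DPDK's packed format */
};

/*
 * Two-level lookup: index tbl24 with the top 24 bits of the address; if
 * the entry is external, index the selected tbl8 with the bottom 8 bits.
 * One memory access in the common case, two at most.
 */
static uint16_t lpm_lookup(const struct lpm_entry *tbl24,
                           const struct lpm_entry *tbl8s, /* groups of 256 */
                           uint32_t ip)
{
    const struct lpm_entry *e = &tbl24[ip >> 8];
    if (!e->valid)
        return LPM_LOOKUP_MISS;
    if (e->ext == LPM_EXT_ENTRY) {
        e = &tbl8s[(uint32_t)e->next_hop * 256 + (ip & 0xff)];
        if (!e->valid)
            return LPM_LOOKUP_MISS;
    }
    return e->next_hop;
}
```

(A real tbl24 has 2^24 entries; the sketch only assumes the caller passes tables large enough for the addresses looked up.)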
25.2.5 References
• RFC1519 Classless Inter-Domain Routing (CIDR): an Address Assignment and Aggregation
Strategy, http://www.ietf.org/rfc/rfc1519
• Pankaj Gupta, Algorithms for Routing Lookups and Packet Classification, PhD Thesis, Stanford
University, 2000 (http://klamath.stanford.edu/~pankaj/thesis/thesis_1sided.pdf )
TWENTYSIX
LPM6 LIBRARY
The LPM6 (LPM for IPv6) library component implements the Longest Prefix Match (LPM) table search method for 128-bit keys that is typically used to find the best match route in IPv6 forwarding applications.
26.1.2 Addition
When adding a rule, there are different possibilities. If the rule’s depth is exactly 24 bits, then:
• Use the rule (IP address) as an index to the tbl24.
• If the entry is invalid (i.e. it doesn’t already contain a rule) then set its next hop to its value, the
valid flag to 1 (meaning this entry is in use), and the external entry flag to 0 (meaning the lookup
process ends at this point, since this is the longest prefix that matches).
If the rule’s depth is bigger than 24 bits but a multiple of 8, then:
• Use the first 24 bits of the rule as an index to the tbl24.
• If the entry is invalid (i.e. it doesn’t already contain a rule) then look for a free tbl8, set the index
to the tbl8 to this value, the valid flag to 1 (meaning this entry is in use), and the external entry flag
to 1 (meaning the lookup process must continue since the rule hasn’t been explored completely).
• Use the following 8 bits of the rule as an index to the next tbl8.
• Repeat the process until the tbl8 at the right level (depending on the depth) has been reached and
fill it with the next hop, setting the next entry flag to 0.
If the rule’s depth is any other value, prefix expansion must be performed. This means the rule is copied
to all the entries (as long as they are not in use) which would also cause a match.
As a simple example, let’s assume the depth is 20 bits. This means that there are 2^(24-20) = 16 different
combinations of the first 24 bits of an IP address that would cause a match. Hence, in this case, we copy
the exact same entry to every position indexed by one of these combinations.
By doing this we ensure that during the lookup process, if a rule matching the IP address exists, it is
found in, at the most, 14 memory accesses, depending on how many times we need to move to the next
table. Prefix expansion is one of the keys of this algorithm, since it improves the speed dramatically by
adding redundancy.
Prefix expansion can be performed at any level. So, for example, if the depth is 34 bits, it will be performed in the third level (second tbl8-based level).
26.1.3 Lookup
The lookup process is much simpler and quicker. In this case:
• Use the first 24 bits of the IP address as an index to the tbl24. If the entry is not in use, then it
means we don’t have a rule matching this IP. If it is valid and the external entry flag is set to 0,
then the next hop is returned.
• If it is valid and the external entry flag is set to 1, then we use the tbl8 index to find out the tbl8 to
be checked, and the next 8 bits of the IP address as an index to this table. Similarly, if the entry is
not in use, then we don’t have a rule matching this IP address. If it is valid then check the external
entry flag for a new tbl8 to be inspected.
• Repeat the process until either we find an invalid entry (lookup miss) or a valid entry with the
external entry flag set to 0. Return the next hop in the latter case.
TWENTYSEVEN
FLOW CLASSIFICATION LIBRARY
DPDK provides a Flow Classification library that provides the ability to classify an input packet by
matching it against a set of Flow rules.
The initial implementation supports counting of IPv4 5-tuple packets which match a particular Flow rule
only.
Please refer to the Generic flow API (rte_flow) for more information.
The Flow Classification library uses the librte_table API for managing Flow rules and matching packets against the Flow rules. The library is table agnostic and can use the following tables: Access Control List, Hash and Longest Prefix Match (LPM). The Access Control List table is used in the initial implementation.
Please refer to the Packet Framework for more information on librte_table.
DPDK provides an Access Control List library that provides the ability to classify an input packet based
on a set of classification rules.
Please refer to the Packet Classification and Access Control library for more information on
librte_acl.
There is also a Flow Classify sample application which demonstrates the use of the Flow Classification
Library API’s.
Please refer to the flow_classify sample application guide (../sample_app_ug/flow_classify) for more information on the flow_classify sample application.
27.1 Overview
The library has the following APIs:
/**
* Flow classifier create
*
* @param params
* Parameters for flow classifier creation
* @return
* Handle to flow classifier instance on success or NULL otherwise
*/
struct rte_flow_classifier *
rte_flow_classifier_create(struct rte_flow_classifier_params *params);
/**
* Flow classifier free
*
* @param cls
* Handle to flow classifier instance
* @return
* 0 on success, error code otherwise
*/
int
rte_flow_classifier_free(struct rte_flow_classifier *cls);
/**
* Flow classify table create
*
* @param cls
* Handle to flow classifier instance
* @param params
* Parameters for flow_classify table creation
* @return
* 0 on success, error code otherwise
*/
int
rte_flow_classify_table_create(struct rte_flow_classifier *cls,
struct rte_flow_classify_table_params *params);
/**
* Validate the flow classify rule
*
* @param[in] cls
* Handle to flow classifier instance
* @param[in] attr
* Flow rule attributes
* @param[in] pattern
* Pattern specification (list terminated by the END pattern item).
* @param[in] actions
* Associated actions (list terminated by the END pattern item).
* @param[out] error
* Perform verbose error reporting if not NULL. Structure
* initialised in case of error only.
* @return
* 0 on success, error code otherwise
*/
int
rte_flow_classify_validate(struct rte_flow_classifier *cls,
const struct rte_flow_attr *attr,
const struct rte_flow_item pattern[],
const struct rte_flow_action actions[],
struct rte_flow_error *error);
/**
* Add a flow classify rule to the flow_classifier table.
*
* @param[in] cls
* Flow classifier handle
* @param[in] attr
* Flow rule attributes
* @param[in] pattern
* Pattern specification (list terminated by the END pattern item).
* @param[in] actions
* Associated actions (list terminated by the END pattern item).
* @param[out] key_found
* returns 1 if rule present already, 0 otherwise.
* @param[out] error
* Perform verbose error reporting if not NULL. Structure
* initialised in case of error only.
* @return
* A valid handle in case of success, NULL otherwise.
*/
struct rte_flow_classify_rule *
rte_flow_classify_table_entry_add(struct rte_flow_classifier *cls,
const struct rte_flow_attr *attr,
const struct rte_flow_item pattern[],
const struct rte_flow_action actions[],
int *key_found,
struct rte_flow_error *error);
/**
* Delete a flow classify rule from the flow_classifier table.
*
* @param[in] cls
* Flow classifier handle
* @param[in] rule
* Flow classify rule
* @return
* 0 on success, error code otherwise.
*/
int
rte_flow_classify_table_entry_delete(struct rte_flow_classifier *cls,
struct rte_flow_classify_rule *rule);
/**
* Query flow classifier for given rule.
*
* @param[in] cls
* Flow classifier handle
* @param[in] pkts
* Pointer to packets to process
* @param[in] nb_pkts
* Number of packets to process
* @param[in] rule
* Flow classify rule
* @param[in] stats
* Flow classify stats
*
* @return
* 0 on success, error code otherwise.
*/
int
rte_flow_classifier_query(struct rte_flow_classifier *cls,
struct rte_mbuf **pkts,
const uint16_t nb_pkts,
struct rte_flow_classify_rule *rule,
struct rte_flow_classify_stats *stats);
/** CPU socket ID where memory for the flow classifier and its */
/** elements (tables) should be allocated */
int socket_id;
};
struct rte_cls_table {
/* Input parameters */
struct rte_table_ops ops;
uint32_t entry_size;
enum rte_flow_classify_table_type type;
struct rte_flow_classifier {
/* Input parameters */
char name[RTE_FLOW_CLASSIFIER_MAX_NAME_SZ];
int socket_id;
/* Internal */
/* ntuple_filter */
struct rte_eth_ntuple_filter ntuple_filter;
/* classifier tables */
struct rte_cls_table tables[RTE_FLOW_CLASSIFY_TABLE_MAX];
uint32_t table_mask;
uint32_t num_tables;
uint16_t nb_pkts;
struct rte_flow_classify_table_entry
*entries[RTE_PORT_IN_BURST_SIZE_MAX];
} __rte_cache_aligned;
To create an ACL table the rte_table_acl_params structure must be initialised and assigned to
arg_create in the rte_flow_classify_table_params structure.
struct rte_table_acl_params {
/** Name */
const char *name;
};
The fields for the ACL rule must also be initialised by the application.
An ACL table can be added to the Classifier for each ACL rule, for example another table could
be added for the IPv6 5-tuple rule.
The API function rte_flow_classify_validate parses the IPv4 5-tuple pattern, attributes and
actions and returns the 5-tuple data in the rte_eth_ntuple_filter structure.
static int
rte_flow_classify_validate(struct rte_flow_classifier *cls,
const struct rte_flow_attr *attr,
const struct rte_flow_item pattern[],
const struct rte_flow_action actions[],
struct rte_flow_error *error)
struct classify_rules {
enum rte_flow_classify_rule_type type;
union {
struct rte_flow_classify_ipv4_5tuple ipv4_5tuple;
} u;
};
struct rte_flow_classify {
uint32_t id; /* unique ID of classify object */
enum rte_flow_classify_table_type tbl_type; /* rule table */
struct classify_rules rules; /* union of rules */
union {
struct acl_keys key;
} u;
int key_found; /* rule key found in table */
struct rte_flow_classify_table_entry entry; /* rule meta data */
void *entry_ptr; /* handle to the table entry for rule meta data */
};
It then calls the table.ops.f_add API to add the rule to the ACL table.
/**
* Flow stats
*
* For the count action, stats can be returned by the query API.
*
* Storage for stats is provided by the application.
*
*
*/
struct rte_flow_classify_stats {
void *stats;
};
struct rte_flow_classify_5tuple_stats {
TWENTYEIGHT
PACKET DISTRIBUTOR LIBRARY
The DPDK Packet Distributor library is designed to be used for dynamic load balancing of traffic while supporting single packet at a time operation. When using this library, the logical cores in
use are to be considered in two roles: firstly a distributor lcore, which is responsible for load balancing
or distributing packets, and a set of worker lcores which are responsible for receiving the packets from
the distributor and operating on them. The model of operation is shown in the diagram below.
There are two modes of operation of the API in the distributor library, one which sends one packet
at a time to workers using 32-bits for flow_id, and an optimized mode which sends bursts of up to 8
packets at a time to workers, using 15 bits of flow_id. The mode is selected by the type field in the
rte_distributor_create() function.
TWENTYNINE
REORDER LIBRARY
The Reorder Library provides a mechanism for reordering mbufs based on their sequence number.
29.1 Operation
The reorder library is essentially a buffer that reorders mbufs. The user inserts out of order mbufs into
the reorder buffer and pulls in-order mbufs from it.
At a given time, the reorder buffer contains mbufs whose sequence number are inside the sequence
window. The sequence window is determined by the minimum sequence number and the number of
entries that the buffer was configured to hold. For example, given a reorder buffer with 200 entries and
a minimum sequence number of 350, the sequence window has low and high limits of 350 and 550
respectively.
When inserting mbufs, the reorder library differentiates between valid, early and late mbufs depending
on the sequence number of the inserted mbuf:
• valid: the sequence number is inside the window.
• late: the sequence number is outside the window and less than the low limit.
• early: the sequence number is outside the window and greater than the high limit.
The reorder buffer directly returns late mbufs and tries to accommodate early mbufs.
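The valid/late/early classification above can be sketched directly. Sequence-number wraparound is ignored here for clarity; the library itself must handle it.

```c
#include <stdint.h>

enum seq_class { SEQ_LATE, SEQ_VALID, SEQ_EARLY };

/* Classify a sequence number against the window
 * [min_seqn, min_seqn + size). */
static enum seq_class classify(uint32_t seqn, uint32_t min_seqn, uint32_t size)
{
    if (seqn < min_seqn)
        return SEQ_LATE;   /* below the low limit */
    if (seqn >= min_seqn + size)
        return SEQ_EARLY;  /* at or above the high limit */
    return SEQ_VALID;      /* inside the window */
}
```

With the text's example of 200 entries and a minimum sequence number of 350, sequence numbers 350 through 549 are valid, 349 and below are late, and 550 and above are early.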
as late packets when they arrive. The process of moving packets to the Ready buffer continues beyond
the minimum required until a gap, i.e. missing mbuf, in the Order buffer is encountered.
When draining mbufs, the reorder buffer would return mbufs in the Ready buffer first and then from the
Order buffer until a gap is found (mbufs that have not arrived yet).
THIRTY
IP FRAGMENTATION AND REASSEMBLY LIBRARY
The IP Fragmentation and Reassembly Library implements IPv4 and IPv6 packet fragmentation and
reassembly.
Internally, the fragment table is a simple hash table. The basic idea is to use two hash functions and
<bucket_entries> * associativity. This provides 2 * <bucket_entries> possible locations in the hash
table for each key. When a collision occurs and all 2 * <bucket_entries> locations are occupied, instead
of reinserting existing keys into alternative locations, ip_frag_tbl_add() simply returns a failure.
Also, entries that reside in the table longer than <max_cycles> are considered invalid, and can be
removed/replaced by new ones.
Note that reassembly demands a lot of mbufs to be allocated. At any given time, up to (2 * bucket_entries
* RTE_LIBRTE_IP_FRAG_MAX * <maximum number of mbufs per packet>) mbufs can be stored inside
the fragment table waiting for remaining fragments.
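The sizing arithmetic above can be sketched as follows (the function and parameter names are illustrative, not DPDK symbols):

```c
#include <stdint.h>

/* Two hash functions give each key 2 * <bucket_entries> candidate
 * locations in the table. */
static uint64_t frag_tbl_locations(uint64_t bucket_entries)
{
    return 2 * bucket_entries;
}

/* Worst-case number of mbufs held in the fragment table at any given
 * time while waiting for remaining fragments, per the formula above. */
static uint64_t frag_tbl_max_mbufs(uint64_t bucket_entries,
                                   uint64_t max_frags_per_packet, /* RTE_LIBRTE_IP_FRAG_MAX */
                                   uint64_t max_mbufs_per_packet)
{
    return 2 * bucket_entries * max_frags_per_packet * max_mbufs_per_packet;
}
```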
THIRTYONE
GENERIC RECEIVE OFFLOAD LIBRARY
Generic Receive Offload (GRO) is a widely used SW-based offloading technique to reduce per-packet
processing overheads. By reassembling small packets into larger ones, GRO enables applications to
process fewer large packets, thus reducing the per-packet processing cost. To benefit DPDK-based
applications, like Open vSwitch, DPDK also provides its own GRO implementation. In DPDK, GRO is
implemented as a standalone library. Applications explicitly use the GRO library to reassemble packets.
31.1 Overview
In the GRO library, there are many GRO types which are defined by packet types. Each GRO type is in
charge of processing one kind of packet. For example, TCP/IPv4 GRO processes TCP/IPv4 packets.
Each GRO type has a reassembly function, which defines its own algorithm and table structure to
reassemble packets. Input packets are assigned to the corresponding GRO functions by MBUF->packet_type.
The GRO library doesn’t check if input packets have correct checksums and doesn’t re-calculate
checksums for merged packets. The GRO library assumes the packets are complete (i.e., MF==0 &&
frag_off==0) when IP fragmentation is possible (i.e., DF==0). Additionally, it complies with RFC 6864
to process the IPv4 ID field.
Currently, the GRO library provides GRO support for TCP/IPv4 packets and VxLAN packets which
contain an outer IPv4 header and an inner TCP/IPv4 packet.
31.3.1 Challenges
The reassembly algorithm determines the efficiency of GRO. There are two challenges in the algorithm
design:
• a high-cost algorithm/implementation would cause packet dropping in a high-speed network.
• packet reordering makes it hard to merge packets. For example, Linux GRO fails to merge packets
when it encounters packet reordering.
The above two challenges require that the algorithm be:
• lightweight enough to scale to fast networking speeds
• capable of handling packet reordering
In DPDK GRO, we use a key-based algorithm to address the two challenges.
Note: Packets in the same “flow” that can’t merge are always caused by packet reordering.
[Figure: key-based reassembly flow. For each input packet, search for its “flow”: if no flow is found,
insert a new “flow” and store the packet; if a flow is found, search it for a “neighbor”: if no neighbor is
found, store the packet; otherwise merge the packet with the neighbor.]
Note: We comply with RFC 6864 to process the IPv4 ID field. Specifically, we check IPv4 ID fields for
packets whose DF bit is 0 and ignore IPv4 ID fields for packets whose DF bit is 1. Additionally, packets
which have different DF bit values can’t be merged.
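The two-level lookup can be modeled in self-contained C. This is an illustrative reduction, not the librte_gro code: the integer flow key and the neighbor test (plain contiguity in a byte-sequence space) stand in for the real TCP/IPv4 header matching, and bounds checks are omitted:

```c
#include <stdint.h>

#define MAX_FLOWS 8
#define MAX_ITEMS 8

struct item { uint32_t seq; uint32_t len; };
struct flow { uint32_t key; int n_items; struct item items[MAX_ITEMS]; };

static struct flow flows[MAX_FLOWS];
static int n_flows;

/* Returns 1 if the packet was merged with a neighbor, 0 if it was
 * stored (in a new or existing flow). */
static int gro_insert(uint32_t key, uint32_t seq, uint32_t len)
{
    struct flow *f = 0;
    for (int i = 0; i < n_flows; i++)
        if (flows[i].key == key) { f = &flows[i]; break; }
    if (f == 0) {                          /* no flow found: insert a new one */
        f = &flows[n_flows++];
        f->key = key;
        f->n_items = 0;
    } else {                               /* flow found: search for a neighbor */
        for (int i = 0; i < f->n_items; i++) {
            struct item *it = &f->items[i];
            if (seq == it->seq + it->len) { it->len += len; return 1; } /* append */
            if (seq + len == it->seq) { it->seq = seq; it->len += len; return 1; } /* prepend */
        }
    }
    f->items[f->n_items].seq = seq;        /* no neighbor: store the packet */
    f->items[f->n_items].len = len;
    f->n_items++;
    return 0;
}
```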
THIRTYTWO
GENERIC SEGMENTATION OFFLOAD LIBRARY
32.1 Overview
Generic Segmentation Offload (GSO) is a widely used software implementation of TCP Segmentation
Offload (TSO), which reduces per-packet processing overhead. Much like TSO, GSO gains performance
by enabling upper layer applications to process a smaller number of large packets (e.g. MTU size of
64KB), instead of processing higher numbers of small packets (e.g. MTU size of 1500B), thus reducing
per-packet overhead.
For example, GSO allows guest kernel stacks to transmit over-sized TCP segments that far exceed the
kernel interface’s MTU; this eliminates the need to segment packets within the guest, and improves the
data-to-overhead ratio of both the guest-host link, and PCI bus. The expectation of the guest network
stack in this scenario is that segmentation of egress frames will take place either in the NIC HW, or
where that hardware capability is unavailable, either in the host application, or network stack.
Bearing that in mind, the GSO library enables DPDK applications to segment packets in software.
Note however, that GSO is implemented as a standalone library, and not via a ‘fallback’ mechanism
(i.e. for when TSO is unsupported in the underlying hardware); that is, applications must explicitly
invoke the GSO library to segment packets. The size of GSO segments (segsz) is configurable by the
application.
32.2 Limitations
1. The GSO library doesn’t check if input packets have correct checksums.
2. In addition, the GSO library doesn’t re-calculate checksums for segmented packets (that task is
left to the application).
3. IP fragments are unsupported by the GSO library.
4. The egress interface’s driver must support multi-segment packets.
5. Currently, the GSO library supports the following IPv4 packet types:
• TCP
• UDP
• VxLAN
• GRE
See Supported GSO Packet Types for further details.
In one situation, the output segment may contain additional ‘data’ segments. This only occurs when:
• the input packet on which GSO is to be performed is represented by a multi-segment mbuf.
• the output segment is required to contain data that spans the boundaries between segments of the
input multi-segment mbuf.
The GSO library traverses each segment of the input packet, and produces numerous output segments;
for optimal performance, the number of output segments is kept to a minimum. Consequently, the GSO
library maximizes the amount of data contained within each output segment; i.e. each output segment
contains segsz bytes of data. The only exception to this is the very final output segment; if
pkt_len % segsz is non-zero, then the final segment is smaller than the rest.
In order for an output segment to meet its MSS, it may need to include data from multiple input segments.
Due to the nature of indirect mbufs (each indirect mbuf can point to only one direct mbuf), the solution
here is to add another indirect mbuf to the output segment; this additional segment then points to the next
input segment. If necessary, this chaining process is repeated, until the sum of all of the data ‘contained’
in the output segment reaches segsz. This ensures that the amount of data contained within each output
segment is uniform, with the possible exception of the last segment, as previously described.
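The segment-count arithmetic implied above can be sketched as follows (payload only, header handling omitted; the function names are illustrative):

```c
#include <stdint.h>

/* Number of output segments produced for a payload of pkt_len bytes
 * split into segsz-byte segments. */
static uint32_t gso_nb_segments(uint32_t pkt_len, uint32_t segsz)
{
    return (pkt_len + segsz - 1) / segsz;   /* ceiling division */
}

/* Size of the final output segment: smaller than segsz only when
 * pkt_len % segsz is non-zero. */
static uint32_t gso_last_seg_size(uint32_t pkt_len, uint32_t segsz)
{
    uint32_t rem = pkt_len % segsz;
    return rem ? rem : segsz;
}
```

For example, a 3000-byte payload with segsz of 1400 produces three segments, the last one holding 200 bytes.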
Fig. 32.2 illustrates an example of a three-part output segment. In this example, the output segment
needs to include data from the end of one input segment, and the beginning of another. To achieve this,
an additional indirect mbuf is chained to the second part of the output segment, and is attached to the
next input segment (i.e. it points to the data in the next input segment).
[Fig. 32.2: Three-part output segment. The multi-segment input packet (Header, Payload 0, Payload 1,
Payload 2) is chained via next pointers. The output segment consists of a direct mbuf holding a copy of
the headers, followed by two indirect mbufs, each pointing to payload data in consecutive input
segments.]
Note: An application may use the same pool for both direct and indirect buffers. However,
since indirect mbufs simply store a pointer, the application may reduce its memory con-
sumption by creating a separate memory pool, containing smaller elements, for the indirect
pool.
• the size of each output segment, including packet headers and payload, measured in bytes.
• the bit mask of required GSO types. The GSO library uses the same macros as those that de-
scribe a physical device’s TX offloading capabilities (i.e. DEV_TX_OFFLOAD_*_TSO)
for gso_types. For example, if an application wants to segment TCP/IPv4 packets, it
should set gso_types to DEV_TX_OFFLOAD_TCP_TSO. The only other values cur-
rently supported for gso_types are DEV_TX_OFFLOAD_VXLAN_TNL_TSO, and
DEV_TX_OFFLOAD_GRE_TNL_TSO; a combination of these macros is also allowed.
• a flag, that indicates whether the IPv4 headers of output segments should contain fixed or
incremental ID values.
2. Set the appropriate ol_flags in the mbuf.
• The GSO library uses the value of an mbuf’s ol_flags attribute to determine how a packet
should be segmented. It is the application’s responsibility to ensure that these flags are set.
• For example, in order to segment TCP/IPv4 packets, the application should add the
PKT_TX_IPV4 and PKT_TX_TCP_SEG flags to the mbuf’s ol_flags.
• If checksum calculation in hardware is required, the application should also add the
PKT_TX_TCP_CKSUM and PKT_TX_IP_CKSUM flags.
3. Check if the packet should be processed. Packets with one of the following properties are not
processed and are returned immediately:
• Packet length is less than segsz (i.e. GSO is not required).
• Packet type is not supported by GSO library (see Supported GSO Packet Types).
• Application has not enabled GSO support for the packet type.
• Packet’s ol_flags have been incorrectly set.
4. Allocate space in which to store the output GSO segments. If the amount of space allocated by
the application is insufficient, segmentation will fail.
5. Invoke the GSO segmentation API, rte_gso_segment().
6. If required, update the L3 and L4 checksums of the newly-created segments. For tunneled packets,
the outer IPv4 headers’ checksums should also be updated. Alternatively, the application may
offload checksum calculation to HW.
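The decision in step 3 can be modeled in standalone C; the flag constants below are illustrative stand-ins, not the real PKT_TX_* or DEV_TX_OFFLOAD_* values:

```c
#include <stdint.h>

/* Illustrative flag bits, not DPDK's actual values. */
#define F_TX_IPV4     (1u << 0)
#define F_TX_TCP_SEG  (1u << 1)

/* Returns 1 if a TCP/IPv4 packet should be segmented, 0 if it would
 * be returned to the application immediately. */
static int gso_should_process(uint32_t pkt_len, uint32_t segsz,
                              uint32_t ol_flags, uint32_t gso_types_enabled)
{
    if (pkt_len < segsz)
        return 0;   /* GSO is not required */
    if ((ol_flags & (F_TX_IPV4 | F_TX_TCP_SEG)) != (F_TX_IPV4 | F_TX_TCP_SEG))
        return 0;   /* ol_flags not set correctly for TCP/IPv4 GSO */
    if (!(gso_types_enabled & F_TX_TCP_SEG))
        return 0;   /* application has not enabled GSO for this type */
    return 1;
}
```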
THIRTYTHREE
THE LIBRTE_PDUMP LIBRARY
The librte_pdump library provides a framework for packet capturing in DPDK. The library performs a
complete copy of the Rx and Tx mbufs to a new mempool, which slows down application performance,
so it is recommended that this library be used only for debugging.
The library provides the following APIs to initialize the packet capture framework, to enable or disable
the packet capture, and to uninitialize it:
• rte_pdump_init(): This API initializes the packet capture framework.
• rte_pdump_enable(): This API enables the packet capture on a given port and queue. Note:
The filter option in the API is a place holder for future enhancements.
• rte_pdump_enable_by_deviceid(): This API enables the packet capture on a given
device id (vdev name or pci address) and queue. Note: The filter option in the API
is a place holder for future enhancements.
• rte_pdump_disable(): This API disables the packet capture on a given port and queue.
• rte_pdump_disable_by_deviceid(): This API disables the packet capture on a given
device id (vdev name or pci address) and queue.
• rte_pdump_uninit(): This API uninitializes the packet capture framework.
33.1 Operation
The librte_pdump library works on a client/server model. The server is responsible for enabling or
disabling the packet capture and the clients are responsible for requesting the enabling or disabling of
the packet capture.
The packet capture framework, as part of its initialization, creates a pthread and the server socket in that
pthread. The application that calls the framework initialization will have the server socket created, either
under the path that the application has passed or under the default path, i.e. /var/run/.dpdk
for the root user or ~/.dpdk for non-root users.
Applications that request enabling or disabling of the packet capture will have the client socket
created, either under the path that the application has passed or under the default path, i.e.
/var/run/.dpdk for the root user or ~/.dpdk for non-root users, to send the requests to the server.
The server socket will listen for client requests for enabling or disabling the packet capture.
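The default-path selection described above can be expressed as a small helper; this is a standalone sketch (the library derives the choice internally from the effective user id), and the function name is illustrative:

```c
/* Default directory for the pdump server/client sockets:
 * /var/run/.dpdk for root, ~/.dpdk otherwise. */
static const char *
pdump_default_socket_dir(unsigned int uid)
{
    return (uid == 0) ? "/var/run/.dpdk" : "~/.dpdk";
}
```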
THIRTYFOUR
MULTI-PROCESS SUPPORT
In the DPDK, multi-process support is designed to allow a group of DPDK processes to work together
in a simple transparent manner to perform packet processing, or other workloads. To support this func-
tionality, a number of additions have been made to the core DPDK Environment Abstraction Layer
(EAL).
The EAL has been modified to allow different types of DPDK processes to be spawned, each with
different permissions on the hugepage memory used by the applications. For now, there are two types
of process specified:
• primary processes, which can initialize and which have full permissions on shared memory
• secondary processes, which cannot initialize shared memory, but can attach to pre-initialized
shared memory and create objects in it.
Standalone DPDK processes are primary processes, while secondary processes can only run alongside
a primary process or after a primary process has already configured the hugepage shared memory for
them.
Note: Secondary processes should run alongside a primary process built with the same DPDK version.
Secondary processes which require access to physical devices in the primary process must be started
with the same whitelist and blacklist options.
To support these two process types, and other multi-process setups described later, two additional
command-line parameters are available to the EAL:
• --proc-type: for specifying a given process instance as the primary or secondary DPDK
instance
• --file-prefix: to allow processes that do not want to co-operate to have different memory
regions
A number of example applications are provided that demonstrate how multiple DPDK processes can be
used together. These are more fully documented in the “Multi-process Sample Application” chapter in
the DPDK Sample Application’s User Guide.
On application start-up in a primary or standalone process, the DPDK records to memory-mapped files
the details of the memory configuration it is using - hugepages in use, the virtual addresses they are
mapped at, the number of memory channels present, etc. When a secondary process is started, these
files are read and the EAL recreates the same memory configuration in the secondary process so that all
memory zones are shared between processes and all pointers to that memory are valid, and point to the
same objects, in both processes.
Note: Refer to Multi-process Limitations for details of how Linux kernel Address-Space Layout Ran-
domization (ASLR) can affect memory sharing.
If the primary process was run with --legacy-mem or --single-file-segments switch, sec-
ondary processes must be run with the same switch specified. Otherwise, memory corruption may occur.
[Figure: Memory sharing in the primary and secondary DPDK processes. Each process keeps its own
local data, while the DPDK hugepage memory (struct rte_config, struct hugepage[], the IPC queues and
the mbuf pool) is shared between them.]
The EAL also supports an auto-detection mode (set by EAL --proc-type=auto flag ), whereby a
DPDK process is started as a secondary instance if a primary instance is already running.
The first of the processes spawned should be spawned using the --proc-type=primary EAL flag, while all
subsequent instances should be spawned using the --proc-type=secondary flag.
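The auto-detection rule reduces to the following sketch; the detection itself, which the EAL performs by probing the primary's runtime configuration, is abstracted into a boolean here, and the names are illustrative:

```c
enum proc_type { PROC_PRIMARY, PROC_SECONDARY };

/* Model of --proc-type=auto: start as a secondary instance if a
 * primary instance is already running, otherwise start as primary. */
static enum proc_type
resolve_auto_proc_type(int primary_already_running)
{
    return primary_already_running ? PROC_SECONDARY : PROC_PRIMARY;
}
```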
The simple_mp and symmetric_mp sample applications demonstrate this usage model. They are de-
scribed in the “Multi-process Sample Application” chapter in the DPDK Sample Application’s User
Guide.
Note: Independent DPDK instances running side-by-side on a single machine cannot share any network
ports. Any network ports being used by one process should be blacklisted in every other process.
Note: All restrictions and issues with multiple independent DPDK processes running side-by-side
apply in this usage scenario also.
Warning: Disabling Address-Space Layout Randomization (ASLR) may have security implica-
tions, so it is recommended that it be disabled only when absolutely necessary, and only when the
implications of this change have been understood.
• All DPDK processes running as a single application and using shared memory must have distinct
coremask/corelist arguments. It is not possible to have a primary and secondary instance, or two
secondary instances, using any of the same logical cores. Attempting to do so can cause corruption
of memory pool caches, among other issues.
• The delivery of interrupts, such as Ethernet* device link status interrupts, does not work in sec-
ondary processes. All interrupts are triggered inside the primary process only. Any application
needing interrupt notification in multiple processes should provide its own mechanism to trans-
fer the interrupt information from the primary process to any secondary process that needs the
information.
• The use of function pointers between multiple processes running based on different compiled
binaries is not supported, since the location of a given function in one process may be different
from its location in a second. This prevents the librte_hash library from behaving properly in a
multi-process instance, since it uses a pointer to the hash function internally.
To work around this issue, it is recommended that multi-process applications perform the
hash calculations by directly calling the hashing function from the code and then using the
rte_hash_add_with_hash()/rte_hash_lookup_with_hash() functions instead of the functions which do
the hashing internally, such as rte_hash_add()/rte_hash_lookup().
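The workaround can be illustrated with a toy table that mirrors the *_with_hash() calling convention: the application computes the hash value itself, so every process supplies the same signature regardless of where its binary placed the hash function. This is a standalone sketch, not librte_hash:

```c
#include <stdint.h>

#define TBL_SIZE 64

struct entry { int used; uint32_t key; int value; };
static struct entry tbl[TBL_SIZE];

/* Application-side hash function (Knuth multiplicative hash). */
static uint32_t my_hash(uint32_t key) { return key * 2654435761u; }

/* Insert with a precomputed signature; linear probing for collisions. */
static int add_with_hash(uint32_t key, uint32_t sig, int value)
{
    for (uint32_t i = 0; i < TBL_SIZE; i++) {
        uint32_t slot = (sig + i) % TBL_SIZE;
        if (!tbl[slot].used || tbl[slot].key == key) {
            tbl[slot].used = 1;
            tbl[slot].key = key;
            tbl[slot].value = value;
            return 0;
        }
    }
    return -1;   /* table full */
}

/* Lookup with the same precomputed signature. */
static int lookup_with_hash(uint32_t key, uint32_t sig, int *value)
{
    for (uint32_t i = 0; i < TBL_SIZE; i++) {
        uint32_t slot = (sig + i) % TBL_SIZE;
        if (!tbl[slot].used)
            return -1;          /* not found */
        if (tbl[slot].key == key) { *value = tbl[slot].value; return 0; }
    }
    return -1;
}
```

Because the signature is computed by the application and passed in, two processes built from different binaries still agree on where the entry lives.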
• Depending upon the hardware in use, and the number of DPDK processes used, it may not be
possible to have HPET timers available in each DPDK instance. The minimum number of HPET
comparators available to Linux* userspace can be just a single comparator, which means that only
the first, primary DPDK process instance can open and mmap /dev/hpet. If the number of required
DPDK processes exceeds that of the number of available HPET comparators, the TSC (which is
the default timer in this release) must be used as a time source across all processes instead of the
HPET.
• nb_sent - number indicating how many requests were sent (i.e. how many peer processes were
active at the time of the request).
• nb_received - number indicating how many responses were received (i.e. of those peer pro-
cesses that were active at the time of request, how many have replied)
• msgs - pointer to where all of the responses are stored. The order in which responses appear is
undefined. When doing synchronous requests, this memory must be freed by the requestor after
request completes!
For asynchronous requests, a function pointer to the callback function must be provided instead. This
callback will be called when the request has either timed out, or has received a response to all the
messages that were sent.
Warning: When an asynchronous request times out, the callback will be called not by a dedicated
IPC thread, but rather from EAL interrupt thread. Because of this, it may not be possible for DPDK to
trigger another interrupt-based event (such as an alarm) while handling asynchronous IPC callback.
When the callback is called, the original request descriptor will be provided (so that it would be pos-
sible to determine for which sent message this is a callback to), along with a response descriptor like
the one described above. When doing asynchronous requests, there is no need to free the resulting
rte_mp_reply descriptor.
Warning: Simply returning a value when processing a request callback will not send a response
to the request; a response must always be explicitly sent, even in case of errors, and failing to do
so will cause the requestor to time out while waiting on a response. Implementation of error
signalling rests with the application; there is no built-in way to indicate success or error for a request.
Asynchronous request callbacks may be triggered either from IPC thread or from interrupt thread, de-
pending on whether the request has timed out. It is therefore suggested to avoid waiting for interrupt-
based events (such as alarms) inside asynchronous IPC request callbacks. This limitation does not apply
to messages or synchronous requests.
If callbacks spend a long time processing the incoming requests, the requestor might time out, so setting
the right timeout value on the requestor side is imperative.
If some of the messages timed out, nb_sent and nb_received fields in the rte_mp_reply de-
scriptor will not have matching values. This is not treated as error by the IPC API, and it is expected
that the user will be responsible for deciding how to handle such cases.
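A minimal sketch of that check (the struct is a stand-in for the relevant rte_mp_reply fields):

```c
/* Stand-in for the nb_sent/nb_received fields of rte_mp_reply. */
struct mp_reply_model { int nb_sent; int nb_received; };

/* Returns 1 only if every peer that was sent a request replied;
 * a mismatch means some requests timed out. */
static int all_peers_replied(const struct mp_reply_model *r)
{
    return r->nb_received == r->nb_sent;
}
```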
If a callback has been registered, IPC will assume that it is safe to call it. This is important when
registering callbacks during DPDK initialization. During initialization, IPC will consider the receiving
side as non-existing if the callback has not been registered yet. However, once the callback has been
registered, it is expected that IPC should be safe to trigger it, even if the rest of the DPDK initialization
hasn’t finished yet.
THIRTYFIVE
KERNEL NIC INTERFACE
The DPDK Kernel NIC Interface (KNI) allows userspace applications access to the Linux* control
plane.
The benefits of using the DPDK KNI are:
• Faster than existing Linux TUN/TAP interfaces (by eliminating system calls and
copy_to_user()/copy_from_user() operations).
• Allows management of DPDK ports using standard Linux net tools such as ethtool, ifconfig and
tcpdump.
• Allows an interface with the kernel network stack.
The components of an application using the DPDK Kernel NIC Interface are shown in Fig. 35.1.
Loading the rte_kni kernel module without any optional parameters is the typical way a DPDK ap-
plication gets packets into and out of the kernel network stack. Without any parameters, only one kernel
thread is created for all KNI devices for packet receiving in kernel side, loopback mode is disabled, and
the default carrier state of KNI interfaces is set to off.
# insmod kmod/rte_kni.ko
# insmod kmod/rte_kni.ko lo_mode=lo_mode_fifo
The lo_mode_fifo loopback option will loop back ring enqueue/dequeue operations in kernel space.
# insmod kmod/rte_kni.ko lo_mode=lo_mode_fifo_skb
The lo_mode_fifo_skb loopback option will loop back ring enqueue/dequeue operations and sk
buffer copies in kernel space.
If the lo_mode parameter is not specified, loopback mode is disabled.
This mode will create only one kernel thread for all KNI interfaces to receive data on the kernel side. By
default, this kernel thread is not bound to any particular core, but the user can set the core affinity for this
kernel thread by setting the core_id and force_bind parameters in struct rte_kni_conf
when the first KNI interface is created:
For optimum performance, the kernel thread should be bound to a core on the same socket as the
DPDK lcores used in the application.
The KNI kernel module can also be configured to start a separate kernel thread for each KNI interface
created by the DPDK application. Multiple kernel thread mode is enabled as follows:
# insmod kmod/rte_kni.ko kthread_mode=multiple
This mode will create a separate kernel thread for each KNI interface to receive data on the kernel side.
The core affinity of each kni_thread kernel thread can be specified by setting the core_id and
force_bind parameters in struct rte_kni_conf when each KNI interface is created.
Multiple kernel thread mode can provide scalable higher performance if sufficient unused cores are
available on the host system.
If the kthread_mode parameter is not specified, the “single kernel thread” mode is used.
Setting the default carrier state to off is useful when the application treats the KNI interface as a purely
virtual interface that does not correspond to any physical hardware and does not wish to explicitly set
the carrier state of the interface with rte_kni_update_link(). It is also useful for testing in
loopback mode where the NIC port may not be physically connected to anything.
To set the default carrier state to on:
# insmod kmod/rte_kni.ko carrier=on
If the carrier parameter is not specified, the default carrier state of KNI interfaces will be set to off.
this callback function to NULL, but sets the port_id field to a value other than -1, a
default callback handler in the rte_kni library kni_config_promiscusity() will be
called which calls rte_eth_promiscuous_<enable|disable>() on the speci-
fied port_id.
config_allmulticast:
Called when the user changes the allmulticast state of the KNI interface. For example, when
the user runs ifconfig <ifaceX> [-]allmulti. If the user sets this callback func-
tion to NULL, but sets the port_id field to a value other than -1, a default callback han-
dler in the rte_kni library kni_config_allmulticast() will be called which calls
rte_eth_allmulticast_<enable|disable>() on the specified port_id.
In order to run these callbacks, the application must periodically call the
rte_kni_handle_request() function. Any user callback function registered will be called
directly from rte_kni_handle_request() so care must be taken to prevent deadlock and to not
block any DPDK fastpath tasks. Typically DPDK applications which use these callbacks will need to
create a separate thread or secondary process to periodically call rte_kni_handle_request().
The KNI interfaces can be deleted by a DPDK application with rte_kni_release(). All KNI
interfaces not explicitly deleted will be deleted when the /dev/kni device is closed, either explicitly
with rte_kni_close() or when the DPDK application is closed.
35.7 Ethtool
Ethtool is a Linux-specific tool with corresponding support in the kernel. The current version of kni
provides minimal ethtool functionality including querying version and link state. It does not support
link control, statistics, or dumping device registers.
THIRTYSIX
THREAD SAFETY OF DPDK FUNCTIONS
The DPDK is comprised of several libraries. Some of the functions in these libraries can be safely called
from multiple threads simultaneously, while others cannot. This section allows the developer to take
these issues into account when building their own application.
The run-time environment of the DPDK is typically a single thread per logical core. In some cases, it is
not only multi-threaded, but multi-process. Typically, it is best to avoid sharing data structures between
threads and/or processes where possible. Where this is not possible, the execution blocks must
access the data in a thread-safe manner. Mechanisms such as atomics or locking can be used to
allow execution blocks to operate serially. However, this can have an effect on the performance of the
application.
The setup and configuration of the PMD is not performance sensitive, but is not thread safe either. It
is possible that the multiple read/writes during PMD setup and configuration could be corrupted in a
multi-thread environment. Since this is not performance sensitive, the developer can choose to add their
own layer to provide thread-safe setup and configuration. It is expected that, in most applications, the
initial configuration of the network ports would be done by a single thread at startup.
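One possible shape for such an application-supplied layer is sketched below with a pthread mutex; port_mtu and configure_port() are illustrative stand-ins for real ethdev configuration state and calls, not a DPDK API:

```c
#include <pthread.h>

/* Serialize all port setup/configuration through one lock, since the
 * PMD configuration paths themselves are not thread-safe. */
static pthread_mutex_t cfg_lock = PTHREAD_MUTEX_INITIALIZER;
static int port_mtu[4];   /* toy per-port configuration state */

static void configure_port(int port, int mtu)
{
    pthread_mutex_lock(&cfg_lock);   /* only one thread configures at a time */
    port_mtu[port] = mtu;
    pthread_mutex_unlock(&cfg_lock);
}
```

In practice, most applications avoid the lock altogether by performing all port configuration from a single thread at startup, as the text notes.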
THIRTYSEVEN
EVENT DEVICE LIBRARY
The DPDK Event device library is an abstraction that provides the application with features to schedule
events. This is achieved using the PMD architecture similar to the ethdev or cryptodev APIs, which may
already be familiar to the reader.
The eventdev framework introduces the event driven programming model. In a polling model, lcores
poll ethdev ports and associated Rx queues directly to look for a packet. By contrast in an event driven
model, lcores call the scheduler that selects packets for them based on programmer-specified criteria.
The Eventdev library adds support for an event driven programming model, which offers applications
automatic multicore scaling, dynamic load balancing, pipelining, packet ingress order maintenance and
synchronization services to simplify application packet processing.
By introducing an event driven programming model, DPDK can support both polling and event driven
programming models for packet processing, and applications are free to choose whatever model (or
combination of the two) best suits their needs.
Step-by-step instructions for the eventdev design are available in the API Walk-through section later in
this document.
37.1.3 Queues
An event queue is a queue containing events that are scheduled by the event device. An event queue
contains events of different flows associated with scheduling types, such as atomic, ordered, or parallel.
37.1.4 Ports
Ports are the points of contact between worker cores and the eventdev. The general use case will see one
CPU core using one port to enqueue and dequeue events from an eventdev. Ports are linked to queues in
order to retrieve events from those queues (more details in Linking Queues and Ports below).
Fig. 37.1: Sample eventdev usage, with RX, two atomic stages and a single-link to TX.
In the following code, we configure an eventdev instance with 3 queues and 6 ports. The 3
queues consist of 2 Atomic and 1 Single-Link, while the 6 ports consist of 4 workers, 1 RX and 1 TX.
const struct rte_event_dev_config config = {
.nb_event_queues = 3,
.nb_event_ports = 6,
.nb_events_limit = 4096,
.nb_event_queue_flows = 1024,
.nb_event_port_dequeue_depth = 128,
.nb_event_port_enqueue_depth = 128,
};
int err = rte_event_dev_configure(dev_id, &config);
int tx_port_id = 5;
int err = rte_event_port_setup(dev_id, tx_port_id, &tx_conf);
As shown above:
• port 0: RX core
• ports 1,2,3,4: Workers
• port 5: TX core
These ports are used for the remainder of this walk-through.
Note: EventDev needs to be started before starting the event producers such as event_eth_rx_adapter,
event_timer_adapter and event_crypto_adapter.
ev[i].priority = RTE_EVENT_DEV_PRIORITY_NORMAL;
ev[i].mbuf = mbufs[i];
}
37.3 Summary
The eventdev library allows an application to easily schedule events as it requires, using either a run-
to-completion or a pipeline processing model. The queues and ports abstract the logical functionality
of an eventdev, providing the application with a generic method to schedule events. With the flexible
PMD infrastructure, applications benefit from improvements in existing eventdevs and from the
addition of new ones without modification.
THIRTYEIGHT
EVENT ETHERNET RX ADAPTER LIBRARY
The DPDK Eventdev API allows the application to use an event driven programming model for packet
processing. In this model, the application polls an event device port for receiving events that reference
packets instead of polling Rx queues of ethdev ports. Packet transfer between ethdev and the event
device can be supported in hardware or require a software thread to receive packets from the ethdev port
using ethdev poll mode APIs and enqueue these as events to the event device using the eventdev API.
Both transfer mechanisms may be present on the same platform, depending on the particular combination
of the ethdev and the event device. For SW-based packet transfer, if the mbuf does not have a timestamp
set, the adapter adds a timestamp to the mbuf using rte_get_tsc_cycles(); this provides a more accurate
timestamp than one set later by the application, since it avoids event device schedule latency.
The Event Ethernet Rx Adapter library is intended for the application code to configure both transfer
mechanisms using a common API. A capability API allows the eventdev PMD to advertise features sup-
ported for a given ethdev and allows the application to perform configuration as per supported features.
rx_p_conf.new_event_threshold = dev_info.max_num_events;
rx_p_conf.dequeue_depth = dev_info.max_event_port_dequeue_depth;
rx_p_conf.enqueue_depth = dev_info.max_event_port_enqueue_depth;
err = rte_event_eth_rx_adapter_create(id, dev_id, &rx_p_conf);
If the application desires to have finer control of eventdev port allocation and setup,
it can use the rte_event_eth_rx_adapter_create_ext() function. The
Programmer’s Guide, Release 19.11.10
struct rte_event_eth_rx_adapter_queue_conf queue_config;

queue_config.rx_queue_flags = 0;
queue_config.ev = ev;
queue_config.servicing_weight = 1;
err = rte_event_eth_rx_adapter_queue_add(id,
                eth_dev_id,
                0, &queue_config);
queue_config.rx_queue_flags = 0;
if (cap & RTE_EVENT_ETH_RX_ADAPTER_CAP_OVERRIDE_FLOW_ID) {
        ev.flow_id = 1;
        queue_config.rx_queue_flags =
                RTE_EVENT_ETH_RX_ADAPTER_QUEUE_FLOW_ID_VALID;
}
uint32_t service_id;

if (rte_event_eth_rx_adapter_service_id_get(0, &service_id) == 0)
        rte_service_map_lcore_set(service_id, RX_CORE_ID);
Note: The eventdev to which the event_eth_rx_adapter is connected needs to be started before calling
rte_event_eth_rx_adapter_start().
39 Event Ethernet Tx Adapter Library
The DPDK Eventdev API allows the application to use an event driven programming model for packet
processing in which the event device distributes events referencing packets to the application cores in a
dynamic load balanced fashion while handling atomicity and packet ordering. Event adapters provide the
interface between the ethernet, crypto and timer devices and the event device. Event adapter APIs enable
common application code by abstracting PMD specific capabilities. The Event Ethernet Tx Adapter provides configuration and data path APIs for the transmit stage of the application, allowing the same application code to use eventdev PMD support or, in its absence, a common implementation.
In the common implementation, the application enqueues mbufs to the adapter which runs as a
rte_service function. The service function dequeues events from its event port and transmits the mbufs
referenced by these events.
tx_p_conf.new_event_threshold = dev_info.max_num_events;
tx_p_conf.dequeue_depth = dev_info.max_event_port_dequeue_depth;
tx_p_conf.enqueue_depth = dev_info.max_event_port_enqueue_depth;
event.mbuf = m;
eq_flags = 0;
m->port = tx_port;
rte_event_eth_tx_adapter_txq_set(m, tx_queue_id);
40 Event Timer Adapter Library
The DPDK Event Device library introduces an event driven programming model which presents appli-
cations with an alternative to the polling model traditionally used in DPDK applications. Event devices
can be coupled with arbitrary components to provide new event sources by using event adapters. The
Event Timer Adapter is one such adapter; it bridges event devices and timer mechanisms.
The Event Timer Adapter library extends the event driven model by introducing a new type of event that
represents a timer expiration, and providing an API with which adapters can be created or destroyed,
and event timers can be armed and canceled.
The Event Timer Adapter library is designed to interface with hardware or software implementations
of the timer mechanism; it will query an eventdev PMD to determine which implementation should be
used. The default software implementation manages timers using the DPDK Timer library.
Examples of using the API are presented in the API Overview and Processing Timer Expiry Events
sections. Code samples are abstracted and are based on the example of handling a TCP retransmission.
40.1.3 State
Before arming an event timer, the application should initialize its state to
RTE_EVENT_TIMER_NOT_ARMED. The event timer’s state will be updated when a request
to arm or cancel it takes effect.
If the application wishes to rearm the timer after it has expired, it should reset the state back to
RTE_EVENT_TIMER_NOT_ARMED before doing so.
Before creating an instance of a timer adapter, the application should create and configure an event device along with its event ports. Based on the event device capability, creating an additional event port to be used by the timer adapter might be required. If the adapter uses a software service component, the application should map its service to an lcore:
uint32_t service_id;

if (rte_event_timer_adapter_service_id_get(adapter, &service_id) == 0)
        rte_service_map_lcore_set(service_id, EVTIM_CORE_ID);
An event timer adapter uses a service component if the event device PMD indicates that the adapter
should use a software implementation.
Note: The eventdev to which the event_timer_adapter is connected needs to be started before calling
rte_event_timer_adapter_start().
Once an event timer expires, the application may free it or rearm it as necessary. If the application will
rearm the timer, the state should be reset to RTE_EVENT_TIMER_NOT_ARMED by the application
before rearming it.
case RTE_EVENT_TYPE_TIMER:
        process_timer_event(ev);
        ...
        break;
}

uint8_t
process_timer_event(...)
{
        /* A retransmission timeout for the connection has been received. */
        conn = ev.event_ptr;

        /* Retransmit last packet (e.g. TCP segment). */
        ...

        /* Re-arm timer using original values. */
        rte_event_timer_arm_burst(adapter_id, &conn->timer, 1);
}
40.4 Summary
The Event Timer Adapter library extends the DPDK event-based programming model by representing
timer expirations as events in the system and allowing applications to use existing event processing loops
to arm and cancel event timers or handle timer expiry events.
41 Event Crypto Adapter Library
The DPDK Eventdev library provides an event driven programming model with features to schedule events. The Cryptodev library provides an interface to the crypto poll mode drivers, which support different crypto operations. The Event Crypto Adapter is one of the adapters intended to bridge the event device and the crypto device.
The packet flow from the crypto device to the event device can be accomplished using a SW or HW based transfer mechanism. The adapter queries an eventdev PMD to determine which mechanism should be used. The adapter uses an EAL service core function for SW based packet transfer and uses the eventdev PMD functions to configure HW based packet transfer between the crypto device and the event device. The crypto adapter uses a new event type called RTE_EVENT_TYPE_CRYPTODEV to indicate the event source.
The application can choose to submit a crypto operation directly to the crypto device or send it to the crypto adapter via eventdev, based on the RTE_EVENT_CRYPTO_ADAPTER_CAP_INTERNAL_PORT_OP_FWD capability. The first mode is known as the event new (RTE_EVENT_CRYPTO_ADAPTER_OP_NEW) mode and the second as the event forward (RTE_EVENT_CRYPTO_ADAPTER_OP_FORWARD) mode. The mode can be specified when creating the adapter. In the former mode, it is the application's responsibility to enable ingress packet ordering; in the latter mode, it is the adapter's responsibility.
[Figure: application pipeline with the event crypto adapter. The recoverable callouts are:
1. Application dequeues events from the previous (atomic) stage.
2. Application prepares the crypto operations.
3. Crypto operations are submitted to the cryptodev by the application.]
In this mode, events dequeued from the adapter will be treated as forwarded events.
The application needs to specify the cryptodev ID and queue pair ID (request information) needed
to enqueue a crypto operation in addition to the event information (response information) needed to
enqueue an event after the crypto operation has completed.
struct rte_event_dev_info dev_info;
struct rte_event_port_conf conf;

rte_event_dev_info_get(dev_id, &dev_info);
conf.new_event_threshold = dev_info.max_num_events;
conf.dequeue_depth = dev_info.max_event_port_dequeue_depth;
conf.enqueue_depth = dev_info.max_event_port_enqueue_depth;
mode = RTE_EVENT_CRYPTO_ADAPTER_OP_FORWARD;
err = rte_event_crypto_adapter_create(id, dev_id, &conf, mode);
If the application desires to have finer control of eventdev port allocation and setup, it can use the rte_event_crypto_adapter_create_ext() function. This function is passed a callback that is invoked if the adapter needs to use a service function and needs to create an event port for it. The callback is expected to fill the struct rte_event_crypto_adapter_conf structure passed to it.
For RTE_EVENT_CRYPTO_ADAPTER_OP_FORWARD mode, the event port created by the adapter can be retrieved using the rte_event_crypto_adapter_event_port_get() API. The application can use this event port to link with the event queue on which it enqueues events towards the crypto adapter.
uint8_t id, evdev, crypto_ev_port_id, app_qid;
struct rte_event ev;
int ret;
rte_cryptodev_configure(cdev_id, &conf);
rte_cryptodev_queue_pair_setup(cdev_id, qp_id, &qp_conf);
The cryptodev ID and queue pair are added to the adapter instance using the rte_event_crypto_adapter_queue_pair_add() API and removed using the rte_event_crypto_adapter_queue_pair_del() API. If HW supports the RTE_EVENT_CRYPTO_ADAPTER_CAP_INTERNAL_PORT_QP_EV_BIND capability, event information must be passed to the add API.
uint32_t cap;
uint32_t service_id;
int ret;

if (rte_event_crypto_adapter_service_id_get(id, &service_id) == 0)
        rte_service_map_lcore_set(service_id, CORE_ID);
memset(&m_data, 0, sizeof(m_data));
memset(&ev, 0, sizeof(ev));
/* Fill event information and update event_ptr to rte_crypto_op */
ev.event_ptr = op;

if (op->sess_type == RTE_CRYPTO_OP_WITH_SESSION) {
        /* Copy response information */
        rte_memcpy(&m_data.response_info, &ev, sizeof(ev));
        /* Copy request information */
        m_data.request_info.cdev_id = cdev_id;
        m_data.request_info.queue_pair_id = qp_id;
        /* Call set API to store private data information */
        rte_cryptodev_sym_session_set_user_data(
                op->sym->session,
                &m_data,
                sizeof(m_data));
} else if (op->sess_type == RTE_CRYPTO_OP_SESSIONLESS) {
        uint32_t len = IV_OFFSET + MAXIMUM_IV_LENGTH +
                       (sizeof(struct rte_crypto_sym_xform) * 2);
        op->private_data_offset = len;
        /* Copy response information */
        rte_memcpy(&m_data.response_info, &ev, sizeof(ev));
        /* Copy request information */
        m_data.request_info.cdev_id = cdev_id;
        m_data.request_info.queue_pair_id = qp_id;
        /* Store private data information along with rte_crypto_op */
        rte_memcpy((uint8_t *)op + len, &m_data, sizeof(m_data));
}
Note: The eventdev to which the event_crypto_adapter is connected needs to be started before calling
rte_event_crypto_adapter_start().
42 Quality of Service (QoS) Framework
This pipeline can be built using reusable DPDK software libraries. The main blocks implementing QoS
in this pipeline are: the policer, the dropper and the scheduler. A functional description of each block is
provided in the following table.
42.2.1 Overview
The hierarchical scheduler block is similar to the traffic manager block used by network processors that
typically implement per flow (or per group of flows) packet queuing and scheduling. It typically acts
like a buffer that is able to temporarily store a large number of packets just before their transmission
(enqueue operation); as the NIC TX is requesting more packets for transmission, these packets are later
on removed and handed over to the NIC TX with the packet selection logic observing the predefined
SLAs (dequeue operation).
The hierarchical scheduler is optimized for a large number of packet queues. When only a small number
of queues are needed, message passing queues should be used instead of this block. See Worst Case
Scenarios for Performance for a more detailed discussion.
The scheduling hierarchy levels are: port, subport, pipe, traffic class and queue.
Usage Example
/* File "application.c" */

#define N_PKTS_RX    64
#define N_PKTS_TX    48
#define NIC_RX_PORT  0
#define NIC_RX_QUEUE 0
#define NIC_TX_PORT  1
#define NIC_TX_QUEUE 0

/* Initialization */
<initialization code>

/* Runtime */
while (1) {
        /* Read packets from NIC RX queue */
        nb_pkts = rte_eth_rx_burst(NIC_RX_PORT, NIC_RX_QUEUE, pkts_rx, N_PKTS_RX);

        /* Hierarchical scheduler enqueue */
        rte_sched_port_enqueue(port, pkts_rx, nb_pkts);

        /* Hierarchical scheduler dequeue */
        nb_pkts = rte_sched_port_dequeue(port, pkts_tx, N_PKTS_TX);

        /* Write packets to NIC TX queue */
        rte_eth_tx_burst(NIC_TX_PORT, NIC_TX_QUEUE, pkts_tx, nb_pkts);
}
42.2.4 Implementation
Internal Data Structures per Port
A schematic of the internal data structures is shown in the accompanying figure, with details in the accompanying table.
Running enqueue and dequeue operations for the same output port from different cores is likely to cause a significant impact on the scheduler's performance, and it is therefore not recommended.
The port enqueue and dequeue operations share access to the following data structures:
1. Packet descriptors
2. Queue table
3. Queue storage area
4. Bitmap of active queues
The expected drop in performance is due to:
1. Need to make the queue and bitmap operations thread safe, which requires either using locking primitives for access serialization (for example, spinlocks/semaphores) or using atomic primitives for lockless access (for example, Test and Set, Compare and Swap, and so on). The impact is much higher in the former case.
2. Ping-pong of cache lines storing the shared data structures between the cache hierarchies of the
two cores (done transparently by the MESI protocol cache coherency CPU hardware).
Therefore, the scheduler enqueue and dequeue operations have to be run from the same thread, which
allows the queues and the bitmap operations to be non-thread safe and keeps the scheduler data structures
internal to the same core.
Performance Scaling
Scaling up the number of NIC ports simply requires a proportional increase in the number of CPU cores
to be used for traffic scheduling.
Enqueue Pipeline
The sequence of steps per packet:
1. Access the mbuf to read the data fields required to identify the destination queue for the packet.
These fields are: port, subport, traffic class and queue within traffic class, and are typically set by
the classification stage.
2. Access the queue structure to identify the write location in the queue array. If the queue is full,
then the packet is discarded.
3. Access the queue array location to store the packet (i.e. write the mbuf pointer).
Note the strong data dependency between these steps: steps 2 and 3 cannot start before the results of steps 1 and 2, respectively, become available, which prevents the processor's out-of-order execution engine from providing any significant performance optimization.
Given the high rate of input packets and the large amount of queues, it is expected that the data structures
accessed to enqueue the current packet are not present in the L1 or L2 data cache of the current core,
thus the above 3 memory accesses would result (on average) in L1 and L2 data cache misses. Three L1/L2 cache misses per packet is not acceptable for performance reasons.
The workaround is to prefetch the required data structures in advance. The prefetch operation has an
execution latency during which the processor should not attempt to access the data structure currently
under prefetch, so the processor should execute other work. The only other work available is to exe-
cute different stages of the enqueue sequence of operations on other input packets, thus resulting in a
pipelined implementation for the enqueue operation.
Fig. 42.5 illustrates a pipelined implementation for the enqueue operation with 4 pipeline stages and
each stage executing 2 different input packets. No input packet can be part of more than one pipeline
stage at a given time.
Fig. 42.5: Prefetch Pipeline for the Hierarchical Scheduler Enqueue Operation
The congestion management scheme implemented by the enqueue pipeline described above is very
basic: packets are enqueued until a specific queue becomes full, then all the packets destined to the
same queue are dropped until packets are consumed (by the dequeue operation). This can be improved
by enabling RED/WRED as part of the enqueue pipeline which looks at the queue occupancy and packet
priority in order to yield the enqueue/drop decision for a specific packet (as opposed to enqueuing all
packets / dropping all packets indiscriminately).
The dequeue pipe state machine exploits the presence of data in the processor cache; therefore, it tries to send as many packets from the same pipe traffic class and pipe as possible (up to the available packets and credits) before moving to the next active TC from the same pipe (if any) or to another active pipe.
Fig. 42.6: Pipe Prefetch State Machine for the Hierarchical Scheduler Dequeue Operation
The scheduler needs to keep track of time advancement for the credit logic, which requires credit updates
based on time (for example, subport and pipe traffic shaping, traffic class upper limit enforcement, and
so on).
Every time the scheduler decides to send a packet out to the NIC TX for transmission, the scheduler will
increment its internal time reference accordingly. Therefore, it is convenient to keep the internal time
reference in units of bytes, where a byte signifies the time duration required by the physical interface to
send out a byte on the transmission medium. This way, as a packet is scheduled for transmission, the
time is incremented with (n + h), where n is the packet length in bytes and h is the number of framing
overhead bytes per packet.
The scheduler needs to align its internal time reference to the pace of the port conveyor belt. The reason
is to make sure that the scheduler does not feed the NIC TX with more bytes than the line rate of the
physical medium in order to prevent packet drop (by the scheduler, due to the NIC TX input queue being
full, or later on, internally by the NIC TX).
The scheduler reads the current time on every dequeue invocation. The CPU time stamp can be obtained by reading either the Time Stamp Counter (TSC) register or the High Precision Event Timer (HPET) register. The current CPU time stamp is converted from number of CPU clocks to number of bytes: time_bytes = time_cycles / cycles_per_byte, where cycles_per_byte is the number of CPU cycles that is equivalent to the transmission time for one byte on the wire (e.g. for a CPU frequency of 2 GHz and a 10 GbE port, cycles_per_byte = 1.6).
The scheduler maintains an internal time reference of the NIC time. Whenever a packet is scheduled,
the NIC time is incremented with the packet length (including framing overhead). On every dequeue
invocation, the scheduler checks its internal reference of the NIC time against the current time:
1. If NIC time is in the future (NIC time >= current time), no adjustment of NIC time is needed. This
means that scheduler is able to schedule NIC packets before the NIC actually needs those packets,
so the NIC TX is well supplied with packets;
2. If NIC time is in the past (NIC time < current time), then NIC time should be adjusted by setting
it to the current time. This means that the scheduler is not able to keep up with the speed of the
NIC byte conveyor belt, so NIC bandwidth is wasted due to poor packet supply to the NIC TX.
The scheduler round trip delay (SRTD) is the time (number of CPU cycles) between two consecutive
examinations of the same pipe by the scheduler.
To keep up with the output port (that is, avoid bandwidth loss), the scheduler should be able to schedule
n packets faster than the same n packets are transmitted by NIC TX.
The scheduler needs to keep up with the rate of each individual pipe, as configured for the pipe token
bucket, assuming that no port oversubscription is taking place. This means that the size of the pipe token
bucket should be set high enough to prevent it from overflowing due to big SRTD, as this would result
in credit loss (and therefore bandwidth loss) for the pipe.
Credit Logic
Scheduling Decision
The scheduling decision to send next packet from (subport S, pipe P, traffic class TC, queue Q) is
favorable (packet is sent) when all the conditions below are met:
• Pipe P of subport S is currently selected by one of the port grinders;
• Traffic class TC is the highest priority active traffic class of pipe P;
• Queue Q is the next queue selected by WRR within traffic class TC of pipe P;
• Subport S has enough credits to send the packet;
• Subport S has enough credits for traffic class TC to send the packet;
• Pipe P has enough credits to send the packet;
• Pipe P has enough credits for traffic class TC to send the packet.
If all the above conditions are met, then the packet is selected for transmission and the necessary credits
are subtracted from subport S, subport S traffic class TC, pipe P, pipe P traffic class TC.
Framing Overhead
As the greatest common divisor for all packet lengths is one byte, the unit of credit is selected as one
byte. The number of credits required for the transmission of a packet of n bytes is equal to (n+h), where
h is equal to the number of framing overhead bytes per packet.
Traffic Shaping
The traffic shaping for subport and pipe is implemented using a token bucket per subport/per pipe. Each
token bucket is implemented using one saturated counter that keeps track of the number of available
credits.
The token bucket generic parameters and operations are presented in Table 42.6 and Table 42.7.
Traffic Classes
Strict priority scheduling of traffic classes within the same pipe is implemented by the pipe dequeue state machine, which selects the queues in ascending order. Therefore, queue 0 (associated with TC 0, the highest priority TC) is handled before queue 1 (TC 1, lower priority than TC 0), which is handled before queue 2 (TC 2, lower priority than TC 1), and so on, until the queues of all TCs except the lowest priority TC are handled. Finally, queues 12..15 (best effort TC, the lowest priority TC) are handled.
The traffic classes at the pipe and subport levels are not traffic shaped, so there is no token bucket
maintained in this context. The upper limit for the traffic classes at the subport and pipe levels is enforced
by periodically refilling the subport / pipe traffic class credit counter, out of which credits are consumed
every time a packet is scheduled for that subport / pipe, as described in Table 42.10 and Table 42.11.
Table 42.10: Subport/Pipe Traffic Class Upper Limit Enforcement Persistent Data Structure

# | Subport or pipe field | Unit | Description
1 | tc_time | Bytes | Time of the next update (upper limit refill) for the TCs of the current subport/pipe. See Section Internal Time Reference for the explanation of why the time is maintained in byte units.
2 | tc_period | Bytes | Time between two consecutive updates for all the TCs of the current subport/pipe. This is expected to be many times bigger than the typical value of the token bucket tb_period.
3 | tc_credits_per_period | Bytes | Upper limit for the number of credits allowed to be consumed by the current TC during each enforcement period tc_period.
4 | tc_credits | Bytes | Current upper limit for the number of credits that can be consumed by the current traffic class for the remainder of the current enforcement period.
The evolution of the WRR design solution for the lowest priority traffic class (best effort TC) from
simple to complex is shown in Table 42.12.
Problem Statement
Oversubscription for subport traffic class X is a configuration-time event that occurs when more band-
width is allocated for traffic class X at the level of subport member pipes than allocated for the same
traffic class at the parent subport level.
The existence of the oversubscription for a specific subport and traffic class is solely the result of pipe
and subport-level configuration as opposed to being created due to dynamic evolution of the traffic load
at run-time (as congestion is).
When the overall demand for traffic class X for the current subport is low, the existence of the oversub-
scription condition does not represent a problem, as demand for traffic class X is completely satisfied
for all member pipes. However, this can no longer be achieved when the aggregated demand for traffic
class X for all subport member pipes exceeds the limit configured at the subport level.
Solution Space
The following table summarizes some of the possible approaches for handling this problem, with the third approach selected for implementation.
Implementation Overview
The algorithm computes a watermark, which is periodically updated based on the current demand experienced by the subport member pipes, and whose purpose is to limit the amount of traffic that each pipe is allowed to send for the best effort TC. The watermark is computed at the subport level at the beginning of each traffic class upper limit enforcement period, and the same value is used by all the subport member pipes throughout the current enforcement period. Table 42.14 illustrates how the watermark computed at subport level at the beginning of each period is propagated to all subport member pipes.
At the beginning of the current enforcement period (which coincides with the end of the previous enforcement period), the value of the watermark is adjusted based on how much of the bandwidth allocated to the best effort TC at the beginning of the previous period was left unused by the subport member pipes at the end of that period.
If there was subport best effort TC bandwidth left unused, the value of the watermark for the current
period is increased to encourage the subport member pipes to consume more bandwidth. Otherwise,
the value of the watermark is decreased to enforce equality of bandwidth consumption among subport
member pipes for best effort TC.
The increase or decrease in the watermark value is done in small increments, so several enforcement
periods might be required to reach the equilibrium state. This state can change at any moment due to
variations in the demand experienced by the subport member pipes for best effort TC, for example, as
a result of demand increase (when the watermark needs to be lowered) or demand decrease (when the
watermark needs to be increased).
When demand is low, the watermark is set high to prevent it from impeding the subport member pipes from consuming more bandwidth. The highest value for the watermark is picked as the highest rate configured for a subport member pipe. Table 42.14 and Table 42.15 illustrate the watermark operation.
Table 42.14: Watermark Propagation from Subport Level to Member Pipes at the Beginning of Each Traffic Class Upper Limit Enforcement Period

No. | Subport Traffic Class Operation | Description
1 | Initialization | Subport level: subport_period_id = 0. Pipe level: pipe_period_id = 0.
2 | Credit update | Subport level: if (time >= subport_tc_time) { subport_wm = water_mark_update(); subport_tc_time = time + subport_tc_period; subport_period_id++; } Pipe level: if (pipe_period_id != subport_period_id) { pipe_ov_credits = subport_wm * pipe_weight; pipe_period_id = subport_period_id; }
3 | Credit consumption (on packet scheduling) | Pipe level: pkt_credits = pkt_len + frame_overhead; if (pipe_ov_credits >= pkt_credits) { pipe_ov_credits -= pkt_credits; }
42.3 Dropper
The purpose of the DPDK dropper is to drop packets arriving at a packet scheduler to avoid conges-
tion. The dropper supports the Random Early Detection (RED), Weighted Random Early Detection
(WRED) and tail drop algorithms. Fig. 42.7 illustrates how the dropper integrates with the scheduler.
The DPDK currently does not support congestion management so the dropper provides the only method
for congestion avoidance.
The dropper uses the Random Early Detection (RED) congestion avoidance algorithm as documented
in the reference publication. The purpose of the RED algorithm is to monitor a packet queue, determine
the current congestion level in the queue and decide whether an arriving packet should be enqueued or
dropped. The RED algorithm uses an Exponential Weighted Moving Average (EWMA) filter to compute
average queue size which gives an indication of the current congestion level in the queue.
For each enqueue operation, the RED algorithm compares the average queue size to minimum and
maximum thresholds. Depending on whether the average queue size is below, above or in between these
thresholds, the RED algorithm calculates the probability that an arriving packet should be dropped and
makes a random decision based on this probability.
The dropper also supports Weighted Random Early Detection (WRED) by allowing the scheduler to
select different RED configurations for the same packet queue at run-time. In the case of severe conges-
tion, the dropper resorts to tail drop. This occurs when a packet queue has reached maximum capacity
and cannot store any more packets. In this situation, all arriving packets are dropped.
The flow through the dropper is illustrated in Fig. 42.8. The RED/WRED algorithm is exercised first
and tail drop second.
The use cases supported by the dropper are:
• Initialize configuration data
• Initialize run-time data
42.3.1 Configuration
A RED configuration contains the parameters given in Table 42.16.
The average queue size is calculated using an EWMA filter, according to Equation 1:

avg = (1 - wq) * avg + wq * q

Where:
• avg = average queue size
• wq = filter weight
• q = actual queue size
Note: The filter weight, wq = 1/2^n, where n is the filter weight parameter value passed to the dropper module on configuration (see Section 2.23.3.1).
When the queue is empty, the average queue size is calculated according to Equation 2:

avg = (1 - wq)^m * avg

Where:
• m = the number of enqueue operations that could have occurred on this queue while the queue was empty
In the dropper module, m is defined as:

m = (time - qtime) / s

Where:
• time = current time
• qtime = time the queue became empty
• s = typical time between successive enqueue operations on this queue
The time reference is in units of bytes, where a byte signifies the time duration required by the physical
interface to send out a byte on the transmission medium (see Section Internal Time Reference). The
parameter s is defined in the dropper module as a constant with the value: s=2^22. This corresponds to
the time required by every leaf node in a hierarchy with 64K leaf nodes to transmit one 64-byte packet
onto the wire and represents the worst case scenario. For much smaller scheduler hierarchies, it may be
necessary to reduce the parameter s, which is defined in the red header source file (rte_red.h) as:
#define RTE_RED_S
Since the time reference is in bytes, the port speed is implied in the expression: time-qtime. The dropper
does not have to be configured with the actual port speed. It adjusts automatically to low speed and high
speed links.
Implementation
A numerical method is used to compute the factor (1-wq)^m that appears in Equation 2.
This method is based on the following identity:

(1 - wq)^m = 2^(m * log2(1 - wq))
In the dropper module, a look-up table is used to compute log2(1-wq) for each value of wq supported by
the dropper module. The factor (1-wq)^m can then be obtained by multiplying the table value by m and
applying shift operations. To avoid overflow in the multiplication, the value, m, and the look-up table
values are limited to 16 bits. The total size of the look-up table is 56 bytes. Once the factor (1-wq)^m is
obtained using this method, the average queue size can be calculated from Equation 2.
Alternative Approaches
Other methods for calculating the factor (1-wq)^m in the expression for computing average queue size
when the queue is empty (Equation 2) were considered. These approaches include:
• Floating-point evaluation
• Fixed-point evaluation using a small look-up table (512B) and up to 16 multiplications (this is the
approach used in the FreeBSD* ALTQ RED implementation)
• Fixed-point evaluation using a small look-up table (512B) and 16 SSE multiplications (SSE opti-
mized version of the approach used in the FreeBSD* ALTQ RED implementation)
• Large look-up table (76 KB)
The method that was finally selected (described above in Section 26.3.2.2.1) outperforms all of these approaches in terms of run-time performance and memory requirements, and also achieves accuracy comparable to floating-point evaluation. Table 42.17 lists the performance of each of these alternative approaches relative to the method that is used in the dropper. As can be seen, the floating-point implementation achieved the worst performance.
The initial packet drop probability is calculated according to Equation 3:

pb = maxp * (avg - minth) / (maxth - minth)

Where:
• maxp = mark probability
• avg = average queue size
• minth = minimum threshold
• maxth = maximum threshold
The calculation of the packet drop probability using Equation 3 is illustrated in Fig. 42.10. If the average
queue size is below the minimum threshold, an arriving packet is enqueued. If the average queue size is
at or above the maximum threshold, an arriving packet is dropped. If the average queue size is between
the minimum and maximum thresholds, a drop probability is calculated to determine if the packet should
be enqueued or dropped.
If the average queue size is between the minimum and maximum thresholds, then the actual drop probability is calculated from Equation 4:

pa = pb / (2 - count * pb)

Where:
• pb = initial drop probability (from Equation 3)
• count = number of packets that have arrived since the last drop
The constant 2, in Equation 4 is the only deviation from the drop probability formulae given in the
reference document where a value of 1 is used instead. It should be noted that the value pa computed
from can be negative or greater than 1. If this is the case, then a value of 1 should be used instead.
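The calculation described above can be sketched in plain C. This assumes the standard RED form of Equation 3 (a linear ramp of the initial drop probability between the two thresholds) and the factor-2 variant of Equation 4 described in the text; it is illustrative only, not the fixed-point code used by the dropper, and the function names are hypothetical:

```c
#include <assert.h>

/* Equation 3 (assumed standard RED form): initial drop probability pb. */
static double red_pb(double avg, double minth, double maxth, double maxp)
{
	if (avg < minth)
		return 0.0; /* below minimum threshold: enqueue */
	if (avg >= maxth)
		return 1.0; /* at or above maximum threshold: drop */
	return maxp * (avg - minth) / (maxth - minth);
}

/* Equation 4 with the factor-2 deviation: pa = pb / (2 - count * pb),
 * replaced by 1 when the result would be negative or greater than 1. */
static double red_pa(double pb, unsigned int count)
{
	double denom = 2.0 - (double)count * pb;
	double pa;

	if (denom <= 0.0)
		return 1.0;
	pa = pb / denom;
	if (pa < 0.0 || pa > 1.0)
		pa = 1.0;
	return pa;
}
```

With minth = 16, maxth = 32 and maxp = 1/10, an average queue size of 24 gives pb = 0.05, and the first packet after a drop (count = 0) sees pa = 0.025, i.e. half the value the factor-1 formula from the reference document would produce.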
The initial and actual drop probabilities are shown in Fig. 42.11. The actual drop probability is shown for
the case where the formula given in the reference document is used (blue curve) and also for the case
where the formula implemented in the dropper module is used (red curve). The formula in the reference
document results in a significantly higher drop rate compared to the mark probability configuration
parameter specified by the user. The choice to deviate from the reference document is simply a design
decision, and one that has been taken by other RED implementations, for example, FreeBSD* ALTQ
RED.
Fig. 42.11: Initial Drop Probability (pb), Actual Drop probability (pa) Computed Using a Factor 1 (Blue
Curve) and a Factor 2 (Red Curve)
• DPDK/lib/librte_sched/rte_red.h
• DPDK/lib/librte_sched/rte_red.c
This parameter must be set to y. The parameter is found in the build configuration files in the
DPDK/config directory, for example, DPDK/config/common_linux. RED configuration parameters are
specified in the rte_red_params structure within the rte_sched_port_params structure that is passed to
the scheduler on initialization. RED parameters are specified separately for each traffic class and for
three packet colors (green, yellow and red), allowing the scheduler to implement Weighted Random
Early Detection (WRED).
Note: For correct operation, the same EWMA filter weight parameter (wred weight) should be used for
each packet color (green, yellow, red) in the same traffic class (tc).
; RED params per traffic class and color (Green / Yellow / Red)
[red]
tc 0 wred min = 28 22 16
tc 0 wred max = 32 32 32
tc 0 wred inv prob = 10 10 10
tc 0 wred weight = 9 9 9
tc 1 wred min = 28 22 16
tc 1 wred max = 32 32 32
tc 1 wred inv prob = 10 10 10
tc 1 wred weight = 9 9 9
tc 2 wred min = 28 22 16
tc 2 wred max = 32 32 32
tc 2 wred inv prob = 10 10 10
tc 2 wred weight = 9 9 9
tc 3 wred min = 28 22 16
tc 3 wred max = 32 32 32
tc 3 wred inv prob = 10 10 10
tc 3 wred weight = 9 9 9
tc 4 wred min = 28 22 16
tc 4 wred max = 32 32 32
tc 4 wred inv prob = 10 10 10
tc 4 wred weight = 9 9 9
tc 5 wred min = 28 22 16
tc 5 wred max = 32 32 32
tc 5 wred inv prob = 10 10 10
tc 5 wred weight = 9 9 9
tc 6 wred min = 28 22 16
tc 6 wred max = 32 32 32
tc 6 wred inv prob = 10 10 10
tc 6 wred weight = 9 9 9
tc 7 wred min = 28 22 16
tc 7 wred max = 32 32 32
tc 7 wred inv prob = 10 10 10
tc 7 wred weight = 9 9 9
tc 8 wred min = 28 22 16
tc 8 wred max = 32 32 32
tc 8 wred inv prob = 10 10 10
tc 8 wred weight = 9 9 9
tc 9 wred min = 28 22 16
tc 9 wred max = 32 32 32
tc 9 wred inv prob = 10 10 10
tc 9 wred weight = 9 9 9
tc 10 wred min = 28 22 16
tc 10 wred max = 32 32 32
tc 10 wred inv prob = 10 10 10
tc 10 wred weight = 9 9 9
tc 11 wred min = 28 22 16
tc 11 wred max = 32 32 32
tc 11 wred inv prob = 10 10 10
tc 11 wred weight = 9 9 9
tc 12 wred min = 28 22 16
tc 12 wred max = 32 32 32
tc 12 wred inv prob = 10 10 10
tc 12 wred weight = 9 9 9
With this configuration file, the RED configuration that applies to green, yellow and red packets in traffic
class 0 is shown in Table 42.18.
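Based on the parameter naming, the ini values presumably map to RED parameters as follows: the "wred inv prob" entries give the inverse of the mark probability maxp, and the "wred weight" entries give log2 of the inverse EWMA filter weight wq. A minimal sketch under that assumption (plain C, not the DPDK configuration API; the function name is illustrative):

```c
#include <assert.h>

/* Assumed mapping of the tc 0 / green-packet ini values to RED parameters. */
static void tc0_red_params(double *maxp, double *wq)
{
	const int wred_inv_prob = 10; /* tc 0 wred inv prob = 10 */
	const int wred_weight = 9;    /* tc 0 wred weight = 9 */

	*maxp = 1.0 / wred_inv_prob;    /* mark probability = 0.1 */
	*wq = 1.0 / (1 << wred_weight); /* EWMA filter weight = 1/512 */
}
```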
The arguments passed to the enqueue API are configuration data, run-time data, the current size of the
packet queue (in packets) and a value representing the current time. The time reference is in units of
bytes, where a byte signifies the time duration required by the physical interface to send out a byte on
the transmission medium (see Section 26.2.4.5.1 “Internal Time Reference”). The dropper reuses the
scheduler time value for efficiency.
Empty API
The syntax of the empty API is as follows:
void rte_red_mark_queue_empty(struct rte_red *red, const uint64_t time)
The arguments passed to the empty API are run-time data and the current time in bytes.
• Update the C and E / P token buckets. This is done by reading the current time (from the CPU
timestamp counter), identifying the amount of time since the last bucket update and computing the
associated number of tokens (according to the pre-configured bucket rate). The number of tokens
in the bucket is limited by the pre-configured bucket size;
• Identify the output color for the current packet based on the size of the IP packet and the amount
of tokens currently available in the C and E / P buckets; for color aware mode only, the input color
of the packet is also considered. When the output color is not red, a number of tokens equal to
the length of the IP packet is subtracted from the C bucket, the E / P bucket, or both, depending
on the algorithm and the output color of the packet.
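The two steps above can be sketched as a minimal color-blind meter in plain C. This is illustrative only, not the DPDK meter API: the structure, field names and function names are all hypothetical, and the time source (step 1 reads the CPU timestamp counter) is abstracted into a plain `now` argument:

```c
#include <assert.h>
#include <stdint.h>

enum meter_color { COLOR_GREEN, COLOR_YELLOW, COLOR_RED };

/* Minimal token bucket; tokens are in bytes, rate in bytes per time unit. */
struct tb {
	uint64_t tokens;    /* current token count */
	uint64_t size;      /* bucket size (upper limit on tokens) */
	uint64_t rate;      /* tokens credited per elapsed time unit */
	uint64_t last_time; /* time of last update */
};

/* Step 1: credit tokens for the elapsed time, capped at the bucket size. */
static void tb_update(struct tb *b, uint64_t now)
{
	uint64_t add = (now - b->last_time) * b->rate;

	b->last_time = now;
	b->tokens = (b->tokens + add > b->size) ? b->size : b->tokens + add;
}

/* Step 2, color-blind srTCM-style: green if the C bucket covers the packet,
 * yellow if the E bucket does, red otherwise. Tokens are consumed only when
 * the output color is not red. */
static enum meter_color tb_meter(struct tb *c, struct tb *e,
				 uint64_t now, uint64_t pkt_len)
{
	tb_update(c, now);
	tb_update(e, now);
	if (c->tokens >= pkt_len) {
		c->tokens -= pkt_len;
		return COLOR_GREEN;
	}
	if (e->tokens >= pkt_len) {
		e->tokens -= pkt_len;
		return COLOR_YELLOW;
	}
	return COLOR_RED;
}
```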
FORTYTHREE
POWER MANAGEMENT
The DPDK Power Management feature allows user space applications to save power by dynamically
adjusting the CPU frequency or entering different C-states.
• Adjusting the CPU frequency dynamically according to the utilization of the RX queues.
• Entering deeper C-states when adaptive algorithms predict brief periods of time during which the
application can be suspended because no packets are being received.
The interfaces for adjusting the operating CPU frequency are in the power management library. C-state
control is implemented in applications according to the different use cases.
Programmer’s Guide, Release 19.11.10
• User does not know how much real load is on a system, resulting in wasted energy as no power
management is utilized
Compared to the original l3fwd-power design, instead of going to sleep after detecting an empty poll,
the new mechanism simply lowers the core frequency. As a result, the application does not stop polling
the device, which leads to improved handling of bursts of traffic.
When the system becomes busy, the empty poll mechanism can also increase the core frequency (in-
cluding turbo) to make a best effort for intensive traffic. This gives more flexible and balanced traffic
awareness than the standard l3fwd-power application.
43.9 References
• The L3 Forwarding with Power Management chapter in the DPDK Sample Applications User Guide.
• The Virtual Machine Power Management chapter in the DPDK Sample Applications User Guide.
FORTYFOUR
PACKET CLASSIFICATION AND ACCESS CONTROL
The DPDK provides an Access Control library that gives the ability to classify an input packet based on
a set of classification rules.
The ACL library is used to perform an N-tuple search over a set of rules with multiple categories and
find the best match (highest priority) for each category. The library API provides the following basic
operations:
• Create a new Access Control (AC) context.
• Add rules into the context.
• For all rules in the context, build the runtime structures necessary to perform packet classification.
• Perform input packet classifications.
• Destroy an AC context and its runtime structures and free the associated memory.
44.1 Overview
44.1.1 Rule definition
The current implementation allows the user to specify, for each AC context, its own rule layout (set of
fields) over which packet classification will be performed. There are, however, a few restrictions on the
rule field layout:
• The first field in the rule definition has to be one byte long.
• All subsequent fields have to be grouped into sets of 4 consecutive bytes.
This is done mainly for performance reasons: the search function processes the first input byte as part
of the flow setup, and the inner loop of the search function is then unrolled to process four input bytes
at a time.
To define each field inside an AC rule, the following structure is used:
struct rte_acl_field_def {
    uint8_t type;        /**< type - ACL_FIELD_TYPE. */
    uint8_t size;        /**< size of field 1, 2, 4, or 8. */
    uint8_t field_index; /**< index of field inside the rule. */
    uint8_t input_index; /**< 0-N input index. */
    uint32_t offset;     /**< offset to start of field. */
};
– _RANGE - for fields such as ports that have a lower and upper value for the field.
– _BITMASK - for fields such as protocol identifiers that have a value and a bit mask.
• size The size parameter defines the length of the field in bytes. Allowable values are 1, 2, 4,
or 8 bytes. Note that due to the grouping of input bytes, 1 or 2 byte fields must be defined as
consecutive fields that make up 4 consecutive input bytes. Also, it is best to define fields of 8 or
more bytes as 4 byte fields so that the build processes can eliminate fields that are all wild.
• field_index A zero-based value that represents the position of the field inside the rule; 0 to N-1 for
N fields.
• input_index As mentioned above, all input fields, except the very first one, must be in groups of 4
consecutive bytes. The input index specifies to which input group that field belongs to.
• offset The offset field defines the offset for the field. This is the offset from the beginning of the
buffer parameter for the search.
For example, to define classification for the following IPv4 5-tuple structure:
struct ipv4_5tuple {
    uint8_t proto;
    uint32_t ip_src;
    uint32_t ip_dst;
    uint16_t port_src;
    uint16_t port_dst;
};
/*
* Next 2 fields (src & dst ports) form 4 consecutive bytes.
* They share the same input index.
*/
{
    .type = RTE_ACL_FIELD_TYPE_RANGE,
    .size = sizeof (uint16_t),
    .field_index = 3,
    .input_index = 3,
    .offset = offsetof (struct ipv4_5tuple, port_src),
},
{
    .type = RTE_ACL_FIELD_TYPE_RANGE,
    .size = sizeof (uint16_t),
    .field_index = 4,
    .input_index = 3,
    .offset = offsetof (struct ipv4_5tuple, port_dst),
},
};
Any IPv4 packet with protocol ID 17 (UDP), source address 192.168.1.[0-255], destination address
192.168.2.31, source port [0-65535] and destination port 1234 matches the above rule.
To define classification for the IPv6 2-tuple: <protocol, IPv6 source address> over the following IPv6
header structure:
struct rte_ipv6_hdr {
    uint32_t vtc_flow;    /* IP version, traffic class & flow label. */
    uint16_t payload_len; /* IP packet length - includes sizeof(ip_header). */
    uint8_t proto;        /* Protocol, next header. */
    uint8_t hop_limits;   /* Hop limits. */
    uint8_t src_addr[16]; /* IP address of source host. */
    uint8_t dst_addr[16]; /* IP address of destination host(s). */
} __attribute__((__packed__));
{
    .type = RTE_ACL_FIELD_TYPE_MASK,
    .size = sizeof (uint32_t),
    .field_index = 1,
    .input_index = 1,
    .offset = offsetof (struct rte_ipv6_hdr, src_addr[0]),
},
{
    .type = RTE_ACL_FIELD_TYPE_MASK,
    .size = sizeof (uint32_t),
    .field_index = 2,
    .input_index = 2,
    .offset = offsetof (struct rte_ipv6_hdr, src_addr[4]),
},
{
    .type = RTE_ACL_FIELD_TYPE_MASK,
    .size = sizeof (uint32_t),
    .field_index = 3,
    .input_index = 3,
    .offset = offsetof (struct rte_ipv6_hdr, src_addr[8]),
},
{
    .type = RTE_ACL_FIELD_TYPE_MASK,
    .size = sizeof (uint32_t),
    .field_index = 4,
    .input_index = 4,
    .offset = offsetof (struct rte_ipv6_hdr, src_addr[12]),
},
};
Any IPv6 packet with protocol ID 6 (TCP) and source address inside the range
[2001:db8:1234:0000:0000:0000:0000:0000 - 2001:db8:1234:ffff:ffff:ffff:ffff:ffff] matches the above
rule.
In the following example, the last element of the search key is 8 bits long, so this is a case where the 4
consecutive bytes of an input field are not fully occupied. The structure for the classification is:
struct acl_key {
    uint8_t ip_proto;
    uint32_t ip_src;
    uint32_t ip_dst;
    uint8_t tos; /**< This is partially using a 32-bit input element */
};
/*
* Next element of search key (Type of Service) is indeed 1 byte long.
* Anyway we need to allocate all the 4 consecutive bytes for it.
*/
{
    .type = RTE_ACL_FIELD_TYPE_BITMASK,
    .size = sizeof (uint32_t), /* All the 4 consecutive bytes are allocated */
    .field_index = 3,
    .input_index = 3,
    .offset = offsetof (struct acl_key, tos),
},
};
Any IPv4 packet with protocol ID 6 (TCP), source address 192.168.1.[0-255], destination address
192.168.2.31 and ToS 1 matches the above rule.
When creating a set of rules, the following additional information must also be supplied for each rule:
• priority: A weight to measure the priority of the rules (higher is better). If the input tuple matches
more than one rule, then the rule with the higher priority is returned. Note that if the input tuple
matches more than one rule and these rules have equal priority, it is undefined which rule is
returned as a match. It is recommended to assign a unique priority for each rule.
• category_mask: Each rule uses a bit mask value to select the relevant category(s) for the rule.
When a lookup is performed, the result for each category is returned. This effectively provides a
“parallel lookup” by enabling a single search to return multiple results if, for example, there were
four different sets of ACL rules, one for access control, one for routing, and so on. Each set could
be assigned its own category and by combining them into a single database, one lookup returns a
result for each of the four sets.
• userdata: A user-defined value. For each category, a successful match returns the userdata field
of the highest priority matched rule. When no rules match, the returned value is zero.
Note: When adding new rules into an ACL context, all fields must be in host byte order (LSB). When
the search is performed for an input tuple, all fields in that tuple must be in network byte order (MSB).
This gives the user the ability to make decisions about the performance/space trade-off. For example:
struct rte_acl_ctx *acx;
struct rte_acl_config cfg;
int ret;

/*
 * assuming that acx points to an already created AC context,
 * populated with rules, and that cfg is filled in properly.
 */

/* try to build with RT structures limited to 8MB. */
cfg.max_size = 0x800000;
ret = rte_acl_build(acx, &cfg);

/*
 * RT structures can't fit into 8MB for the given context.
 * Try to build without exposing any hard limit.
 */
if (ret == -ERANGE) {
    cfg.max_size = 0;
    ret = rte_acl_build(acx, &cfg);
}
Note: For more details about the Access Control API, please refer to the DPDK API Reference.
The following example demonstrates IPv4, 5-tuple classification for rules defined above with multiple
categories in more detail.
RTE_ACL_RULE_DEF(acl_ipv4_rule, RTE_DIM(ipv4_defs));
/* destination IPv4 */
.field[2] = {.value.u32 = RTE_IPV4(192,168,0,0), .mask_range.u32 = 16,},
/* source port */
.field[3] = {.value.u16 = 0, .mask_range.u16 = 0xffff,},
/* destination port */
.field[4] = {.value.u16 = 0, .mask_range.u16 = 0xffff,},
},
/* destination IPv4 */
.field[2] = {.value.u32 = RTE_IPV4(192,168,1,0), .mask_range.u32 = 24,},
/* source port */
.field[3] = {.value.u16 = 0, .mask_range.u16 = 0xffff,},
/* destination port */
.field[4] = {.value.u16 = 0, .mask_range.u16 = 0xffff,},
},
/* source IPv4 */
.field[1] = {.value.u32 = RTE_IPV4(10,1,1,1), .mask_range.u32 = 32,},
/* source port */
.field[3] = {.value.u16 = 0, .mask_range.u16 = 0xffff,},
/* destination port */
.field[4] = {.value.u16 = 0, .mask_range.u16 = 0xffff,},
},
};
cfg.num_categories = 2;
cfg.num_fields = RTE_DIM(ipv4_defs);
For a tuple with source IP address: 10.1.1.1 and destination IP address: 192.168.1.15, once the following
lines are executed:
uint32_t results[4]; /* make classify for 4 categories. */

rte_acl_classify(acx, data, results, 1, 4);
• For category 0, both rules 1 and 2 match, but rule 2 has higher priority, therefore results[0] contains
the userdata for rule 2.
• For category 1, both rules 1 and 3 match, but rule 3 has higher priority, therefore results[1] contains
the userdata for rule 3.
• For categories 2 and 3, there are no matches, so results[2] and results[3] contain zero, which
indicates that no matches were found for those categories.
For a tuple with source IP address: 192.168.1.1 and destination IP address: 192.168.2.11, once the
following lines are executed:
uint32_t results[4]; /* make classify by 4 categories. */

rte_acl_classify(acx, data, results, 1, 4);
FORTYFIVE
PACKET FRAMEWORK
45.2 Overview
Packet processing applications are frequently structured as pipelines of multiple stages, with the logic
of each stage glued around a lookup table. For each incoming packet, the table defines the set of actions
to be applied to the packet, as well as the next stage to send the packet to.
The DPDK Packet Framework minimizes the development effort required to build packet processing
pipelines by defining a standard methodology for pipeline development, as well as providing libraries of
reusable templates for the commonly used pipeline blocks.
The pipeline is constructed by connecting the set of input ports with the set of output ports through a
set of tables in a tree-like topology. As a result of the lookup operation for the current packet in the
current table, one of the table entries (on lookup hit) or the default table entry (on lookup miss) provides
the set of actions to be applied to the current packet, as well as the next hop for the packet, which can be
either another table, an output port or packet drop.
An example of packet processing pipeline is presented in Fig. 45.1:
Fig. 45.1: Example of Packet Processing Pipeline where Input Ports 0 and 1 are Connected with Output
Ports 0, 1 and 2 through Tables 0 and 1
2. delete key: When no value is currently associated with key, this operation has no effect. When key
is already associated with value, then the association (key, value) is removed;
3. lookup key: When no value is currently associated with key, this operation returns a void value
(lookup miss). When key is associated with value, this operation returns value. The (key, value)
association is not changed.
The matching criterion used to compare the input key against the keys in the associative array is exact
match, as the key size (number of bytes) and the key value (array of bytes) have to match exactly for the
two keys under comparison.
Hash Function
A hash function deterministically maps data of variable length (key) to data of fixed size (hash value or
key signature). Typically, the size of the key is bigger than the size of the key signature. The hash func-
tion basically compresses a long key into a short signature. Several keys can share the same signature
(collisions).
High quality hash functions have uniform distribution. For a large number of keys, when dividing the
space of signature values into a fixed number of equal intervals (buckets), it is desirable to have the
key signatures evenly distributed across these intervals (uniform distribution), as opposed to most of
the signatures going into only a few of the intervals and the rest of the intervals being largely unused
(non-uniform distribution).
Hash Table
A hash table is an associative array that uses a hash function for its operation. The reason for using a
hash function is to optimize the performance of the lookup operation by minimizing the number of table
keys that have to be compared against the input key.
Instead of storing the (key, value) pairs in a single list, the hash table maintains multiple lists (buckets).
For any given key, there is a single bucket where that key might exist, and this bucket is uniquely
identified based on the key signature. Once the key signature is computed and the hash table bucket
identified, the key is either located in this bucket or it is not present in the hash table at all, so the key
search can be narrowed down from the full set of keys currently in the table to just the set of keys
currently in the identified table bucket.
The performance of the hash table lookup operation is greatly improved, provided that the table keys are
evenly distributed among the hash table buckets, which can be achieved by using a hash function with
uniform distribution. The rule to map a key to its bucket can simply be to use the key signature (modulo
the number of table buckets) as the table bucket ID:
bucket_id = f_hash(key) % n_buckets;
By selecting the number of buckets to be a power of two, the modulo operator can be replaced by a
bitwise AND logical operation:
bucket_id = f_hash(key) & (n_buckets - 1);
Considering n_bits as the number of bits set in bucket_mask = n_buckets - 1, this means that all the keys
that end up in the same hash table bucket have the lower n_bits of their signature identical. In order to
reduce the number of keys in the same bucket (collisions), the number of hash table buckets needs to be
increased.
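The equivalence of the two bucket ID expressions above can be checked directly: for a power-of-two bucket count, the bitwise AND keeps exactly the lower n_bits of the signature, which is what the modulo computes. A minimal self-contained check (the signature values are arbitrary, standing in for f_hash output):

```c
#include <assert.h>
#include <stdint.h>

/* For n_buckets a power of two, sig % n_buckets == sig & (n_buckets - 1):
 * both keep only the lower log2(n_buckets) bits of the signature. */
static uint32_t bucket_id(uint32_t sig, uint32_t n_buckets)
{
	return sig & (n_buckets - 1);
}
```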
In packet processing context, the sequence of operations involved in hash table operations is described
in Fig. 45.2:
Fig. 45.2: Sequence of Steps for Hash Table Operations in a Packet Processing Context
Table 45.5: Configuration Parameters Common for All Hash Table Types

#  Parameter                  Details
1  Key size                   Measured as number of bytes. All keys have the same size.
2  Key value (key data) size  Measured as number of bytes.
3  Number of buckets          Needs to be a power of two.
4  Maximum number of keys     Needs to be a power of two.
5  Hash function              Examples: jhash, CRC hash, etc.
6  Hash function seed         Parameter to be passed to the hash function.
7  Key offset                 Offset of the lookup key byte array within the packet
                              meta-data stored in the packet buffer.
On initialization, each hash table bucket is allocated space for exactly 4 keys. As keys are added to the
table, it can happen that a given bucket already has 4 keys when a new key has to be added to this bucket.
The possible options are:
1. Least Recently Used (LRU) Hash Table. One of the existing keys in the bucket is deleted and
the new key is added in its place. The number of keys in each bucket never grows bigger than 4.
The logic to pick the key to be dropped from the bucket is LRU. The hash table lookup operation
maintains the order in which the keys in the same bucket are hit, so every time a key is hit, it
becomes the new Most Recently Used (MRU) key, i.e. the last candidate for drop. When a key
is added to the bucket, it also becomes the new MRU key. When a key needs to be picked and
dropped, the first candidate for drop, i.e. the current LRU key, is always picked. The LRU logic
requires maintaining specific data structures per each bucket.
2. Extendable Bucket Hash Table. The bucket is extended with space for 4 more keys. This is done
by allocating additional memory at table initialization time, which is used to create a pool of free
keys (the size of this pool is configurable and always a multiple of 4). On key add operation, the
allocation of a group of 4 keys only happens successfully within the limit of free keys, otherwise
the key add operation fails. On key delete operation, a group of 4 keys is freed back to the pool of
free keys when the key to be deleted is the only key that was used within its group of 4 keys at that
time. On key lookup operation, if the current bucket is in extended state and a match is not found
in the first group of 4 keys, the search continues beyond the first group of 4 keys, potentially until
all keys in this bucket are examined. The extendable bucket logic requires maintaining specific
data structures per table and per each bucket.
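The per-bucket LRU bookkeeping described in option 1 can be sketched as follows. This is an illustrative model only (the real implementation packs the list into an 8-byte field and avoids branches): four key slots per bucket, with position 0 holding the MRU index and position 3 the LRU index, i.e. the drop candidate:

```c
#include <assert.h>
#include <stdint.h>

/* On a key hit (or key add), move that key's index to the MRU position
 * (lru[0]); the index at lru[3] remains the next drop candidate. */
static void lru_hit(uint8_t lru[4], uint8_t key_idx)
{
	int i, pos = 3;

	for (i = 0; i < 4; i++) {
		if (lru[i] == key_idx) {
			pos = i;
			break;
		}
	}
	for (i = pos; i > 0; i--)
		lru[i] = lru[i - 1]; /* shift the more recent entries down */
	lru[0] = key_idx;            /* hit key becomes the new MRU */
}
```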
Signature Computation
buffer as packet meta-data. The second CPU core reads both the key and the key signature from
the packet meta-data and performs the bucket search step of the key lookup operation.
2. Key signature computed on lookup (“do-sig” version). The same CPU core reads the key from
the packet meta-data, uses it to compute the key signature and also performs the bucket search
step of the key lookup operation.
Table 45.7: Configuration Parameters Specific to Pre-computed Key Signature Hash Table
# Parameter Details
1 Signature offset Offset of the pre-computed key signature within the packet meta-data.
For specific key sizes, the data structures and algorithm of the key lookup operation can be specially
handcrafted for further performance improvements, so the following options are possible:
1. Implementation supporting configurable key size.
2. Implementation supporting a single key size. Typical key sizes are 8 bytes and 16 bytes.
By splitting the processing into several stages that are executed on different packets (the packets from
the input burst are interlaced), enough work is created to allow the prefetch instructions to complete
successfully (before the prefetched data structures are actually accessed) and also the data dependency
between instructions is loosened. For example, for the 4-stage pipeline, stage 0 is executed on packets
0 and 1 and then, before same packets 0 and 1 are used (i.e. before stage 1 is executed on packets
0 and 1), different packets are used: packets 2 and 3 (executing stage 1), packets 4 and 5 (executing
stage 2) and packets 6 and 7 (executing stage 3). By executing useful work while the data structures
are brought into the L1 or L2 cache memory, the latency of the read memory accesses is hidden. By
increasing the gap between two consecutive accesses to the same data structure, the data dependency
between instructions is loosened; this allows making the best use of the super-scalar and out-of-order
execution CPU architecture, as the number of CPU core execution units that are active (rather than idle
or stalled due to data dependency constraints between instructions) is maximized.
The bucket search logic is also implemented without using any branch instructions. This avoids the
important cost associated with flushing the CPU core execution pipeline on every instance of branch
misprediction.
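The interlacing idea can be sketched as a minimal two-stage software pipeline: while the table entry for the next packet is being prefetched, useful work is done on the current packet, hiding the read latency. This is illustrative only (the real implementation uses a 4-stage pipeline on pairs of packets), and it assumes a GCC/Clang-style `__builtin_prefetch`:

```c
#include <stddef.h>
#include <stdint.h>

/* Stage 0: prefetch the table entry for packet i+1.
 * Stage 1: access the (already prefetched) table entry for packet i. */
static void lookup_burst(const uint32_t *sig, uint32_t *out,
			 const uint32_t *table, size_t n_pkts, uint32_t mask)
{
	size_t i;

	for (i = 0; i < n_pkts; i++) {
		if (i + 1 < n_pkts)
			__builtin_prefetch(&table[sig[i + 1] & mask]);
		out[i] = table[sig[i] & mask];
	}
}
```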
Fig. 45.3, Table 45.8 and Table 45.9 detail the main data structures used to implement configurable key
size hash tables (either LRU or extendable bucket, either with pre-computed signature or “do-sig”).
Fig. 45.3: Data Structures for Configurable Key Size Hash Tables
Table 45.8: Main Large Data Structures (Arrays) used for Configurable Key Size Hash Tables

#  Array name               Number of entries             Entry size (bytes)  Description
1  Bucket array             n_buckets (configurable)      32                  Buckets of the hash table.
2  Bucket extensions array  n_buckets_ext (configurable)  32                  This array is only created for
                                                                              extendable bucket tables.
3  Key array                n_keys (configurable)         key_size            Keys added to the hash table.
4  Data array               n_keys (configurable)         entry_size          Key values (key data) associated
                                                                              with the hash table keys.
Table 45.9: Field Description for Bucket Array Entry (Configurable Key Size Hash Tables)

#  Field name        Field size (bytes)  Description
1  Next Ptr/LRU      8                   For LRU tables, this field represents the LRU list for the
                                         current bucket, stored as an array of 4 entries of 2 bytes
                                         each. Entry 0 stores the index (0 .. 3) of the MRU key,
                                         while entry 3 stores the index of the LRU key.
                                         For extendable bucket tables, this field represents the
                                         next pointer (i.e. the pointer to the next group of 4 keys
                                         linked to the current bucket). The next pointer is not NULL
                                         if the bucket is currently extended, or NULL otherwise. To
                                         help the branchless implementation, bit 0 (least
                                         significant bit) of this field is set to 1 if the next
                                         pointer is not NULL and to 0 otherwise.
2  Sig[0 .. 3]       4 x 2               If key X (X = 0 .. 3) is valid, then sig X bits 15 .. 1
                                         store the most significant 15 bits of the key X signature
                                         and sig X bit 0 is set to 1.
                                         If key X is not valid, then sig X is set to zero.
3  Key Pos [0 .. 3]  4 x 4               If key X (X = 0 .. 3) is valid, then Key Pos X represents
                                         the index into the key array where key X is stored, as well
                                         as the index into the data array where the value associated
                                         with key X is stored.
                                         If key X is not valid, then the value of Key Pos X is
                                         undefined.
Fig. 45.4 and Table 45.10 detail the bucket search pipeline stages (either LRU or extendable bucket,
either with pre-computed signature or “do-sig”). For each pipeline stage, the described operations are
applied to each of the two packets handled by that stage.
Fig. 45.4: Bucket Search Pipeline for Key Lookup Operation (Configurable Key Size Hash Tables)
Table 45.10: Description of the Bucket Search Pipeline Stages (Configurable Key Size Hash Tables)

#  Stage name                 Description
0  Prefetch packet meta-data  Select the next two packets from the burst of input packets.
                              Prefetch the packet meta-data containing the key and key
                              signature.
1  Prefetch table bucket      Read the key signature from the packet meta-data (for
                              extendable bucket hash tables) or read the key from the packet
                              meta-data and compute the key signature (for LRU tables).
                              Identify the bucket ID using the key signature.
                              Set bit 0 of the signature to 1 (to match only signatures of
                              valid keys from the table).
                              Prefetch the bucket.
2  Prefetch table key         Read the key signatures from the bucket.
                              Compare the signature of the input key against the 4 key
                              signatures from the bucket. As a result, the following is
                              obtained:
                              match = TRUE if there was at least one signature match, FALSE
                              in the case of no signature match;
                              match_many = TRUE if there was more than one signature match
                              (there can be up to 4 signature matches in the worst case
                              scenario), FALSE otherwise;
                              match_pos = the index of the first key that produced a
                              signature match (only valid if match is TRUE).
                              For extendable bucket hash tables only, set match_many to TRUE
                              if the next pointer is valid.
                              Prefetch the bucket key indicated by match_pos (even if
                              match_pos does not point to a valid key).
3  Prefetch table data        Read the bucket key indicated by match_pos.
                              Compare the bucket key against the input key. As a result, the
                              following is obtained: match_key = TRUE if the two keys match,
                              FALSE otherwise.
                              Report the input key as a lookup hit only when both match and
                              match_key are set to TRUE, and as a lookup miss otherwise.
Additional notes:
1. The pipelined version of the bucket search algorithm is executed only if there are at least 7 packets
in the burst of input packets. If there are less than 7 packets in the burst of input packets, a non-
optimized implementation of the bucket search algorithm is executed.
2. Once the pipelined version of the bucket search algorithm has been executed for all the packets
in the burst of input packets, the non-optimized implementation of the bucket search algorithm is
also executed for any packets that did not produce a lookup hit, but have the match_many flag set.
As result of executing the non-optimized version, some of these packets may produce a lookup
hit or lookup miss. This does not impact the performance of the key lookup operation, as the
probability of matching more than one signature in the same group of 4 keys or of having the
bucket in extended state (for extendable bucket hash tables only) is relatively small.
Key Signature Comparison Logic
The key signature comparison logic is described in Table 45.11.
Table 45.12: Collapsed Lookup Tables for Match, Match_Many and Match_Pos
Bit array Hexadecimal value
match 1111_1111_1111_1110 0xFFFELLU
match_many 1111_1110_1110_1000 0xFEE8LLU
match_pos 0001_0010_0001_0011__0001_0010_0001_0000 0x12131210LLU
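The collapsed tables above can be evaluated branchlessly by shifting: one bit per value of the 4-bit signature comparison mask for match and match_many, and two bits per mask value for match_pos. A self-contained sketch under that reading of Table 45.12 (the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Collapsed lookup tables from Table 45.12, indexed by the 4-bit mask of
 * per-key signature matches (bit X set = key X's signature matched). */
static const uint64_t lut_match = 0xFFFELLU;
static const uint64_t lut_match_many = 0xFEE8LLU;
static const uint64_t lut_match_pos = 0x12131210LLU;

static void sig_match(uint32_t mask4, int *match, int *match_many,
		      uint32_t *match_pos)
{
	*match = (lut_match >> mask4) & 1;          /* 1 bit per mask value */
	*match_many = (lut_match_many >> mask4) & 1;
	*match_pos = (lut_match_pos >> (mask4 << 1)) & 3; /* 2 bits each */
}
```

For example, mask 0100b (only key 2 matched) yields match = 1, match_many = 0, match_pos = 2, with no branch depending on the mask value.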
Fig. 45.5, Fig. 45.6, Table 45.13 and Table 45.14 detail the main data structures used to implement 8-
byte and 16-byte key hash tables (either LRU or extendable bucket, either with pre-computed signature
or “do-sig”).
Table 45.13: Main Large Data Structures (Arrays) used for 8-byte and 16-byte Key Size Hash Tables

#  Array name               Number of entries             Entry size (bytes)    Description
1  Bucket array             n_buckets (configurable)      8-byte key size:      Buckets of the hash
                                                          64 + 4 x entry_size   table.
                                                          16-byte key size:
                                                          128 + 4 x entry_size
2  Bucket extensions array  n_buckets_ext (configurable)  8-byte key size:      This array is only
                                                          64 + 4 x entry_size   created for extendable
                                                          16-byte key size:     bucket tables.
                                                          128 + 4 x entry_size
Table 45.14: Field Description for Bucket Array Entry (8-byte and 16-byte Key Hash Tables)

#  Field name      Field size (bytes)  Description
1  Valid           8                   Bit X (X = 0 .. 3) is set to 1 if key X is valid, or to 0
                                       otherwise.
                                       Bit 4 is only used for extendable bucket tables, to help
                                       with the implementation of the branchless logic. In this
                                       case, bit 4 is set to 1 if the next pointer is valid (not
                                       NULL) or to 0 otherwise.
2  Next Ptr/LRU    8                   For LRU tables, this field represents the LRU list for the
                                       current bucket, stored as an array of 4 entries of 2 bytes
                                       each. Entry 0 stores the index (0 .. 3) of the MRU key,
                                       while entry 3 stores the index of the LRU key.
                                       For extendable bucket tables, this field represents the
                                       next pointer (i.e. the pointer to the next group of 4 keys
                                       linked to the current bucket). The next pointer is not NULL
                                       if the bucket is currently extended, or NULL otherwise.
3  Key [0 .. 3]    4 x key_size        Full keys.
4  Data [0 .. 3]   4 x entry_size      Full key values (key data) associated with keys 0 .. 3.
Fig. 45.7 and Table 45.15 detail the bucket search pipeline used to implement 8-byte and 16-byte key
hash tables (either LRU or extendable bucket, with either pre-computed signature or “do-sig”). For each
pipeline stage, the described operations are applied to each of the two packets handled by that stage.
Fig. 45.7: Bucket Search Pipeline for Key Lookup Operation (Single Key Size Hash Tables)
Table 45.15: Description of the Bucket Search Pipeline Stages (8-byte and 16-byte Key Hash Tables)

Stage 0: Prefetch packet meta-data
   1. Select the next two packets from the burst of input packets.
   2. Prefetch the packet meta-data containing the key and key signature.
Additional notes:
1. The pipelined version of the bucket search algorithm is executed only if there are at least 5 packets in the burst of input packets. If there are fewer than 5 packets in the burst of input packets, a non-optimized implementation of the bucket search algorithm is executed.
2. For extendable bucket hash tables only, once the pipelined version of the bucket search algorithm has been executed for all the packets in the burst of input packets, the non-optimized implementation of the bucket search algorithm is also executed for any packets that did not produce a lookup hit, but have their bucket in the extended state. As a result of executing the non-optimized version, some of these packets may produce a lookup hit or lookup miss. This does not impact the performance of the key lookup operation, as the probability of having the bucket in the extended state is relatively small.
Reserved Actions
The reserved actions are handled directly by the Packet Framework without the user being able to change
their meaning through the table action handler configuration. A special category of the reserved actions
is represented by the next hop actions, which regulate the packet flow between input ports, tables and
output ports through the pipeline. Table 45.16 lists the next hop actions.
User Actions
For each table, the meaning of user actions is defined through the configuration of the table action
handler. Different tables can be configured with different action handlers, therefore the meaning of
the user actions and their associated meta-data is private to each table. Within the same table, all the
table entries (including the table default entry) share the same definition for the user actions and their
associated meta-data, with each table entry having its own set of enabled user actions and its own copy
of the action meta-data. Table 45.17 contains a non-exhaustive list of user action examples.
even when the semaphore is free. The cost of atomic instructions is normally higher than the cost
of regular instructions.
2. Multiple writer threads, with a single thread performing table lookup operations and multiple threads performing table entry add/delete operations. The threads performing table entry add/delete operations send table update requests to the reader (typically through message passing queues), which does the actual table updates and then sends the response back to the request initiator.
3. Single writer thread performing table entry add/delete operations and multiple reader threads that perform table lookup operations with read-only access to the table entries. The reader threads use the main table copy while the writer is updating the mirror copy. Once the writer update is done, the writer can signal to the readers and busy wait until all readers have swapped to the mirror copy (which now becomes the main copy), with the former main copy becoming the new mirror copy.
FORTYSIX
VHOST LIBRARY
The vhost library implements a user space virtio net server allowing the user to manipulate the virtio ring directly. In other words, it allows the user to fetch/put packets from/to the VM virtio net device. To achieve this, a vhost library should be able to:
• Access the guest memory:
For QEMU, this is done by using the -object memory-backend-file,share=on,... option, which means QEMU will create a file to serve as the guest RAM. The share=on option allows another process to map that file, which means it can access the guest RAM.
• Know all the necessary information about the vring:
Information such as where the available ring is stored. Vhost defines some messages (passed through a Unix domain socket file) to tell the backend all the information it needs to manipulate the vring.
Programmer’s Guide, Release 19.11.10
There are some truths (including limitations) you might want to know while setting this flag:
* Zero copy is not good for small packets (typically for packet sizes below 512 bytes).
* Zero copy is really good for the VM2VM case. For iperf between two VMs, the boost could be above 70% (when TSO is enabled).
* For zero copy in the VM2NIC case, the guest Tx used vring may be starved if the PMD driver consumes mbufs but does not release them in a timely manner.
For example, the i40e driver has an optimization to maximize NIC pipeline utilization, which postpones returning transmitted mbufs until only tx_free_threshold free descriptors are left. The virtio Tx used ring will be starved if the formula (num_i40e_tx_desc - num_virtio_tx_desc > tx_free_threshold) is true, since i40e will not return the mbufs.
A performance tip for tuning zero copy in the VM2NIC case is to adjust the frequency of mbuf freeing (i.e. adjust the tx_free_threshold of the i40e driver) to balance consumer and producer.
* Guest memory should be backed by huge pages to achieve better performance. Using a 1G page size is best.
When dequeue zero copy is enabled, the mapping between guest physical addresses and host physical addresses has to be established. Using non-huge pages means far more page segments. To keep it simple, DPDK vhost does a linear search of those segments, so the fewer the segments, the quicker the mapping is found. NOTE: this may be sped up by using tree search in the future.
* Currently, zero copy cannot work when using vfio-pci with IOMMU mode, because the IOMMU DMA mapping is not set up for guest memory. If you have to use the vfio-pci driver, please insert the vfio-pci kernel module in noiommu mode.
* The consumer of zero copy mbufs should consume these mbufs as soon as possible, otherwise it may block the operations in vhost.
– RTE_VHOST_USER_IOMMU_SUPPORT
IOMMU support will be enabled when this flag is set. It is disabled by default.
Enabling this flag makes it possible to use guest vIOMMU to protect vhost from accessing memory the virtio device isn’t allowed to, when the feature is negotiated and an IOMMU device is declared.
However, this feature enables vhost-user’s reply-ack protocol feature, whose implementation is buggy in QEMU v2.7.0-v2.9.0 when doing multiqueue. Enabling this flag with these QEMU versions results in QEMU being blocked when multiple queue pairs are declared.
– RTE_VHOST_USER_POSTCOPY_SUPPORT
Postcopy live-migration support will be enabled when this flag is set. It is disabled by default.
Enabling this flag should only be done when the calling application does not pre-fault the
guest shared memory, otherwise migration would fail.
– RTE_VHOST_USER_LINEARBUF_SUPPORT
Enabling this flag forces vhost dequeue function to only provide linear pktmbuf (no multi-
segmented pktmbuf).
The vhost library by default provides a single pktmbuf for a given packet, but if for some reason the data doesn’t fit into a single pktmbuf (e.g., TSO is enabled), the library will allocate additional pktmbufs from the same mempool and chain them together to create a multi-segmented pktmbuf.
However, the vhost application needs to support the multi-segmented format. If the vhost application does not support that format and requires large buffers to be dequeued, this flag should be enabled to force only linear buffers (see RTE_VHOST_USER_EXTBUF_SUPPORT) or drop the packet.
It is disabled by default.
– RTE_VHOST_USER_EXTBUF_SUPPORT
Enabling this flag allows the vhost dequeue function to allocate and attach an external buffer to a pktmbuf if the pktmbuf doesn’t provide enough space to store all data.
This is useful when the vhost application wants to support large packets but doesn’t want to
increase the default mempool object size nor to support multi-segmented mbufs (non-linear).
In this case, a fresh buffer is allocated using rte_malloc() which gets attached to a pktmbuf
using rte_pktmbuf_attach_extbuf().
See RTE_VHOST_USER_LINEARBUF_SUPPORT as well to disable multi-segmented mbufs for applications that don’t support chained mbufs.
It is disabled by default.
• rte_vhost_driver_set_features(path,features)
This function sets the feature bits the vhost-user driver supports. The vhost-user driver could be
vhost-user net, yet it could be something else, say, vhost-user SCSI.
• rte_vhost_driver_callback_register(path,vhost_device_ops)
This function registers a set of callbacks, to let DPDK applications take the appropriate action
when some events happen. The following events are currently supported:
– new_device(int vid)
This callback is invoked when a virtio device becomes ready. vid is the vhost device ID.
– destroy_device(int vid)
This callback is invoked when a virtio device is paused or shut down.
– vring_state_changed(int vid,uint16_t queue_id,int enable)
This callback is invoked when a specific queue’s state is changed, for example enabled or disabled.
– features_changed(int vid,uint64_t features)
This callback is invoked when the feature set is changed. For example, VHOST_F_LOG_ALL will be set/cleared at the start/end of live migration, respectively.
– new_connection(int vid)
This callback is invoked on a new vhost-user socket connection. If DPDK acts as the server, the device should not be deleted before the destroy_connection callback is received.
– destroy_connection(int vid)
This callback is invoked when the vhost-user socket connection is closed. It indicates that the device with id vid is no longer in use and can be safely deleted.
• rte_vhost_driver_disable/enable_features(path,features)
This function disables/enables some features. For example, it can be used to disable mergeable
buffers and TSO features, which both are enabled by default.
• rte_vhost_driver_start(path)
This function triggers the vhost-user negotiation. It should be invoked at the end of initializing a
vhost-user driver.
• rte_vhost_enqueue_burst(vid,queue_id,pkts,count)
Transmits (enqueues) count packets from host to guest.
• rte_vhost_dequeue_burst(vid,queue_id,mbuf_pool,pkts,count)
Receives (dequeues) count packets from the guest, and stores them in pkts.
• rte_vhost_crypto_create(vid,cryptodev_id,sess_mempool,socket_id)
As an extension of new_device(), this function adds virtio-crypto workload acceleration capability to the device. All crypto workload is processed by the DPDK cryptodev with the device ID of cryptodev_id.
• rte_vhost_crypto_free(vid)
Frees the memory and vhost-user message handlers created in rte_vhost_crypto_create().
• rte_vhost_crypto_fetch_requests(vid,queue_id,ops,nb_ops)
Receives (dequeues) nb_ops virtio-crypto requests from guest, parses them to DPDK Crypto
Operations, and fills the ops with parsing results.
• rte_vhost_crypto_finalize_requests(queue_id,ops,nb_ops)
After the ops are dequeued from Cryptodev, finalizes the jobs and notifies the guest(s).
• rte_vhost_crypto_set_zero_copy(vid,option)
Enable or disable zero copy feature of the vhost crypto backend.
Note:
No matter which mode is used, once a connection is established, DPDK vhost-user will start receiving
and processing vhost messages from QEMU.
For messages with a file descriptor, the file descriptor can be used directly in the vhost process as it is
already installed by the Unix domain socket.
The supported vhost messages are:
• VHOST_SET_MEM_TABLE
• VHOST_SET_VRING_KICK
• VHOST_SET_VRING_CALL
• VHOST_SET_LOG_FD
• VHOST_SET_VRING_ERR
For VHOST_SET_MEM_TABLE message, QEMU will send information for each memory region and
its file descriptor in the ancillary data of the message. The file descriptor is used to map that region.
VHOST_SET_VRING_KICK is used as the signal to put the vhost device into the data plane, and
VHOST_GET_VRING_BASE is used as the signal to remove the vhost device from the data plane.
When the socket connection is closed, vhost will destroy the device.
FORTYSEVEN
METRICS LIBRARY
The Metrics library implements a mechanism by which producers can publish numeric information for
later querying by consumers. In practice producers will typically be other libraries or primary processes,
whereas consumers will typically be applications.
Metrics themselves are statistics that are not generated by PMDs. Metric information is populated using
a push model, where producers update the values contained within the metric library by calling an update
function on the relevant metrics. Consumers receive metric information by querying the central metric
data, which is held in shared memory.
For each metric, a separate value is maintained for each port id, and when publishing metric val-
ues the producers need to specify which port is being updated. In addition there is a special id
RTE_METRICS_GLOBAL that is intended for global statistics that are not associated with any indi-
vidual device. Since the metrics library is self-contained, the only restriction on port numbers is that
they are less than RTE_MAX_ETHPORTS - there is no requirement for the ports to actually exist.
This function must be called from a primary process, but otherwise producers and consumers can be in
either primary or secondary processes.
If the return value is negative, it means registration failed. Otherwise the return value is the key for the
metric, which is used when updating values. A table mapping together these key values and the metrics’
names can be obtained using rte_metrics_get_names().
If the metrics were registered as a single set, they can either be updated individually using rte_metrics_update_value(), or updated together using the rte_metrics_update_values() function:
rte_metrics_update_value(port_id, id_set, values[0]);
rte_metrics_update_value(port_id, id_set + 1, values[1]);
rte_metrics_update_value(port_id, id_set + 2, values[2]);
rte_metrics_update_value(port_id, id_set + 3, values[3]);
Note that rte_metrics_update_values() cannot be used to update metric values from multiple
sets, as there is no guarantee two sets registered one after the other have contiguous id values.
}

    ret = rte_metrics_get_values(port_id, metrics, len);
    if (ret < 0 || ret > len) {
        printf("Cannot get metrics values\n");
        free(metrics);
        free(names);
        return;
    }

    printf("Metrics for port %i:\n", port_id);
    for (i = 0; i < len; i++)
        printf("  %s: %"PRIu64"\n",
               names[metrics[i].key].name, metrics[i].value);
    free(metrics);
    free(names);
}
If the return value is negative, it means deinitialization failed. This function must be called from a
primary process.
47.6.1 Initialization
Before the library can be used, it has to be initialised by calling rte_stats_bitrate_create(),
which will return a bit-rate calculation object. Since the bit-rate library uses the metrics library to report
the calculated statistics, the bit-rate library then needs to register the calculated statistics with the metrics
library. This is done using the helper function rte_stats_bitrate_reg().
struct rte_stats_bitrates *bitrate_data;

bitrate_data = rte_stats_bitrate_create();
if (bitrate_data == NULL)
    rte_exit(EXIT_FAILURE, "Could not allocate bit-rate data.\n");
rte_stats_bitrate_reg(bitrate_data);

while (1) {
    /* ... */
    tics_current = rte_rdtsc();
    if (tics_current - tics_datum >= tics_per_1sec) {
        /* Periodic bitrate calculation */
        for (idx_port = 0; idx_port < cnt_ports; idx_port++)
            rte_stats_bitrate_calc(bitrate_data, idx_port);
        tics_datum = tics_current;
    }
    /* ... */
}
47.7.1 Initialization
Before the library can be used, it has to be initialised by calling rte_latencystats_init().
lcoreid_t latencystats_lcore_id = -1;
rte_latencystats_uninit();
FORTYEIGHT
BERKELEY PACKET FILTER LIBRARY
The DPDK provides a BPF library that gives the ability to load and execute Enhanced Berkeley Packet Filter (eBPF) bytecode within a user-space DPDK application.
It supports a basic set of features from the eBPF spec. Please refer to the eBPF spec <https://www.kernel.org/doc/Documentation/networking/filter.txt> for more information. It also introduces a basic framework to load/unload BPF-based filters on eth devices (right now only via SW RX/TX callbacks).
The library API provides the following basic operations:
• Create a new BPF execution context and load user provided eBPF code into it.
• Destroy an BPF execution context and its runtime structures and free the associated memory.
• Execute eBPF bytecode associated with provided input parameter.
• Provide information about natively compiled code for given BPF context.
• Load BPF program from the ELF file and install callback to execute it on given ethdev port/queue.
CHAPTER
FORTYNINE
IPSEC PACKET PROCESSING LIBRARY
DPDK provides a library for IPsec data-path processing. The library utilizes the existing DPDK cryptodev and security API to provide the application with a transparent and high-performance IPsec packet processing API. The library concentrates on data-path protocol processing (ESP and AH); IKE protocol(s) implementation is out of scope for this library.
For packets destined for inline processing, no extra overhead is required and the synchronous API call rte_ipsec_pkt_process() is sufficient for that case.
Note: For more details about the IPsec API, please refer to the DPDK API Reference.
The current implementation supports all four currently defined rte_security types:
49.1.1 RTE_SECURITY_ACTION_TYPE_NONE
In this mode, the library functions perform the following operations:
• for inbound packets:
– check SQN
– prepare rte_crypto_op structure for each input packet
– verify that integrity check and decryption performed by crypto device completed success-
fully
– check padding data
– remove outer IP header (tunnel mode) / update IP header (transport mode)
– remove ESP header and trailer, padding, IV and ICV data
– update SA replay window
• for outbound packets:
– generate SQN and IV
– add outer IP header (tunnel mode) / update IP header (transport mode)
– add ESP header and trailer, padding and IV data
– prepare rte_crypto_op structure for each input packet
– verify that crypto device operations (encryption, ICV generation) were completed success-
fully
49.1.2 RTE_SECURITY_ACTION_TYPE_INLINE_CRYPTO
In this mode, the library functions perform the following operations:
• for inbound packets:
– verify that integrity check and decryption performed by rte_security device completed suc-
cessfully
– check SQN
– check padding data
– remove outer IP header (tunnel mode) / update IP header (transport mode)
– remove ESP header and trailer, padding, IV and ICV data
– update SA replay window
• for outbound packets:
– generate SQN and IV
– add outer IP header (tunnel mode) / update IP header (transport mode)
– add ESP header and trailer, padding and IV data
– update ol_flags inside struct rte_mbuf to indicate that inline-crypto processing has to be
performed by HW on this packet
– invoke rte_security device specific set_pkt_metadata() to associate security device specific
data with the packet
49.1.3 RTE_SECURITY_ACTION_TYPE_INLINE_PROTOCOL
In this mode, the library functions perform the following operations:
• for inbound packets:
– verify that integrity check and decryption performed by rte_security device completed suc-
cessfully
• for outbound packets:
– update ol_flags inside struct rte_mbuf to indicate that inline-crypto processing has to be
performed by HW on this packet
– invoke rte_security device specific set_pkt_metadata() to associate security device specific
data with the packet
49.1.4 RTE_SECURITY_ACTION_TYPE_LOOKASIDE_PROTOCOL
In this mode, the library functions perform the following operations:
• for inbound packets:
– prepare rte_crypto_op structure for each input packet
– verify that integrity check and decryption performed by crypto device completed success-
fully
• for outbound packets:
– prepare rte_crypto_op structure for each input packet
– verify that crypto device operations (encryption, ICV generation) were completed success-
fully
To accommodate future custom implementations, a function pointer model is used for both the crypto_prepare and process implementations.
49.2.1 Create/destroy
The librte_ipsec SAD implementation provides the ability to create/destroy SAD tables.
To create a SAD table, the user has to specify how many entries of each key type are required and the IP protocol type (IPv4/IPv6). As an example:
conf.socket_id = -1;
conf.max_sa[RTE_IPSEC_SAD_SPI_ONLY] = some_nb_rules_spi_only;
conf.max_sa[RTE_IPSEC_SAD_SPI_DIP] = some_nb_rules_spi_dip;
conf.max_sa[RTE_IPSEC_SAD_SPI_DIP_SIP] = some_nb_rules_spi_dip_sip;
conf.flags = RTE_IPSEC_SAD_FLAG_RW_CONCURRENCY;
Note: For more information, please refer to the IPsec library API reference.
As an example, to add a new entry into the SAD for IPv4 addresses:
struct rte_ipsec_sa *sa;
union rte_ipsec_sad_key key;

key.v4.spi = rte_cpu_to_be_32(spi_val);
if (key_type >= RTE_IPSEC_SAD_SPI_DIP)     /* DIP is optional */
    key.v4.dip = rte_cpu_to_be_32(dip_val);
if (key_type == RTE_IPSEC_SAD_SPI_DIP_SIP) /* SIP is optional */
    key.v4.sip = rte_cpu_to_be_32(sip_val);
Note: For performance reasons, it is better to keep the SPI/DIP/SIP in network byte order to eliminate byte swaps on lookup.
key.v4.spi = rte_cpu_to_be_32(necessary_spi);
if (key_type >= RTE_IPSEC_SAD_SPI_DIP)     /* DIP is optional */
    key.v4.dip = rte_cpu_to_be_32(necessary_dip);
if (key_type == RTE_IPSEC_SAD_SPI_DIP_SIP) /* SIP is optional */
    key.v4.sip = rte_cpu_to_be_32(necessary_sip);
49.2.3 Lookup
The library provides lookup using the {SPI, DIP, SIP} tuple of an inbound IPsec packet as a key.
49.4 Limitations
The following features are not properly supported in the current version:
• Hard/soft limit for SA lifetime (time interval/byte count).
Part 2: Development Environment
FIFTY
SOURCE ORGANIZATION
Note: In the following descriptions, RTE_SDK is the environment variable that points to the base
directory into which the tarball was extracted. See Useful Variables Provided by the Build System for
descriptions of other variables.
Makefiles that are provided by the DPDK libraries and applications are located in $(RTE_SDK)/mk.
Config templates are located in $(RTE_SDK)/config. The templates describe the options that are
enabled for each target. The config file also contains items that can be enabled and disabled for many
of the DPDK libraries, including debug options. The user should look at the config file and become
familiar with these options. The config file is also used to create a header file, which will be located in
the new build directory.
50.2 Libraries
Libraries are located in subdirectories of $(RTE_SDK)/lib. By convention a library refers to any
code that provides an API to an application. Typically, it generates an archive file (.a), but a kernel
module would also go in the same directory.
50.3 Drivers
Drivers are special libraries which provide poll-mode driver implementations for devices: either hard-
ware devices or pseudo/virtual devices. They are contained in the drivers subdirectory, classified by
type, and each compiles to a library with the format librte_pmd_X.a where X is the driver name.
Note: Several of the driver/net directories contain a base sub-directory. The base directory generally contains code that shouldn’t be modified directly by the user. Any enhancements should be done via the X_osdep.c and/or X_osdep.h files in that directory. Refer to the local README in the base directories for driver specific instructions.
50.4 Applications
Applications are source files that contain a main() function. They are located in the
$(RTE_SDK)/app and $(RTE_SDK)/examples directories.
The app directory contains sample applications that are used to test DPDK (such as autotests) or the Poll
Mode Drivers (test-pmd).
The examples directory contains sample applications that show how libraries can be used.
FIFTYONE
DEVELOPMENT KIT BUILD SYSTEM
The DPDK requires a build system for compilation activities and so on. This section describes the
constraints and the mechanisms used in the DPDK framework.
There are two use cases for the framework:
• Compilation of the DPDK libraries and sample applications; the framework generates specific
binary libraries, include files and sample applications
• Compilation of an external application or library, using an installed binary DPDK
This creates a new my_sdk_build_dir directory. After that, we can compile by doing:
cd my_sdk_build_dir
make
Refer to Development Kit Root Makefile Help for details about make commands that can be used from
the root of DPDK.
export RTE_SDK=/opt/DPDK
export RTE_TARGET=x86_64-native-linux-gcc
cd /path/to/my_app
For a new application, the user must create their own Makefile that includes some .mk files, such as ${RTE_SDK}/mk/rte.vars.mk and ${RTE_SDK}/mk/rte.app.mk. This is described in Building Your Own Application.
Depending on the chosen target (architecture, machine, executive environment, toolchain) defined in the Makefile or as an environment variable, the applications and libraries will compile using the appropriate .h files and will link with the appropriate .a files. These files are located in ${RTE_SDK}/arch-machine-execenv-toolchain, which is referenced internally by ${RTE_BIN_SDK}.
To compile their application, the user just has to call make. The compilation result will be located in
/path/to/my_app/build directory.
Sample applications are provided in the examples directory.
# binary name
APP = helloworld
CFLAGS += -O3
CFLAGS += $(WERROR_FLAGS)
include $(RTE_SDK)/mk/rte.extapp.mk
Application
These Makefiles generate a binary application.
Library
Generate a .a library.
• rte.lib.mk: Library in the development kit framework
• rte.extlib.mk: external library
• rte.hostlib.mk: host library in the development kit framework
Install
• rte.install.mk: Does not build anything; it is only used to create links or copy files to the installation directory. This is useful for including files in the development kit framework.
Kernel Module
• rte.module.mk: Build a kernel module in the development kit framework.
Objects
• rte.obj.mk: Object aggregation (merge several .o in one) in the development kit framework.
• rte.extobj.mk: Object aggregation (merge several .o in one) outside the development kit frame-
work.
Misc
• rte.gnuconfigure.mk: Build an application that is configure-based.
• rte.subdir.mk: Build several directories in the development kit framework.
which dpdk-pmdinfogen scans for. Using this information, other relevant bits of data can be exported from the object file and used to produce a hardware support description, which dpdk-pmdinfogen then encodes into a JSON formatted string in the following format:
static char <name_pmd_string>="PMD_INFO_STRING=\"{'name' : '<name>', ...}\"";
These strings can then be searched for by external tools to determine the hardware support of a given
library or application.
51.3.6 Variables that can be Set/Overridden by the User on the Command Line Only
Some variables can be used to configure the build system behavior. They are documented in Develop-
ment Kit Root Makefile Help and External Application/Library Makefile Help
• WERROR_CFLAGS: By default, this is set to a specific value that depends on the compiler. Users
are encouraged to use this variable as follows:
CFLAGS += $(WERROR_CFLAGS)
This avoids the use of different cases depending on the compiler (icc or gcc). Also, this variable can be
overridden from the command line, which allows bypassing of the flags for testing purposes.
FIFTYTWO
DEVELOPMENT KIT ROOT MAKEFILE HELP
The DPDK provides a root level Makefile with targets for configuration, building, cleaning, testing,
installation and others. These targets are explained in the following sections.
• clean
Clean all objects created using make build.
Example:
make clean O=mybuild
• %_sub
Build a subdirectory only, without managing dependencies on other directories.
Example:
make lib/librte_eal_sub O=mybuild
• %_clean
Clean a subdirectory only.
Example:
make lib/librte_eal_clean O=mybuild
is equivalent to:
cd $(RTE_SDK)
make config O=mybuild T=x86_64-native-linux-gcc
cd mybuild
FIFTYTHREE
EXTENDING THE DPDK
This chapter describes how a developer can extend the DPDK to provide a new library or support a new target.
Declaration is in foo.h:
extern void foo(void);
4. Update lib/Makefile:
vi ${RTE_SDK}/lib/Makefile
# add:
# DIRS-$(CONFIG_RTE_LIBFOO) += libfoo
5. Create a new Makefile for this library, for example, derived from mempool Makefile:
cp ${RTE_SDK}/lib/librte_mempool/Makefile ${RTE_SDK}/lib/libfoo/
vi ${RTE_SDK}/lib/libfoo/Makefile
# replace:
# librte_mempool -> libfoo
# rte_mempool -> foo
6. Update mk/DPDK.app.mk, and add -lfoo in LDLIBS variable when the option is enabled. This
will automatically add this flag when linking a DPDK application.
7. Build the DPDK with the new library (we only show a specific target here):
cd ${RTE_SDK}
make config T=x86_64-native-linux-gcc
make
FIFTYFOUR
BUILDING YOUR OWN APPLICATION
FIFTYFIVE
EXTERNAL APPLICATION/LIBRARY MAKEFILE HELP
External applications or libraries should include specific Makefiles from the RTE_SDK, located in the mk directory. These Makefiles are:
• ${RTE_SDK}/mk/rte.extapp.mk: Build an application
• ${RTE_SDK}/mk/rte.extlib.mk: Build a static library
• ${RTE_SDK}/mk/rte.extobj.mk: Build objects (.o)
55.1 Prerequisites
The following variables must be defined:
• ${RTE_SDK}: Points to the root directory of the DPDK.
• ${RTE_TARGET}: References the target to be used for compilation (for example, x86_64-native-linux-gcc).
• clean
Clean all objects created using make build.
Example:
make clean O=mybuild
FIFTYSIX
PERFORMANCE OPTIMIZATION GUIDELINES
56.1 Introduction
The following sections describe optimizations used in DPDK, as well as optimizations that should be considered for new applications.
They also highlight the performance-impacting coding techniques that should, and should not, be used when developing an application using the DPDK.
And finally, they give an introduction to application profiling using a Performance Analyzer from Intel
to optimize the software.
CHAPTER
FIFTYSEVEN
WRITING EFFICIENT CODE
This chapter provides some tips for developing efficient code using the DPDK. For additional and more general information, please refer to the Intel® 64 and IA-32 Architectures Optimization Reference Manual, which is a valuable reference for writing efficient code.
57.1 Memory
This section describes some key memory considerations when developing applications in the DPDK
environment.
• Use RTE_PER_LCORE variables. Note that in this case, data on lcore X is not available to lcore
Y.
• Use a table of structures (one per lcore). In this case, each structure must be cache-aligned.
Read-mostly variables can be shared among lcores without performance losses if there are no RW vari-
ables in the same cache line.
57.1.4 NUMA
On a NUMA system, it is preferable to access local memory since remote memory access is slower.
In the DPDK, the memzone, ring, rte_malloc and mempool APIs provide a way to create a pool on a
specific socket.
Sometimes, it can be a good idea to duplicate data to optimize speed. For read-mostly variables that are
often accessed, it should not be a problem to keep them in one socket only, since data will be present in
cache.
while (1) {
    /* Process as many elements as can be dequeued. */
    count = rte_ring_dequeue_burst(ring, obj_table, MAX_BULK, NULL);
    if (unlikely(count == 0))
        continue;

    my_process_bulk(obj_table, count);
}
FIFTYEIGHT
LINK TIME OPTIMIZATION
The DPDK supports compilation with link time optimization turned on. This obviously depends on the ability of the compiler to do “whole program” optimization at link time, and is available only for compilers that support that feature. To be more specific, the compiler (in addition to performing LTO) has to support the creation of ELF objects containing both normal code and internal representation (called fat-lto-objects in gcc and icc). This is required since, during the build, some code is generated by parsing the produced ELF objects (pmdinfogen).
The amount of performance gain that one can get from LTO depends on the compiler and the code that
is being compiled. However, LTO is also useful for the additional code analysis done by the compiler. In
particular, thanks to interprocedural analysis, the compiler can produce additional warnings about variables
that might be used uninitialized. Some of these warnings might be “false positives”, though, and you might
need to explicitly initialize the variable in order to silence the compiler.
Please note that turning LTO on considerably increases build time.
When using the make-based build, link time optimization can be enabled for the whole DPDK by setting:
CONFIG_RTE_ENABLE_LTO=y
in the config file.
For the meson-based build, it can be enabled by setting the meson built-in ‘b_lto’ option:
meson build -Db_lto=true
CHAPTER
FIFTYNINE
PROFILE YOUR APPLICATION
The following sections describe methods of profiling DPDK applications on different architectures.
An alternative method to enable rte_rdtsc() for a high resolution wall clock counter is through the
ARMv8 PMU subsystem. The PMU cycle counter runs at the CPU frequency. However, access to the PMU
cycle counter from user space is not enabled by default in the arm64 Linux kernel. It is possible to enable
the cycle counter for user space access by configuring the PMU from privileged mode (kernel space).
By default, the rte_rdtsc() implementation uses a portable cntvct_el0 scheme. The application can
choose the PMU-based implementation with CONFIG_RTE_ARM_EAL_RDTSC_USE_PMU.
The example below shows the steps to configure the PMU based cycle counter on an ARMv8 machine.
git clone https://github.com/jerinjacobk/armv8_pmu_cycle_counter_el0
cd armv8_pmu_cycle_counter_el0
make
sudo insmod pmu_el0_cycle_counter.ko
cd $DPDK_DIR
make config T=arm64-armv8a-linux-gcc
echo "CONFIG_RTE_ARM_EAL_RDTSC_USE_PMU=y" >> build/.config
make
Warning: The PMU based scheme is useful for high accuracy performance profiling with
rte_rdtsc(). However, this method cannot be used in conjunction with Linux userspace profiling
tools like perf, as this scheme alters the PMU registers state.
CHAPTER
SIXTY
GLOSSARY
HPET High Precision Event Timer; a hardware timer that provides a precise time reference on x86
platforms.
ID Identifier
IOCTL Input/Output Control
I/O Input/Output
IP Internet Protocol
IPv4 Internet Protocol version 4
IPv6 Internet Protocol version 6
lcore A logical execution unit of the processor, sometimes called a hardware thread.
KNI Kernel Network Interface
L1 Layer 1
L2 Layer 2
L3 Layer 3
L4 Layer 4
LAN Local Area Network
LPM Longest Prefix Match
master lcore The execution unit that executes the main() function and that launches other lcores.
mbuf An mbuf is a data structure used internally to carry messages (mainly network packets). The
name is derived from BSD stacks. To understand the concepts of packet buffers or mbuf, refer to
TCP/IP Illustrated, Volume 2: The Implementation.
MESI Modified Exclusive Shared Invalid (CPU cache coherency protocol)
MTU Maximum Transmission Unit
NIC Network Interface Card
OOO Out Of Order (execution of instructions within the CPU pipeline)
NUMA Non-uniform Memory Access
PCI Peripheral Component Interconnect
PHY An abbreviation for the physical layer of the OSI model.
pktmbuf An mbuf carrying a network packet.
PMD Poll Mode Driver
QoS Quality of Service
RCU Read-Copy-Update algorithm, an alternative to simple rwlocks.
Rd Read
RED Random Early Detection
RSS Receive Side Scaling
RTE Run Time Environment. A simple framework for fast packet processing, running in a
lightweight environment as a Linux* application and using Poll Mode Drivers (PMDs) to increase
speed.
Rx Reception
Slave lcore Any lcore that is not the master lcore.
Socket A physical CPU that includes several cores.
SLA Service Level Agreement
srTCM Single Rate Three Color Marking
SRTD Scheduler Round Trip Delay
SW Software
Target In the DPDK, the target is a combination of architecture, machine, executive environment and
toolchain. For example: i686-native-linux-gcc.
TCP Transmission Control Protocol
TC Traffic Class
TLB Translation Lookaside Buffer
TLS Thread Local Storage
trTCM Two Rate Three Color Marking
TSC Time Stamp Counter
Tx Transmission
TUN/TAP TUN and TAP are virtual network kernel devices.
VLAN Virtual Local Area Network
Wr Write
WRED Weighted Random Early Detection
WRR Weighted Round Robin