Learning BPF: Offloading DPDK AF_XDP Traffic with XDP

Posted on February 18, 2026 by Denys Haryachyy

When you run a DPDK application with the AF_XDP poll-mode driver, every packet destined for your application travels from the NIC through the kernel’s XDP hook and into userspace via an AF_XDP socket. That includes packets your application could forward without ever touching userspace — if only the kernel knew how.

In this post I describe how to write a BPF program that sits in front of DPDK’s AF_XDP datapath and offloads packet rewriting to the kernel. Matched flows get their IP/UDP headers rewritten and forwarded at XDP speed, while unmatched traffic passes through to DPDK as usual. The result: the kernel handles the fast path, and DPDK handles the control path.

I’ll use a UDP proxy as the running example — an application that bridges traffic between clients and servers by rewriting IP/UDP headers on every packet. The same technique applies to any DPDK AF_XDP application that forwards known flows with predictable header transformations.

The Problem

Consider a proxy that sits between clients and backend servers. For every active session, it rewrites IP addresses and UDP ports on both legs — essentially a specialized NAT. In a pure DPDK setup, every single packet flows through userspace:

For a simple header rewrite on a known flow, that round trip through userspace is unnecessary overhead. The kernel’s XDP hook can do the same rewrite at line rate, before the packet ever reaches the AF_XDP socket.

The goal is a hybrid datapath where:

Known flows (established sessions) get rewritten and forwarded entirely in XDP
Unknown flows (new connections, control traffic) pass through to DPDK for full processing
Non-matching traffic (SSH, management) goes to the kernel stack as usual

Architecture Overview

The XDP program runs at the earliest point in the kernel’s receive path, before the normal networking stack and before AF_XDP delivery. It makes a per-packet decision:

Three XDP actions do the heavy lifting:

XDP_PASS — hand the packet to the kernel networking stack (for management traffic not destined to our application)
XDP_TX — transmit back out the same interface the packet arrived on
XDP_REDIRECT — forward to a different interface, or into an AF_XDP socket for DPDK

BPF Maps: The Shared State

BPF maps are the communication channel between the XDP program (kernel) and the DPDK application (userspace). We define four maps, each with a distinct role.

Rules Map — The Forwarding Table

The core data structure is a hash map keyed by a 5-tuple (interface index + source/destination IP and port). The value contains the rewrite target and per-flow packet counters:

/* 5-tuple match key. All fields in network byte order. */
struct flow_key {
  __u32 ifindex;    /* ingress interface */
  __u32 srcip;
  __u32 dstip;
  __u16 srcport;
  __u16 dstport;
};

/* Rewrite target + counters. */
struct flow_value {
  struct flow_key rewrite;   /* new headers */
  __u64 packets;
  __u64 bytes;
};

The map definition uses LIBBPF_PIN_BY_NAME so it persists across program reloads and is accessible from userspace via /sys/fs/bpf/xdp_fwd_rules:

struct {
  __uint(type, BPF_MAP_TYPE_HASH);
  __uint(max_entries, 524288);
  __type(key, struct flow_key);
  __type(value, struct flow_value);
  __uint(pinning, LIBBPF_PIN_BY_NAME);
} xdp_fwd_rules SEC(".maps");

Each bidirectional flow creates two rules (ingress and egress), so the 524K entry limit supports up to 262K concurrent streams.

IP Allowlist — Steering Traffic to DPDK

Not every packet with a matching destination IP should be rewritten. New flows and control traffic need to reach DPDK for processing. The IP allowlist map controls which destination IPs get redirected to AF_XDP sockets when they don’t match a rewrite rule:

struct {
  __uint(type, BPF_MAP_TYPE_HASH);
  __uint(max_entries, 1024);
  __type(key, __u32);                /* IPv4 address */
  __type(value, struct ip_counter); /* packet counter */
  __uint(pinning, LIBBPF_PIN_BY_NAME);
} xdp_local_ips SEC(".maps");

The userspace application populates this map with all IP addresses it serves. If a packet’s destination IP is in this map, it gets redirected to DPDK via AF_XDP. If not, it passes to the kernel stack — ensuring SSH, DNS, and other management traffic is unaffected.

AF_XDP Socket Map

The xsks_map is an XSKMAP type that DPDK’s AF_XDP PMD manages automatically. It maps RX queue indices to AF_XDP socket file descriptors:

struct {
  __uint(type, BPF_MAP_TYPE_XSKMAP);
  __uint(max_entries, 256);
  __type(key, __u32);   /* RX queue index */
  __type(value, __u32); /* XSK file descriptor */
} xsks_map SEC(".maps");

We never write to this map from our application — DPDK populates it when it opens AF_XDP sockets during port initialization.

Per-CPU Stats

A PERCPU_ARRAY map tracks datapath counters without any locking:

struct {
  __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
  __uint(max_entries, XDP_STAT_MAX);
  __type(key, __u32);
  __type(value, __u64);
  __uint(pinning, LIBBPF_PIN_BY_NAME);
} xdp_counters SEC(".maps");

Each CPU gets its own copy of the counters. Userspace sums across all CPUs when reading. The counter indices track each decision point in the datapath:

enum xdp_stat {
  XDP_STAT_REWRITE_HIT,    /* rule matched, packet rewritten */
  XDP_STAT_REWRITE_MISS,   /* no rule for this UDP flow */
  XDP_STAT_FIB_DROP,       /* FIB lookup failed */
  XDP_STAT_HAIRPIN_TX,     /* forwarded on same interface */
  XDP_STAT_REDIRECT_OUT,   /* forwarded to different interface */
  XDP_STAT_AFXDP_REDIRECT, /* sent to DPDK via AF_XDP */
  XDP_STAT_PASS_TO_KERNEL, /* passed to kernel stack */
  XDP_STAT_MAX,
};

Sharing Structures Between Kernel and Userspace

One practical challenge: the BPF program uses kernel types (__u32, __u64), while the userspace C application uses standard types (uint32_t, uint64_t). A shared header bridges both worlds with conditional typedefs:

#ifdef __BPF__
#include <linux/types.h>
typedef __u32 fwd_u32;
typedef __u64 fwd_u64;
#else
#include <stdint.h>
typedef uint32_t fwd_u32;
typedef uint64_t fwd_u64;
#endif

The BPF Makefile passes -D__BPF__ when compiling the kernel program, while the userspace build gets the standard types. Both see identical struct layouts — critical for correct map access from either side.

The XDP Program

Entry Point

The XDP entry function xdp_main() implements the decision tree from the architecture diagram. It parses headers, attempts a rewrite, and falls back to AF_XDP redirection:

SEC("xdp")
int xdp_main(struct xdp_md *ctx)
{
  void *data     = (void *)(long)ctx->data;
  void *data_end = (void *)(long)ctx->data_end;

  struct iphdr *iph = extract_ip_hdr(data, data_end);
  if (!iph)
    return XDP_PASS;

  if (iph->protocol == IPPROTO_UDP &&
      (void *)iph + iph->ihl * 4 + sizeof(struct udphdr) < data_end)
  {
    struct udphdr *udp = (void *)iph + iph->ihl * 4;

    if (lookup_and_rewrite(ctx->ingress_ifindex, iph, udp, data_end) != -1)
      return resolve_and_forward(ctx, data, iph);
  }

  return steer_to_dpdk(iph, ctx);
}

The logic is intentionally flat: parse, try rewrite, fall back. Non-IPv4 packets pass to the kernel immediately. Non-UDP packets (like ICMP) skip the rewrite attempt and go straight to the AF_XDP redirection check.

Packet Parsing

BPF programs must prove to the verifier that every pointer dereference is within packet bounds. The extract_ip_hdr() helper does this for Ethernet + IPv4:

static __always_inline struct iphdr *extract_ip_hdr(void *data, void *data_end)
{
  struct ethhdr *eth = data;
  if ((void *)(eth + 1) >= data_end)
    return NULL;

  if (eth->h_proto != bpf_htons(ETH_P_IP))
    return NULL;

  struct iphdr *iph = (void *)(eth + 1);
  if ((void *)(iph + 1) >= data_end)
    return NULL;

  return iph;
}

Each bounds check satisfies the verifier and also serves as a protocol filter — only IPv4 packets proceed.

Rewriting Headers

lookup_and_rewrite() is the core of the offload. It builds a 5-tuple key from the packet, looks up the rules map, and rewrites in place:

static __always_inline int lookup_and_rewrite(
  __u32 ifindex, struct iphdr *ip, struct udphdr *udp, void *data_end)
{
  struct flow_key key = {
    .ifindex = ifindex,
    .srcip   = ip->saddr,
    .dstip   = ip->daddr,
    .srcport = udp->source,
    .dstport = udp->dest,
  };

  struct flow_value *value =
    bpf_map_lookup_elem(&xdp_fwd_rules, &key);
  if (!value)
    return -1;

  /* Rewrite IP and UDP headers */
  ip->saddr    = value->rewrite.srcip;
  ip->daddr    = value->rewrite.dstip;
  udp->source  = value->rewrite.srcport;
  udp->dest    = value->rewrite.dstport;

  /* ... checksum fixup and counter updates ... */

  return value->rewrite.ifindex;
}

The function returns the target interface index on a hit (used by resolve_and_forward()), or -1 on a miss.

Incremental Checksum Update

After rewriting IP and UDP headers, the checksums must be recalculated. Rather than recomputing from scratch, we use bpf_csum_diff() for an incremental update — only the changed fields contribute to the new checksum:

/* L3 checksum: IP addresses changed */
__u32 l3_old[2] = { key.srcip, key.dstip };
__u32 l3_new[2] = { value->rewrite.srcip, value->rewrite.dstip };
__s64 l3_diff = bpf_csum_diff(l3_old, 8, l3_new, 8, 0);

/* L4 checksum: ports changed (only if UDP checksum is non-zero) */
if (udp->check != 0)
{
  __u32 l4_old = (key.dstport << 16) + key.srcport;
  __u32 l4_new = (value->rewrite.dstport << 16) + value->rewrite.srcport;

  __s64 l4_diff = bpf_csum_diff(&l4_old, 4, &l4_new, 4, l3_diff);
  udp->check = fold_csum((0xFFFF & ~udp->check) + l4_diff);
}

ip->check = fold_csum((0xFFFF & ~ip->check) + l3_diff);

The UDP checksum is optional for IPv4 (a zero value means “not computed”), so we skip it when the original checksum is zero. The fold_csum() folds a 64-bit intermediate back to 16 bits:

static __always_inline __u16 fold_csum(__u64 csum)
{
  int i;
#pragma unroll
  for (i = 0; i < 4; i++)
  {
    if (csum >> 16)
      csum = (csum & 0xffff) + (csum >> 16);
  }
  return ~csum;
}

FIB Forwarding

After rewriting, the packet has new destination IP addresses but stale MAC addresses. The bpf_fib_lookup() helper queries the kernel’s routing table (FIB) to resolve the next-hop MAC addresses:

static __always_inline int resolve_and_forward(
  struct xdp_md *ctx, void *data, struct iphdr *iph)
{
  struct bpf_fib_lookup s = { 0 };
  s.family      = 2;             /* AF_INET */
  s.tos         = iph->tos;
  s.l4_protocol = iph->protocol;
  s.tot_len     = bpf_ntohs(iph->tot_len);
  s.ifindex     = ctx->ingress_ifindex;
  s.ipv4_src    = iph->saddr;
  s.ipv4_dst    = iph->daddr;

  int ret = bpf_fib_lookup(ctx, &s, sizeof(s), 0);
  if (ret != 0)
    return XDP_DROP;

  struct ethhdr *eth = data;
  __builtin_memcpy(eth->h_dest, s.dmac, 6);
  __builtin_memcpy(eth->h_source, s.smac, 6);

  if (ctx->ingress_ifindex == s.ifindex)
    return XDP_TX;                /* hairpin: same interface */

  return bpf_redirect(s.ifindex, 0);  /* cross-interface forward */
}

One important detail: the flags parameter to bpf_fib_lookup() is set to 0 (not BPF_FIB_LOOKUP_OUTPUT). With BPF_FIB_LOOKUP_OUTPUT, the kernel constrains the lookup to the input interface, which prevents cross-interface forwarding. Without that flag, the FIB lookup can resolve routes through any interface — essential when ingress and egress use different NICs.

AF_XDP Redirection

When a packet doesn’t match a rewrite rule, it may still be destined for our application. The steer_to_dpdk() function checks the IP allowlist and redirects matching traffic to DPDK:

static __always_inline int steer_to_dpdk(
  struct iphdr *iph, struct xdp_md *ctx)
{
  struct ip_counter *val =
    bpf_map_lookup_elem(&xdp_local_ips, &iph->daddr);

  if (!val)
    return XDP_PASS;   /* not our IP, let the kernel handle it */

  __sync_fetch_and_add(&val->packets, 1);
  return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}

bpf_redirect_map() sends the packet to the AF_XDP socket for the appropriate RX queue. The third argument (XDP_PASS) is the fallback action if the queue has no socket — in that case the packet goes to the kernel stack.

Building the BPF Program

The BPF object file is compiled with clang targeting the BPF backend:

xdp_fwd.o: xdp_fwd.c xdp_maps.h xdp_common.h xdp_shared.h
    clang -O2 -target bpf -D__BPF__ $(INCLUDES) -c $< -o $@ -g

Key compiler flags:

-target bpf — emit BPF bytecode instead of native code
-O2 — required for the BPF verifier to accept the program (without optimization, the code often contains constructs the verifier rejects)
-D__BPF__ — activates kernel-side type definitions in the shared header
-g — includes debug info for bpftool introspection

How DPDK Loads the XDP Program

An important detail: you don’t load the XDP program yourself. DPDK’s AF_XDP PMD handles this automatically.

When you pass --vdev=net_af_xdp0,iface=eth0,xdp_prog=xdp_fwd.o to your DPDK application’s EAL arguments, the AF_XDP PMD:

Opens the specified .o file and loads the XDP program onto the network interface
Creates AF_XDP sockets (one per RX queue) and populates the xsks_map with their file descriptors
Attaches the XDP program to the interface so it runs on every incoming packet

From your application’s perspective, DPDK starts receiving packets through rte_eth_rx_burst() as usual — but now the XDP program is running in front of it, and any BPF maps defined with LIBBPF_PIN_BY_NAME are automatically pinned to /sys/fs/bpf/.

This means the deployment workflow is straightforward:

Compile the BPF program: make (produces xdp_fwd.o)
Start your DPDK application with the AF_XDP vdev argument pointing to the .o file
DPDK loads the program, creates sockets, pins maps — all transparently
Your application opens the pinned maps and manages rules at runtime

No ip link set dev eth0 xdp obj ..., no bpftool prog load, no manual socket creation. The AF_XDP PMD is a single integration point that wires together the XDP program, the AF_XDP sockets, and the DPDK poll loop.

Userspace Control Plane

The DPDK application manages the BPF maps at runtime — adding rules when flows are established, deleting them on teardown, and reading stats back. All of this happens through libbpf’s map manipulation API on pinned file descriptors.

Detecting AF_XDP Mode

When DPDK initializes its Ethernet ports, you can query each port’s driver name. If it reports net_af_xdp, you know an XDP program is loaded on the interface and the BPF maps are available:

static int detect_afxdp(uint16_t port_id)
{
  struct rte_eth_dev_info dev_info;
  if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
    return 0;

  return strcmp(dev_info.driver_name, "net_af_xdp") == 0;
}

This check gates all offload logic — when DPDK uses a hardware PMD (e.g., mlx5, ixgbe), the BPF maps don’t exist and there’s nothing to manage.

Accessing Pinned Maps

BPF maps declared with LIBBPF_PIN_BY_NAME are pinned to /sys/fs/bpf/<map_name>. From userspace, bpf_obj_get() opens a pinned map and returns a file descriptor you can use with all bpf_map_* functions:

#include <bpf/bpf.h>

int rules_fd = bpf_obj_get("/sys/fs/bpf/xdp_fwd_rules");
if (rules_fd < 0) {
  fprintf(stderr, "Cannot open rules map: %s\n", strerror(errno));
  return -1;
}

/* Use rules_fd with bpf_map_update_elem / bpf_map_lookup_elem / etc. */

close(rules_fd);  /* close when done */

This is the bridge between kernel and userspace — the same hash table the XDP program reads at packet speed is directly writable from your DPDK application.

Installing a Rule

To offload a flow, build the match key and rewrite value, then insert into the rules map. Here’s an example that offloads a bidirectional UDP flow between a client (10.0.1.50:20000) and a server (10.0.2.1:40000), with the proxy listening on 10.0.0.100:30000:

#include <arpa/inet.h>

/* Ingress rule: client → proxy, rewrite to → server */
struct flow_key ingress_key = {
  .ifindex = 3,                             /* rx interface */
  .srcip   = inet_addr("10.0.1.50"),
  .dstip   = inet_addr("10.0.0.100"),
  .srcport = htons(20000),
  .dstport = htons(30000),
};

struct flow_value ingress_val = {
  .rewrite = {
    .ifindex = 4,                           /* tx interface */
    .srcip   = inet_addr("10.0.2.1"),
    .dstip   = inet_addr("10.0.1.50"),
    .srcport = htons(40000),
    .dstport = htons(20000),
  },
};

bpf_map_update_elem(rules_fd, &ingress_key, &ingress_val, BPF_ANY);

/* Egress rule: server → proxy, rewrite to → client */
struct flow_key egress_key = {
  .ifindex = 4,
  .srcip   = inet_addr("10.0.1.50"),
  .dstip   = inet_addr("10.0.2.1"),
  .srcport = htons(20000),
  .dstport = htons(40000),
};

struct flow_value egress_val = {
  .rewrite = {
    .ifindex = 3,
    .srcip   = inet_addr("10.0.0.100"),
    .dstip   = inet_addr("10.0.1.50"),
    .srcport = htons(30000),
    .dstport = htons(20000),
  },
};

bpf_map_update_elem(rules_fd, &egress_key, &egress_val, BPF_ANY);

Each bidirectional flow needs two rules — one per direction. When the XDP program matches the ingress key, it rewrites the packet and forwards it out interface 4 toward the server. The egress rule handles the return path.

Lazy Offload

A subtlety: you often can’t install rules immediately when a flow is created. If the client is behind NAT, you don’t know its real source IP/port until the first packet arrives. A practical pattern is to let the first packet pass through DPDK (via AF_XDP), learn the NAT-translated address from the packet headers, then install the XDP rule so all subsequent packets bypass userspace:

void on_first_packet(struct flow_entry *flow, struct rte_mbuf *pkt)
{
  struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(pkt,
    struct rte_ipv4_hdr *, sizeof(struct rte_ether_hdr));
  struct rte_udp_hdr *udp = (struct rte_udp_hdr *)((char *)ip + sizeof(*ip));

  /* Now we know the real client address (NAT resolved) */
  flow->client_real_ip   = ip->src_addr;
  flow->client_real_port = udp->src_port;

  /* Install both XDP offload rules */
  install_xdp_rules(rules_fd, flow);
  flow->offloaded = 1;
}

After this point, the XDP program handles the flow at kernel speed. DPDK only sees the first packet.

Deleting Rules

When a flow ends (session teardown, timeout, etc.), delete both rules from the map:

void teardown_flow(int rules_fd, struct flow_entry *flow)
{
  bpf_map_delete_elem(rules_fd, &flow->ingress_key);
  bpf_map_delete_elem(rules_fd, &flow->egress_key);
  flow->offloaded = 0;
}

If NAT information changes mid-flow (the client’s address shifts), delete the old rules and install new ones with the updated addresses.

Reading Per-Flow Stats

Since the XDP program updates counters inside the rule values, userspace can read them back at any time with a simple lookup:

struct flow_value val;
if (bpf_map_lookup_elem(rules_fd, &ingress_key, &val) == 0)
{
  printf("packets: %llu, bytes: %llu\n", val.packets, val.bytes);
}

A background timer (every few seconds) can iterate all offloaded flows and sync their BPF-side stats into whatever reporting structures your application uses.

Reading Per-CPU Stats

The xdp_counters map is a PERCPU_ARRAY — each bpf_map_lookup_elem call returns an array of values, one per CPU. Sum them for the global total:

int stats_fd = bpf_obj_get("/sys/fs/bpf/xdp_counters");
int num_cpus = sysconf(_SC_NPROCESSORS_CONF);
uint64_t percpu[num_cpus];

for (uint32_t key = 0; key < XDP_STAT_MAX; key++)
{
  uint64_t total = 0;
  if (bpf_map_lookup_elem(stats_fd, &key, percpu) == 0)
    for (int i = 0; i < num_cpus; i++)
      total += percpu[i];

  printf("stat[%u] = %llu\n", key, total);
}

IP Allowlist Management

The xdp_local_ips map controls which destination IPs get steered to AF_XDP. Populate it with every IP your application listens on:

int ips_fd = bpf_obj_get("/sys/fs/bpf/xdp_local_ips");

/* Add an IP to the allowlist */
uint32_t ip = inet_addr("10.0.0.100");
struct ip_counter val = { .packets = 0 };
bpf_map_update_elem(ips_fd, &ip, &val, BPF_ANY);

/* Remove an IP */
bpf_map_delete_elem(ips_fd, &ip);

When your set of served IPs changes, iterate the existing map entries, delete any that are no longer needed, and add the new ones. This ensures packets to removed IPs fall back to the kernel stack immediately.

Monitoring

The per-CPU stats map and per-rule counters enable real-time monitoring. Userspace sums the per-CPU values and formats a dashboard:

XDP datapath stats:
  Rewrite hits         1284923
  Rewrite misses       42
  FIB drops            0
  Hairpin TX           641200
  Redirect out         643723
  AF_XDP -> DPDK       38
  Pass to kernel       157

Per-rule detail includes live PPS and BPS:

Rule Detail
----------------------------------------
Match (Key):
  Interface:   ifindex 3
  Source:      10.0.1.50:20000
  Destination: 10.0.0.100:30000

Rewrite (Value):
  Interface:   ifindex 4
  Source:      10.0.2.1:40000
  Destination: 10.0.1.50:20000

Counters:
  Packets:     641200
  Bytes:       51296000
  PPS:         50
  BPS:         40000

Summary

The hybrid XDP + AF_XDP architecture gives us the best of both worlds:

Write a BPF program that does fast-path packet rewriting using hash-map lookups
Define shared structures in a header compiled by both clang -target bpf and your C compiler
Let DPDK load the program — the AF_XDP PMD handles XDP attachment, socket creation, and map pinning
Pin BPF maps to /sys/fs/bpf/ so your userspace application can manage rules at runtime
Use the IP allowlist to steer traffic: matched IPs go to AF_XDP (DPDK), everything else to the kernel
Offload lazily — let the first packet through DPDK to resolve NAT, then install XDP rules

The key insight is that XDP and AF_XDP are not competing technologies — they compose naturally. AF_XDP gives your DPDK application a kernel-bypass receive path, and XDP gives you a programmable fast path in front of it. By combining both, the kernel handles the steady-state data plane while DPDK handles the exceptions.

References

BPF and XDP Reference Guide — comprehensive overview of BPF program types, map types, and helper functions
AF_XDP — scalable packet processing in the kernel — official kernel documentation for AF_XDP sockets
XDP — eXpress Data Path — kernel documentation covering XDP actions, attachment modes, and driver support
libbpf API documentation — reference for bpf_obj_get(), bpf_map_update_elem(), and other userspace BPF functions
DPDK AF_XDP Poll Mode Driver — DPDK documentation for the net_af_xdp PMD, including xdp_prog parameter and socket map handling
bpf_fib_lookup() helper — man page for BPF helper functions including FIB lookup, checksum diff, and redirect
XDP Tutorial — hands-on exercises for XDP programming, packet parsing, and map usage
LIBBPF_PIN_BY_NAME and map pinning — explanation of BPF object lifetime and pinning to /sys/fs/bpf/

Learning VPP: Building DPDK with debug symbols

Posted on June 13, 2023 by Denys Haryachyy

logo_fdio-300x184

Overview

The goal is to build DPDK which is an external package for VPP together with debug symbols to be able to debug inside DPDK source code.

Version

VPP version is 23.02

Configuration

We need to modify two files:

build/external/deb/debian/rules
build/external/packages/dpdk.mk

dpdk.mk

The following flag has to be enabled.

DPDK_DEBUG ?= y

rules

The following lines have to be added.

override_dh_strip:
dh_strip --exclude=librte

Rebuild

sudo dpkg -r vpp-ext-deps
make install-ext-dep
make rebuild

gdb

When running gdb we need to specify the path to DPDK sources.

set substitute-path '../src-dpdk/' 
'/home/projects/vpp/build/external/downloads/dpdk-22.07'

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Learning DPDK: Traffic generator TRex

Posted on October 19, 2019 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

Overview

TRex is a stateful and stateless traffic generator based on DPDK. TCP stack implementation leverages BSD4.4 original code.

Setup

Ubuntu 18.04 server installed on VirtualBox VM with two interfaces connected in a loopback, 4 CPUs, 4G RAM.

Install

Download and build the latest Trex on Ubuntu 18.04.

sudo apt -y install zlib1g-dev build-essential python python3-distutils
git clone https://github.com/cisco-system-traffic-generator/trex-core.git
cd trex-core/
cd linux_dpdk
./b configure
./b build
cd ..
sudo cp scripts/cfg/simple_cfg.yaml /etc/trex_cfg.yaml

Find out the PCI IDs of the interfaces to be used by Trex.

lspci | grep Eth
00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)
00:08.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)
00:09.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)

trex_cfg.yaml

Edit TRex config file by changing PCI ids.

- port_limit : 2
version : 2
#List of interfaces. Change to suit your setup. Use ./dpdk_setup_ports.py -s to see available options
interfaces : ["00:08.0","00:09.0"]
port_info : # Port IPs. Change to suit your needs. In case of loopback, you can leave as is.
- ip : 1.1.1.1
default_gw : 2.2.2.2
- ip : 2.2.2.2
default_gw : 1.1.1.1

Run server

Run TRex in a stateful mode.

cd scripts/
sudo ./t-rex-64 -i --astf

Run console

Generate HTTP flows.

cd scripts/
./trex-console
trex> start -f astf/http_simple.py -m 1000 -d 1000 -l 1000
trex> tui

Traffic profile (`http_simple.py`)

from trex_astf_lib.api import *
class Prof1():
    def get_profile(self):
        # ip generator
        ip_gen_c = ASTFIPGenDist(ip_range=["10.10.10.0", "10.10.10.255"],
                                 distribution="seq")
        ip_gen_s = ASTFIPGenDist(ip_range=["20.20.20.0", "20.20.20.255"],
                                  distribution="seq")
        ip_gen = ASTFIPGen(glob=ASTFIPGenGlobal(ip_offset="1.0.0.0"),
                           dist_client=ip_gen_c,
                           dist_server=ip_gen_s)

        return ASTFProfile(default_ip_gen=ip_gen,
                            cap_list=[ASTFCapInfo(
                                      file="../avl/delay_10_http_browsing_0.pcap"
                                      cps=1)
                                     ])

def register():
    return Prof1()

Results

Monitor flow statistics by pressing “Esc” and “t” buttons in “tui” mode.

References

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Learning DPDK: Understanding the Benefits of Huge Pages

Posted on April 17, 2019 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

Intro

Modern CPUs support different page sizes, e.g. 4K, 2M, 1GB. All page sizes, except 4K, are named “huge pages” in Linux. The reason for this name convention is historical and stems from the fact that originally Linux supported 4K page size only.

Big page sizes are beneficial for performance as far as fewer translations between virtual and physical addresses happen and Translation Lookaside Buffer (TLB) cache is a scarce resource.

To check the size of TLB the following utility can be used.

cpuid | grep -i tlb cache and TLB information (2): 0x63: data TLB: 1G pages, 4-way, 4 entries 0x03: data TLB: 4K pages, 4-way, 64 entries 0x76: instruction TLB: 2M/4M pages, fully, 8 entries 0xb6: instruction TLB: 4K, 8-way, 128 entries 0xc3: L2 TLB: 4K/2M pages, 6-way, 1536 entries

To check the number of allocated huge pages the following command can be used.

cat /proc/meminfo | grep Huge AnonHugePages: 4409344 kB HugePages_Total: 32 HugePages_Free: 32 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 1048576 kB

There are two types of huge pages available in the Linux.

Transparent (Anonymous) huge pages
Persistent huge pages

Transparent huge pages

Transparent huge pages is an abstraction layer that automates most aspects of creating, managing and using huge pages. As far as there existed some issues with performance and stability, DPDK does not rely on this mechanism but uses persistent huge pages.

Persistent huge pages

Persistent huge pages have to be configured manually. Persistent huge pages are never swapped by the Linux kernel.

The following management interfaces exist in Linux to allocate the persistent huge pages.

Shared memory using shmget()
HugeTLBFS is a RAM-based filesystem and mmap(), read() or memfd_create() can be used to access its files
Anonymous mmap() by specifying the flags MAP_ANONYMOUS and MAP_HUGETLB flags
libhugetlbfs APIs
Automatic backing of memory regions

Persistent huge pages are used in DPDK by default, mount points are discovered automatically and pages are released once application exits. But in case a user needs to manually tune something, the following EAL command line parameters could be used.

--huge-dir Use specified hugetlbfs directory instead of autodetected ones.
--huge-unlink Unlink huge page files after creating them (implies no secondary process support).
--in-memory Recent DPDK versions added an option to not rely on hugetlbfs

There are multiple ways to set up persistent huge pages.

On the boot
In run-time

In boot time

Modify Linux boot time parameters inside /etc/default/grub. Huge pages will be spread equally between all NUMA sockets.
GRUB_CMDLINE_LINUX="default_hugepagesz=1G hugepagesz=1G hugepages=32"

Update the grub configuration file and reboot.

grub2-mkconfig -o /boot/grub2/grub.cfg
reboot

Create a folder for a permanent mount point of hugetlbfs

mkdir /mnt/huge

Add the following line to the /etc/fstab file:
nodev /mnt/huge hugetlbfs defaults 0 0

In runtime

Update number of huge pages for each NUMA node. Default huge page size cannot be modified in the runtime.
echo 16 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages echo 16 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

Create a mount point.

mkdir /mnt/huge

Mount hugetlbfs
mount -t hugetlbfs nodev /mnt/huge

Memory allocation

While there are many ways to allocate persistent huge pages, DPDK is using the following.

mmap() call with hugetlbfs mount point
mmap() call with MAP_HUGETLB flag
memfd_create() call with MFD_HUGETLB flag

References

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Learning DPDK : Capture to Kafka

Posted on March 19, 2019 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

Requirements

The requirements are as follows.

To capture 10Gbps of 128B packets into Apache Kafka
Implement basic filtering using IP addresses
Save traffic into one Kafka topic

Hardware

Dell PowerEdge R730
Intel Xeon CPU E5-2660 v4 @ 2.00GHz
Intel 82599ES 10G NIC
Hardware RAID controller PERC H730 Mini for 8 disks
6 SCSI disks in RAID-0 configuration

Software

DPDK 18.11
Apache Kafka 2.12-2.1.0
librdkafka v0.11.1

Solution

The following key design ideas helped to achieve 5Gbps capture speed.

Use XFS filesystem
Combine small packets into the big Kafka messages, 500KB each
Run 4 Kafka brokers on one physical server simultaneously
Allocate 20 partitions per topic

Conclusion

The decision was made to use two servers to be able to capture all 10Gbps traffic.

References

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Learning DPDK: Understanding Symmetric RSS for Efficient Packet Processing

Posted on January 18, 2019 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

Overview

Receive side scaling (RSS) is a technology that enables the distribution of received packets between multiple RX queues using a predefined hash function. It enabled multicore CPU to process packets from different queues on different cores.

Symmetric RSS promise is to populate two-way packets from the same TCP connection to the same RX queue. As a result statistics on the different connections could be stored in the per-queue data structures avoiding any need for locking.

System

Recently I had a chance to test symmetric RSS on two NICs from Intel, namely XL710 40G and 82599 10G.

The approach to configuring symmetric RSS in XL710 is different from the standard DPDK approach. Thus i40e driver offers specific API for this purpose.

DPDK 18.05.1 was used for testing.

82599 solution

To make symmetric RSS work a default hash key has to be replaced with a custom one.

.gist table { margin-bottom: 0; }

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters

Show hidden characters

	#define RSS_HASH_KEY_LENGTH 40
	static uint8_t hash_key[RSS_HASH_KEY_LENGTH] = {
	0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
	0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
	0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
	0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
	0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
	};

	struct rte_eth_conf port_conf = {
	.rxmode = {
	.mq_mode = ETH_MQ_RX_RSS,
	},
	.rx_adv_conf = {
	.rss_conf = {
	.rss_key = hash_key,
	.rss_key_len = RSS_HASH_KEY_LENGTH,
	.rss_hf = ETH_RSS_IP \|
	ETH_RSS_TCP \|
	ETH_RSS_UDP \|
	ETH_RSS_SCTP,
	}
	},
	};

	rte_eth_dev_configure(port_id, rx_queue_num, tx_queue_num, &port_conf);

view raw

Symmetric RSS in DPDK for 82599

hosted with ❤ by GitHub

XL710 solution

To enable symmetric RSS i40e driver provides API to setup hardware registers.

.gist table { margin-bottom: 0; }

Show hidden characters

	struct rte_eth_conf port_conf = {
	.rxmode = {
	.mq_mode = ETH_MQ_RX_RSS,
	},
	.rx_adv_conf = {
	.rss_conf = {
	.rss_hf = ETH_RSS_IP \|
	ETH_RSS_TCP \|
	ETH_RSS_UDP \|
	ETH_RSS_SCTP,
	}
	},
	};

	rte_eth_dev_configure(port_id, rx_queue_num, tx_queue_num, &port_conf);

	int sym_hash_enable(int port_id, uint32_t ftype, enum rte_eth_hash_function function)
	{
	struct rte_eth_hash_filter_info info;
	int ret = 0;
	uint32_t idx = 0;
	uint32_t offset = 0;

	memset(&info, 0, sizeof(info));

	ret = rte_eth_dev_filter_supported(port_id, RTE_ETH_FILTER_HASH);
	if (ret < 0) {
	DPDK_ERROR(“RTE_ETH_FILTER_HASH not supported on port: %d”,
	port_id);
	return ret;
	}

	info.info_type = RTE_ETH_HASH_FILTER_GLOBAL_CONFIG;
	info.info.global_conf.hash_func = function;

	idx = ftype / UINT64_BIT;
	offset = ftype % UINT64_BIT;
	info.info.global_conf.valid_bit_mask[idx] \|= (1ULL << offset);
	info.info.global_conf.sym_hash_enable_mask[idx] \|=
	(1ULL << offset);

	ret = rte_eth_dev_filter_ctrl(port_id, RTE_ETH_FILTER_HASH,
	RTE_ETH_FILTER_SET, &info);
	if (ret < 0)
	{
	DPDK_ERROR(“Cannot set global hash configurations”
	“on port %u”, port_id);
	return ret;
	}

	return 0;
	}

	int sym_hash_set(int port_id, int enable)
	{
	int ret = 0;
	struct rte_eth_hash_filter_info info;

	memset(&info, 0, sizeof(info));

	ret = rte_eth_dev_filter_supported(port_id, RTE_ETH_FILTER_HASH);
	if (ret < 0) {
	DPDK_ERROR(“RTE_ETH_FILTER_HASH not supported on port: %d”,
	port_id);
	return ret;
	}

	info.info_type = RTE_ETH_HASH_FILTER_SYM_HASH_ENA_PER_PORT;
	info.info.enable = enable;
	ret = rte_eth_dev_filter_ctrl(port_id, RTE_ETH_FILTER_HASH,
	RTE_ETH_FILTER_SET, &info);

	if (ret < 0)
	{
	DPDK_ERROR(“Cannot set symmetric hash enable per port “
	“on port %u”, port_id);
	return ret;
	}

	return 0;
	}

	sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_TCP, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
	sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_UDP, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
	sym_hash_enable(port_id, RTE_ETH_FLOW_FRAG_IPV4, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
	sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_SCTP, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
	sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_OTHER, RTE_ETH_HASH_FUNCTION_TOEPLITZ);

	sym_hash_set(port_id, 1);

view raw

Symmetric RSS in DPDK for i40e

hosted with ❤ by GitHub

References

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Learning DPDK: NUMA optimization

Posted on July 11, 2018 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

Overview

To get the maximum performance on NUMA system, the underlying architecture has to be taken into account.

To spot the problems in your data design, there exists a handy tool, called “perf c2c”. Where C2C stands for Cache To Cache. The output of the tool will provide statistics about the access to a data on the remote NUMA socket.

Run

Record PMU counters.

perf c2c record -F 99 -g -- binary

Analyze in interactive mode.
perf c2c report

Analyze in text mode.
perf c2c report --stdio

For example summary in a text mode could look as follows.
================================================= Trace Event Information ================================================= Total records : 5621889 Locked Load/Store Operations : 10032 Load Operations : 741529 Loads - uncacheable : 7 Loads - IO : 0 Loads - Miss : 8299 Loads - no mapping : 18 Load Fill Buffer Hit : 533018 Load L1D hit : 109495 Load L2D hit : 4337 Load LLC hit : 61245 Load Local HITM : 9673 Load Remote HITM : 12528 Load Remote HIT : 780 Load Local DRAM : 4593 Load Remote DRAM : 7209 Load MESI State Exclusive : 11802 Load MESI State Shared : 0 Load LLC Misses : 25110 LLC Misses to Local DRAM : 18.3% LLC Misses to Remote DRAM : 28.7% LLC Misses to Remote cache (HIT) : 3.1% LLC Misses to Remote cache (HITM) : 49.9% Store Operations : 4880360 Store - uncacheable : 0 Store - no mapping : 178126 Store L1D Hit : 4696772 Store L1D Miss : 5462 No Page Map Rejects : 1095 Unable to parse data source : 0 ================================================= Global Shared Cache Line Event Information ================================================= Total Shared Cache Lines : 10898 Load HITs on shared lines : 88830 Fill Buffer Hits on shared lines : 39884 L1D hits on shared lines : 8717 L2D hits on shared lines : 86 LLC hits on shared lines : 25798 Locked Access on shared lines : 5336 Store HITs on shared lines : 5953 Store L1D hits on shared lines : 5633 Total Merged records : 28154

References

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Learning DPDK: Understanding the Benefits of Inlining for Function Optimization

Posted on July 9, 2018 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

Overview

Inlining method can help to mitigate the following:

Function call overhead;
Pipeline stall.

It is advised to apply the method for the following types of routines:

Trivial and small functions used as accessors to data or wrappers around another function;
Big functions called quite regularly but not from many places.

Solution

A modern compiler uses heuristics to decide which functions need to be inlined. But it is always better to give it a hint using the following keywords.

static inline

Moreover to make a decision instead of the gcc compiler the following attribute should be used.

__attribute__((always_inline))

References

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Learning DPDK: Branch Prediction

Posted on June 24, 2018 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

Overview

It is well-known that modern CPUs are built using the instructions pipelines that enable them to execute multiple instructions in parallel. But in case of conditional branches within the program code, not all the instructions are executed each time. As a solution, a speculative execution and branch prediction mechanisms are used to further speed up performance by guessing and executing one branch ahead of time. The problem is that in case of the wrong guess, the results of the execution have to be discarded and correct instructions have to be loaded into the instruction cache and executed on the spot.

Solution

An application developer should use macros likely and unlikely that are shortcuts for gcc __builtin_expect directive. The purpose of these macros is to give the compiler a hint which path will be taken more often and as a result, decreasing percentage of branch prediction misses.

References

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Learning DPDK: make your data cache friendly with pahole tool

Posted on June 15, 2018 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

Overview

Taking into account orders of magnitude between speed access to different cache levels and RAM itself, it is advised to carefully analyze C data structures that are used frequently on cache friendliness. The idea is to have the most often accessed data (“hot”) to stay in a higher level cache as long as possible. And the following technics are used.

Group “hot” members together in the beginning and push “cold” to the end;
Minimize structure size by avoiding padding;
Align data to cache line size.

You can find a great description of why and how the data structures are laid out by compilers here.

Poke-a-hole (pahole) analyzes the object file and outputs detailed description of each and every structure layout created by a compiler.

Run

Analyze the file.
pahole a.out
Analyze one structure.
pahole a.out -C structure
Get suggestion on improvements.
pahole --show_reorg_steps --reorganize -C structure a.out

References

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Learning DPDK: Profiling with Flame Graphs

Posted on June 14, 2018 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

Overview

perf is a great tool to profile the application. The problem is that it generates an enormous amount of text information that is difficult to analyze. Brendan Gregg developed a bunch of handy scripts to visualize perf results.

These tools generate a graph that represents call stacks and a relative execution time of each function.

Generate a graph

git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -ag — sleep 60
perf script | ./stackcollapse-perf.pl > out.perf-folded
cat out.perf-folded | ./flamegraph.pl > perf.svg

Analyze

Open the graph in a browser;
Point to a bar for an execution statistics;
Click on a bar to zoom;
Use search “Ctrl-F”.

References

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Learning DPDK : DPI with Hyperscan

Posted on June 9, 2018 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

Why

To know which application generates monitored traffic it is not enough to know TCP/IP address and port but a look inside HTTP header is required.

How

HTTP header is analyzed against a collection of strings. Each string is associated with some protocol, like facebook, google chat, etc.

Complications

String search is a slow operation and to be made fast could leverage smart algorithms and HW optimization technics.

Solution

Regex library called Hyperscan. You can listen for the introduction of the library here. The speed of the library was evaluated here.

Integration

Install binary prerequisites

yum install ragel libstdc++-static

Download Hyperscan sources

wget https://github.com/intel/hyperscan/archive/v4.7.0.tar.gz tar -xf v4.7.0.tar.gz

Download boost headers

wget https://dl.bintray.com/boostorg/release/1.67.0/source/boost_1_67_0.tar.gz tar -xf boost_1_67_0.tar.gz cp -r boost_1_67_0/boost hyperscan-4.7.0/include

Build and install Hyperscan shared library

Just follow the instruction from here.
cd hyperscan-4.7.0 mkdir build cd build cmake -DBUILD_SHARED_LIBS=true .. make make install

Link DPDK app against Hyperscan

Modify Makefile as follows.
CFLAGS += -I/usr/local/include/hs/ LDFLAGS += -lhs

Build a database from a list of strings

Use hs_compile_multi() with an array of strings that you need to grep. To escape a string use \Q and \E symbols from PCRE syntax.

Search

Use hs_scan() API
Check simplegrep example for more details.

References

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Learning DPDK: Implementing Java Support for Improved Performance and Flexibility

Posted on May 4, 2018 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

Overview

DPDK framework is written in C with a purpose to be fast and be able to use HW optimization technics. But there are many languages that people use to write a software. And one of the most popular languages is Java.

So we had a project with a goal to develop a packet capturing Java library. To marry DPDK with JAVA we have chosen JNI.

Building blocks

We have chosen the following approach to create a library that could be linked to Java application.

Build DPDK as a set of dynamic libraries.
You need to enable CONFIG_RTE_BUILD_SHARED_LIB in the configuration.
Generate C headers using JNI.
Build own dynamic library using DPDK build system.
You need to include rte.extshared.mk in the library Makefile

Communication between DPDK and JAVA

There are two directions of communication, i.e. from JAVA application to DPDK and the opposite.

Here you need to follow JNI guidelines with the following exceptions.

Do not use DPDK native thread for communication with JAVA but create a dedicated thread using pthread. Otherwise, we observed a crash.
Use JAVA static methods. It is not clear why, but we could not use the regular JAVA methods.

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Learning DPDK : Packet capturing

Posted on March 14, 2018 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

A new project has its goal to capture 40G rate traffic on a specified schedule.

Why

To analyze

security breaches,
misbehaviours or
faulty appliances

it is utterly useful to have virtual traces fully recorded.

What

You can record the whole Ethernet packet.
You can trim its payload in case only headers are important for later analysis.
You can filter the traffic based on IP address and TCP/UDP port.

How

First, capture the traffic into the RAM.
Second, store it on disk.

Complications

Average SSD disk speed is about 500 MB/s
SATA 3.0 speed is 6Gb/S

Solution

It looks a solution could be one of the following or both

RAID
PCI + high-speed SSD disk

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Learning DPDK : Deep Packet Inspection 40/100G

Posted on October 12, 2017 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

Currently, I am working on the project with a goal to analyze application protocols. That results in a functionality similar to a stateful firewall.

The main challenges:

Parsing on line-rate
Storing state

The following hardware parts were chosen:

FPGA based smart NIC supported by DPDK
Powerful multicore Intel server with lots of RAM

To sustain high speed and enable scalability, main DPDK application design principals have to be respected:

Avoid thread synchronization
Avoid heavy operations such as memcpy, etc.
Take NUMA into the consideration

Thus the design of the resulting application is presented in the below diagram. And its cornerstone is the ability of NIC to distribute bidirectional flows between multiple CPUs, i.e. Receive Side Scaling.

As a result, packets from one flow are always handled by the same CPU core. Furthermore, each thread is using its own dedicated data structures, e.g. hash table, without a need for synchronization.

XDR design

The flow entry is allocated from a memory pool and the hash table is used to quickly locate the flow entry by a key tuple, consisting of source/destination pair of IP address and TCP/UDP port.

In the next article, I am going to describe specifics of the cards that I plan to test, i.e Napatech and Netcope.

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Learning DPDK: Algorithms for Line Speed Switching and Routing

Posted on June 14, 2016 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

DPDK library contains a lot of things that help the application that leverages its API to achieve line speed switching/routing. To name a few:
1. Poll drivers (recently enhanced with interrupt support)
2. Hardware optimizations (SSE, AVX2, etc.)
3. OS optimizations (Huge tables, etc.)
4. Lookup algorithms (Hash, LPM, etc.)

The following slides is an incomplete overview of the packet lookup algorithms supported in the current version of DPDK.

Good books on the topic are:
High Performance Switches and Routers

Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Learning DPDK: Setting Up Vagrant VM for DPDK Development

Posted on June 14, 2016 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

It is very convenient to develop DPDK based solution using one or multiple VMs. My choice of tools for this task is a combination of vagrant and virtualbox tools. This pair makes deployment of VM really simple and easily reproducible.

The following slides give a short overview of the vagrant usage scenario.

My vagrant configuration file looks as follows.

.gist table { margin-bottom: 0; }

Show hidden characters

	# -- mode: ruby --
	# vi: set ft=ruby :

	# Vagrantfile API/syntax version. Don’t touch unless you know what you’re doing!
	VAGRANTFILE_API_VERSION = “2”

	Vagrant.configure(VAGRANTFILE_API_VERSION) do \|config\|
	config.vm.box = “trusty64”
	config.vm.provision :shell, path: “../bootstrap.sh”

	config.vm.network “public_network”, bridge: “br-1”, ip: “192.168.56.1”
	config.vm.network “public_network”, bridge: “br-2”, ip: “192.168.57.1”
	config.vm.network “public_network”, bridge: “br-3”, ip: “192.168.58.1”

	config.vm.provider :virtualbox do \|vb\|
	# # Don’t boot with headless mode
	# vb.gui = true
	#
	# # Use VBoxManage to customize the VM. For example to change memory:
	vb.customize [“modifyvm”, :id, “–memory”, “1024”]
	vb.customize [“modifyvm”, :id, “–cpuexecutioncap”, “50”]
	vb.customize [“modifyvm”, :id, “–ioapic”, “on”]
	vb.customize [“modifyvm”, :id, “–cpus”, “4”]
	end
	end

view raw

Vagrantfile

hosted with ❤ by GitHub

My bootstrap.sh file contains the following.

#!/bin/bash
apt-get -y install git libtool autoconf g++ gdb

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Learning DPDK: Understanding KNI Interface for Kernel Network Communication

Posted on February 7, 2016 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

KNI (Kernel Network Interface) is an approach that is used in DPDK to connect user space applications with the kernel network stack.

The following slides present the concept using a number of functional block diagrams.

The code that we are interested in is located by the following locations.

Sample KNI application
- example/kni
KNI kernel module
- lib/librte_eal/linuxapp/kni
KNI library
- lib/librte_kni

To begin testing KNI we need to build DPDK libraries first

git clone git://dpdk.org/dpdk
export RTE_SDK=~/dpdk/
make config T=x86_64-native-linuxapp-gcc O=x86_64-native-linuxapp-gcc
cd x86_64-native-linuxapp-gcc
make

Then we need to compile KNI sample application

cd ${RTE_SDK}/examples/kni
export RTE_TARGET=x86_64-native-linuxapp-gcc
make

To run the above application we need to load KNI kernel module

insmod ${RTE_SDK}/${RTE_TARGET}/kmod/rte_kni.ko

The following kernel module options are available in case if a loopback mode is required.

kthread_mode=single/multiple – number of kernel threads
lo_mode=lo_mode_fifo/lo_mode_fifo_skb – loopback mode

Enable enough huge pages

mkdir -p /mnt/huge
mount -t hugetlbfs nodev /mnt/huge
echo 512 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

Load UIO kernel module and bind network interfaces to it. Note that you will not be able to bind interface in case if there exist any route associated with it.

modprobe uio_pci_generic
${RTE_SDK}/tools/dpdk_nic_bind.py --status
${RTE_SDK}/tools/dpdk_nic_bind.py --bind=uio_pci_generic eth1
${RTE_SDK}/tools/dpdk_nic_bind.py --bind=uio_pci_generic eth2
${RTE_SDK}/tools/dpdk_nic_bind.py --status

In the case of PC/VM with four cores we can run KNI application using the following commands.

export LD_LIBRARY_PATH=${RTE_SDK}/${RTE_TARGET}/lib/
${RTE_SDK}/examples/kni/build/kni -c 0x0f -n 4 -- -P -p 0x3 --config="(0,0,1),(1,2,3)"

Where:

-c = core bitmask
-P = promiscuous mode
-p = port hex bitmask
–config=”(port, lcore_rx, lcore_tx [,lcore_kthread, …]) …”

Note that each core can do either TX or RX for one port only.

You can use the following script to setup and run KNI test application.

.gist table { margin-bottom: 0; }

Show hidden characters

	#/bin/sh

	#setup path to DPDK
	export RTE_SDK=/home/dpdk
	export RTE_TARGET=x86_64-native-linuxapp-gcc

	#setup 512 huge pages
	mkdir -p /mnt/huge
	umount -t hugetlbfs nodev /mnt/huge
	mount -t hugetlbfs nodev /mnt/huge
	echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

	#bind eth1 and eth2 to Linux generic UIO
	modprobe uio_pci_generic

	${RTE_SDK}/tools/dpdk_nic_bind.py –bind=uio_pci_generic eth1
	${RTE_SDK}/tools/dpdk_nic_bind.py –bind=uio_pci_generic eth2

	#insert KNI kernel driver
	insmod ${RTE_SDK}/${RTE_TARGET}/kmod/rte_kni.ko

	#start KNI sample application
	export LD_LIBRARY_PATH=${RTE_SDK}/${RTE_TARGET}/lib/
	${RTE_SDK}/examples/kni/build/kni -c 0x0f -n 4 — -P -p 0x3 –config=”(0,0,1),(1,2,3)”

view raw

start_kni.sh

hosted with ❤ by GitHub

Let’s set ip addresses to KNI interfaces

sudo ifconfig vEth0 192.168.56.100
sudo ifconfig vEth1 192.168.56.101

Now we are set to test the application. To see statistics we need to send the SIGUSR1 signal.

watch -n 10 sudo pkill -10 kni

References

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Learning DPDK: 10G traffic generator

Posted on June 9, 2015 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

DPDK was successfully used for a project to create Ostinato based traffic generator.

The following performance was achieved on 1G NIC.
Ostinato performance report for 1G link

The following performance was achieved on 10G NIC.
Ostinato performance report for 10G link

The code for this project is available on Github:
https://github.com/PLVision/ostinato-dpdk

The library responsible for DPDK is located on GitHub as well:
https://github.com/PLVision/dpdkadapter

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Deep Dive into DPDK: Understanding Performance Boosting Techniques

Posted on February 13, 2015 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

Learning DPDK : Tips and Tricks

Posted on November 10, 2014 by Denys Haryachyy

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.

Overview

Here I will present some nuances of using DPDK libraries that I learned during implementation of DPDK based custom application.

These tips could save some time for a developer implementing an independent application using DPDK libraries.

Linking DPDK library

By default DPDK sources are built in the set of libraries. But there is an option to pack all libraries in the one dynamic or static library. To achieve it the file “config/common_linuxapp” has to be modified by assigning “y” to “CONFIG_RTE_BUILD_COMBINE_LIBS” define.

Beside that the following options in the Makefile are required.

-include /.../rte_config.h
D__STDC_LIMIT_MACROS
DRTE_MAX_LCORE=64
DRTE_PKTMBUF_HEADROOM=128
DRTE_MAX_ETHPORTS=32
DRTE_MACHINE_CPUFLAG_SSE
DRTE_MACHINE_CPUFLAG_SSE2
DRTE_MACHINE_CPUFLAG_SSE3
DRTE_MACHINE_CPUFLAG_SSSE3
DRTE_COMPILE_TIME_CPUFLAGS=RTE_CPUFLAG_SSE
DRTE_CPUFLAG_SSE2
DRTE_CPUFLAG_SSE3
DRTE_CPUFLAG_SSSE3

Drop unprocessed traffic

As I have discovered it is important to setup DPDK RX queues to drop the received packets in case if they are not processed by an application. Otherwise both RX and TX will be compromised.

To enable such behavior rte_eth_rx_queue_setup has to be provided with rte_eth_rxconf structure that has rx_drop_en field assigned 1.

It could be that such issue is present not for all NICs but only for some of them.

Mbuf reference count

In case if a specific packet has to be sent many times to the same interface it is impractical and quite inefficient to copy its contents each time before transmit. Such copy operation dramatically affects datapath throughput. The reason is not only that it takes memory cycles to execute but, what is more important, it does not let to use cache efficiently.

In order to implement a zero-copy mechanism DPDK provides a notion of reference counter for each memory buffer (mbuf) that is used to store a packet.

The function rte_pktmbuf_refcnt_update could be used to increment reference counter before each send invocation. In such scenario memory buffer would not be released to a memory pool after packet was sent out the port and the same buffer could be used again later.

Jumbo frames

Jumbo frame is a name for frames that have the size greater 1500 bytes of a regular Ethernet frame. Usually its size does not exceed 9000 bytes.

To enable DPDK application to receive such big frames the “rte_eth_conf.rxmode.jumbo_frame” field has to be set to 1 and “rte_eth_conf.rxmode.max_rx_pkt_len” has to be set to the maximum frame size supported.

Afterwards rte_eth_conf structure has to be provided as a parameter to the function rte_eth_dev_configure that is used to configure a specific interface.

Besides that it has to be noted that manipulation with such frames is implemented using multiple segment memory buffers (mbuf). The following fields of mbuf are used to support segments:

pkt_segs – the first segment has to specify the overall number in the chain
data_len – a particular segment payload length
pkt_len – a total length of the payload stored in all segments in the chain
next – the pointer to the next segment or null otherwise

Barriers for synchronization

While working on multiple cores it is important to be able to send a synchronization signal from one core to another. For this purpose a simple flag (array of flags) could be used accompanied by a R/W barrier, i.e. “rte_mb”, that ensures that all threads have most relevant memory view.

References

DPDK Programmer’s Guide

📊 Want to see these techniques in action?
Check out our real-world DPDK & VPP projects →

	Learning VPP: Syncin… on Learning VPP: XFRM Plugin for…
	Learning DPDK: Under… on Learning VPP: Building DPDK wi…
	Learning VPP: VXLAN… on Learning VPP: Benchmarking TAP…
	Learning DPDK : Capt… on Learning VPP: Benchmarking TAP…
	Learning DPDK : Symm… on Learning DPDK: Understanding t…

The Problem

Architecture Overview

BPF Maps: The Shared State

Rules Map — The Forwarding Table

IP Allowlist — Steering Traffic to DPDK

AF_XDP Socket Map

Per-CPU Stats

Sharing Structures Between Kernel and Userspace

The XDP Program

Entry Point

Packet Parsing

Rewriting Headers

Incremental Checksum Update

FIB Forwarding

AF_XDP Redirection

Building the BPF Program

How DPDK Loads the XDP Program

Userspace Control Plane

Detecting AF_XDP Mode

Accessing Pinned Maps

Installing a Rule

Lazy Offload

Deleting Rules

Reading Per-Flow Stats

Reading Per-CPU Stats

IP Allowlist Management

Monitoring

Summary

References

Overview

Configuration

dpdk.mk

rules

Related Posts

Overview

Setup

Install

trex_cfg.yaml

Run server

Run console

Traffic profile (http_simple.py)

Results

References

Related Posts

Intro

Transparent huge pages

Persistent huge pages

In boot time

In runtime

Memory allocation

References

Related Posts

Requirements

Hardware

Software

Solution

Conclusion

References

Related Posts

Overview

System

82599 solution

XL710 solution

References

Related Posts

Overview

Run

References

Related Posts

Overview

Solution

References

Related Posts

Overview

Limitations

VMXNET 3 driver

ENA driver

References

Related Posts

Overview

Traffic profile (`http_simple.py`)