Learning BPF: Offloading DPDK AF_XDP Traffic with XDP

XDP BPF offload architecture diagram: packet flow from NIC through XDP hook with BPF rules map lookup, fast path via FIB forwarding, slow path through AF_XDP to DPDK userspace

When you run a DPDK application with the AF_XDP poll-mode driver, every packet destined for your application travels from the NIC through the kernel’s XDP hook and into userspace via an AF_XDP socket. That includes packets your application could forward without ever touching userspace — if only the kernel knew how.

In this post I describe how to write a BPF program that sits in front of DPDK’s AF_XDP datapath and offloads packet rewriting to the kernel. Matched flows get their IP/UDP headers rewritten and forwarded at XDP speed, while unmatched traffic passes through to DPDK as usual. The result: the kernel handles the fast path, and DPDK handles the control path.

I’ll use a UDP proxy as the running example — an application that bridges traffic between clients and servers by rewriting IP/UDP headers on every packet. The same technique applies to any DPDK AF_XDP application that forwards known flows with predictable header transformations.

The Problem

Consider a proxy that sits between clients and backend servers. For every active session, it rewrites IP addresses and UDP ports on both legs — essentially a specialized NAT. In a pure DPDK setup, every single packet flows through userspace:

Diagram of pure DPDK datapath: every packet flows from NIC through kernel, AF_XDP socket, DPDK poll loop, header rewrite, and TX — all through userspace

For a simple header rewrite on a known flow, that round trip through userspace is unnecessary overhead. The kernel’s XDP hook can do the same rewrite at line rate, before the packet ever reaches the AF_XDP socket.

The goal is a hybrid datapath where:

  • Known flows (established sessions) get rewritten and forwarded entirely in XDP
  • Unknown flows (new connections, control traffic) pass through to DPDK for full processing
  • Non-matching traffic (SSH, management) goes to the kernel stack as usual

Architecture Overview

The XDP program runs at the earliest point in the kernel’s receive path, before the normal networking stack and before AF_XDP delivery. It makes a per-packet decision:

XDP packet decision tree flowchart: parse IPv4, check UDP, lookup 5-tuple in BPF rules map, rewrite headers on hit, FIB lookup for MAC resolution, XDP_TX hairpin or XDP_REDIRECT; on miss check IP allowlist for AF_XDP redirect to DPDK or XDP_PASS to kernel stack

Three XDP actions do the heavy lifting:

  • XDP_PASS — hand the packet to the kernel networking stack (for management traffic not destined to our application)
  • XDP_TX — transmit back out the same interface the packet arrived on
  • XDP_REDIRECT — forward to a different interface, or into an AF_XDP socket for DPDK

BPF Maps: The Shared State

BPF maps are the communication channel between the XDP program (kernel) and the DPDK application (userspace). We define four maps, each with a distinct role.

Rules Map — The Forwarding Table

The core data structure is a hash map keyed by a 5-tuple (interface index + source/destination IP and port). The value contains the rewrite target and per-flow packet counters:

/* 5-tuple match key. All fields in network byte order. */
struct flow_key {
  __u32 ifindex;    /* ingress interface */
  __u32 srcip;
  __u32 dstip;
  __u16 srcport;
  __u16 dstport;
};

/* Rewrite target + counters. */
struct flow_value {
  struct flow_key rewrite;   /* new headers */
  __u64 packets;
  __u64 bytes;
};

The map definition uses LIBBPF_PIN_BY_NAME so it persists across program reloads and is accessible from userspace via /sys/fs/bpf/xdp_fwd_rules:

struct {
  __uint(type, BPF_MAP_TYPE_HASH);
  __uint(max_entries, 524288);
  __type(key, struct flow_key);
  __type(value, struct flow_value);
  __uint(pinning, LIBBPF_PIN_BY_NAME);
} xdp_fwd_rules SEC(".maps");

Each bidirectional flow creates two rules (ingress and egress), so the 524K entry limit supports up to 262K concurrent streams.

IP Allowlist — Steering Traffic to DPDK

Not every packet with a matching destination IP should be rewritten. New flows and control traffic need to reach DPDK for processing. The IP allowlist map controls which destination IPs get redirected to AF_XDP sockets when they don’t match a rewrite rule:

struct {
  __uint(type, BPF_MAP_TYPE_HASH);
  __uint(max_entries, 1024);
  __type(key, __u32);                /* IPv4 address */
  __type(value, struct ip_counter); /* packet counter */
  __uint(pinning, LIBBPF_PIN_BY_NAME);
} xdp_local_ips SEC(".maps");

The userspace application populates this map with all IP addresses it serves. If a packet’s destination IP is in this map, it gets redirected to DPDK via AF_XDP. If not, it passes to the kernel stack — ensuring SSH, DNS, and other management traffic is unaffected.

AF_XDP Socket Map

The xsks_map is an XSKMAP type that DPDK’s AF_XDP PMD manages automatically. It maps RX queue indices to AF_XDP socket file descriptors:

struct {
  __uint(type, BPF_MAP_TYPE_XSKMAP);
  __uint(max_entries, 256);
  __type(key, __u32);   /* RX queue index */
  __type(value, __u32); /* XSK file descriptor */
} xsks_map SEC(".maps");

We never write to this map from our application — DPDK populates it when it opens AF_XDP sockets during port initialization.

Per-CPU Stats

A PERCPU_ARRAY map tracks datapath counters without any locking:

struct {
  __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
  __uint(max_entries, XDP_STAT_MAX);
  __type(key, __u32);
  __type(value, __u64);
  __uint(pinning, LIBBPF_PIN_BY_NAME);
} xdp_counters SEC(".maps");

Each CPU gets its own copy of the counters. Userspace sums across all CPUs when reading. The counter indices track each decision point in the datapath:

enum xdp_stat {
  XDP_STAT_REWRITE_HIT,    /* rule matched, packet rewritten */
  XDP_STAT_REWRITE_MISS,   /* no rule for this UDP flow */
  XDP_STAT_FIB_DROP,       /* FIB lookup failed */
  XDP_STAT_HAIRPIN_TX,     /* forwarded on same interface */
  XDP_STAT_REDIRECT_OUT,   /* forwarded to different interface */
  XDP_STAT_AFXDP_REDIRECT, /* sent to DPDK via AF_XDP */
  XDP_STAT_PASS_TO_KERNEL, /* passed to kernel stack */
  XDP_STAT_MAX,
};

Sharing Structures Between Kernel and Userspace

One practical challenge: the BPF program uses kernel types (__u32, __u64), while the userspace C application uses standard types (uint32_t, uint64_t). A shared header bridges both worlds with conditional typedefs:

#ifdef __BPF__
#include <linux/types.h>
typedef __u32 fwd_u32;
typedef __u64 fwd_u64;
#else
#include <stdint.h>
typedef uint32_t fwd_u32;
typedef uint64_t fwd_u64;
#endif

The BPF Makefile passes -D__BPF__ when compiling the kernel program, while the userspace build gets the standard types. Both see identical struct layouts — critical for correct map access from either side.

The XDP Program

Entry Point

The XDP entry function xdp_main() implements the decision tree from the architecture diagram. It parses headers, attempts a rewrite, and falls back to AF_XDP redirection:

SEC("xdp")
int xdp_main(struct xdp_md *ctx)
{
  void *data     = (void *)(long)ctx->data;
  void *data_end = (void *)(long)ctx->data_end;

  struct iphdr *iph = extract_ip_hdr(data, data_end);
  if (!iph)
    return XDP_PASS;

  if (iph->protocol == IPPROTO_UDP &&
      (void *)iph + iph->ihl * 4 + sizeof(struct udphdr) < data_end)
  {
    struct udphdr *udp = (void *)iph + iph->ihl * 4;

    if (lookup_and_rewrite(ctx->ingress_ifindex, iph, udp, data_end) != -1)
      return resolve_and_forward(ctx, data, iph);
  }

  return steer_to_dpdk(iph, ctx);
}

The logic is intentionally flat: parse, try rewrite, fall back. Non-IPv4 packets pass to the kernel immediately. Non-UDP packets (like ICMP) skip the rewrite attempt and go straight to the AF_XDP redirection check.

Packet Parsing

BPF programs must prove to the verifier that every pointer dereference is within packet bounds. The extract_ip_hdr() helper does this for Ethernet + IPv4:

static __always_inline struct iphdr *extract_ip_hdr(void *data, void *data_end)
{
  struct ethhdr *eth = data;
  if ((void *)(eth + 1) >= data_end)
    return NULL;

  if (eth->h_proto != bpf_htons(ETH_P_IP))
    return NULL;

  struct iphdr *iph = (void *)(eth + 1);
  if ((void *)(iph + 1) >= data_end)
    return NULL;

  return iph;
}

Each bounds check satisfies the verifier and also serves as a protocol filter — only IPv4 packets proceed.

Rewriting Headers

lookup_and_rewrite() is the core of the offload. It builds a 5-tuple key from the packet, looks up the rules map, and rewrites in place:

static __always_inline int lookup_and_rewrite(
  __u32 ifindex, struct iphdr *ip, struct udphdr *udp, void *data_end)
{
  struct flow_key key = {
    .ifindex = ifindex,
    .srcip   = ip->saddr,
    .dstip   = ip->daddr,
    .srcport = udp->source,
    .dstport = udp->dest,
  };

  struct flow_value *value =
    bpf_map_lookup_elem(&xdp_fwd_rules, &key);
  if (!value)
    return -1;

  /* Rewrite IP and UDP headers */
  ip->saddr    = value->rewrite.srcip;
  ip->daddr    = value->rewrite.dstip;
  udp->source  = value->rewrite.srcport;
  udp->dest    = value->rewrite.dstport;

  /* ... checksum fixup and counter updates ... */

  return value->rewrite.ifindex;
}

The function returns the target interface index on a hit (used by resolve_and_forward()), or -1 on a miss.

Incremental Checksum Update

After rewriting IP and UDP headers, the checksums must be recalculated. Rather than recomputing from scratch, we use bpf_csum_diff() for an incremental update — only the changed fields contribute to the new checksum:

/* L3 checksum: IP addresses changed */
__u32 l3_old[2] = { key.srcip, key.dstip };
__u32 l3_new[2] = { value->rewrite.srcip, value->rewrite.dstip };
__s64 l3_diff = bpf_csum_diff(l3_old, 8, l3_new, 8, 0);

/* L4 checksum: ports changed (only if UDP checksum is non-zero) */
if (udp->check != 0)
{
  __u32 l4_old = (key.dstport << 16) + key.srcport;
  __u32 l4_new = (value->rewrite.dstport << 16) + value->rewrite.srcport;

  __s64 l4_diff = bpf_csum_diff(&l4_old, 4, &l4_new, 4, l3_diff);
  udp->check = fold_csum((0xFFFF & ~udp->check) + l4_diff);
}

ip->check = fold_csum((0xFFFF & ~ip->check) + l3_diff);

The UDP checksum is optional for IPv4 (a zero value means “not computed”), so we skip it when the original checksum is zero. The fold_csum() helper folds the 64-bit intermediate back to 16 bits:

static __always_inline __u16 fold_csum(__u64 csum)
{
  int i;
#pragma unroll
  for (i = 0; i < 4; i++)
  {
    if (csum >> 16)
      csum = (csum & 0xffff) + (csum >> 16);
  }
  return ~csum;
}

FIB Forwarding

After rewriting, the packet has new IP addresses but stale MAC addresses. The bpf_fib_lookup() helper queries the kernel’s routing table (FIB) to resolve the next-hop MAC addresses:

static __always_inline int resolve_and_forward(
  struct xdp_md *ctx, void *data, struct iphdr *iph)
{
  struct bpf_fib_lookup s = { 0 };
  s.family      = 2;             /* AF_INET */
  s.tos         = iph->tos;
  s.l4_protocol = iph->protocol;
  s.tot_len     = bpf_ntohs(iph->tot_len);
  s.ifindex     = ctx->ingress_ifindex;
  s.ipv4_src    = iph->saddr;
  s.ipv4_dst    = iph->daddr;

  int ret = bpf_fib_lookup(ctx, &s, sizeof(s), 0);
  if (ret != 0)
    return XDP_DROP;

  struct ethhdr *eth = data;
  __builtin_memcpy(eth->h_dest, s.dmac, 6);
  __builtin_memcpy(eth->h_source, s.smac, 6);

  if (ctx->ingress_ifindex == s.ifindex)
    return XDP_TX;                /* hairpin: same interface */

  return bpf_redirect(s.ifindex, 0);  /* cross-interface forward */
}

One important detail: the flags parameter to bpf_fib_lookup() is set to 0 (not BPF_FIB_LOOKUP_OUTPUT). With BPF_FIB_LOOKUP_OUTPUT, the kernel constrains the lookup to the input interface, which prevents cross-interface forwarding. Without that flag, the FIB lookup can resolve routes through any interface — essential when ingress and egress use different NICs.

AF_XDP Redirection

When a packet doesn’t match a rewrite rule, it may still be destined for our application. The steer_to_dpdk() function checks the IP allowlist and redirects matching traffic to DPDK:

static __always_inline int steer_to_dpdk(
  struct iphdr *iph, struct xdp_md *ctx)
{
  struct ip_counter *val =
    bpf_map_lookup_elem(&xdp_local_ips, &iph->daddr);

  if (!val)
    return XDP_PASS;   /* not our IP, let the kernel handle it */

  __sync_fetch_and_add(&val->packets, 1);
  return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}

bpf_redirect_map() sends the packet to the AF_XDP socket for the appropriate RX queue. The third argument (XDP_PASS) is the fallback action if the queue has no socket — in that case the packet goes to the kernel stack.

Building the BPF Program

The BPF object file is compiled with clang targeting the BPF backend:

xdp_fwd.o: xdp_fwd.c xdp_maps.h xdp_common.h xdp_shared.h
    clang -O2 -target bpf -D__BPF__ $(INCLUDES) -c $< -o $@ -g

Key compiler flags:

  • -target bpf — emit BPF bytecode instead of native code
  • -O2 — required for the BPF verifier to accept the program (without optimization, the code often contains constructs the verifier rejects)
  • -D__BPF__ — activates kernel-side type definitions in the shared header
  • -g — includes debug info for bpftool introspection

How DPDK Loads the XDP Program

An important detail: you don’t load the XDP program yourself. DPDK’s AF_XDP PMD handles this automatically.

When you pass --vdev=net_af_xdp0,iface=eth0,xdp_prog=xdp_fwd.o to your DPDK application’s EAL arguments, the AF_XDP PMD:

  1. Opens the specified .o file and loads the XDP program onto the network interface
  2. Creates AF_XDP sockets (one per RX queue) and populates the xsks_map with their file descriptors
  3. Attaches the XDP program to the interface so it runs on every incoming packet

From your application’s perspective, DPDK starts receiving packets through rte_eth_rx_burst() as usual — but now the XDP program is running in front of it, and any BPF maps defined with LIBBPF_PIN_BY_NAME are automatically pinned to /sys/fs/bpf/.

This means the deployment workflow is straightforward:

  1. Compile the BPF program: make (produces xdp_fwd.o)
  2. Start your DPDK application with the AF_XDP vdev argument pointing to the .o file
  3. DPDK loads the program, creates sockets, pins maps — all transparently
  4. Your application opens the pinned maps and manages rules at runtime

No ip link set dev eth0 xdp obj ..., no bpftool prog load, no manual socket creation. The AF_XDP PMD is a single integration point that wires together the XDP program, the AF_XDP sockets, and the DPDK poll loop.
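Putting the pieces together, a launch might look like this (the binary name, core list, interface, queue count, and paths below are illustrative placeholders, not from the original setup):

```shell
# Hypothetical launch: 4 RX queues on eth0, XDP program loaded by the PMD.
./udp_proxy -l 0-3 --no-pci \
  --vdev=net_af_xdp0,iface=eth0,queue_count=4,xdp_prog=/opt/proxy/xdp_fwd.o \
  -- --app-specific-args
```

The `--` separates EAL arguments (consumed by DPDK) from your application’s own arguments.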

Userspace Control Plane

The DPDK application manages the BPF maps at runtime — adding rules when flows are established, deleting them on teardown, and reading stats back. All of this happens through libbpf’s map manipulation API on pinned file descriptors.

Detecting AF_XDP Mode

When DPDK initializes its Ethernet ports, you can query each port’s driver name. If it reports net_af_xdp, you know an XDP program is loaded on the interface and the BPF maps are available:

static int detect_afxdp(uint16_t port_id)
{
  struct rte_eth_dev_info dev_info;
  if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
    return 0;

  return strcmp(dev_info.driver_name, "net_af_xdp") == 0;
}

This check gates all offload logic — when DPDK uses a hardware PMD (e.g., mlx5, ixgbe), the BPF maps don’t exist and there’s nothing to manage.

Accessing Pinned Maps

BPF maps declared with LIBBPF_PIN_BY_NAME are pinned to /sys/fs/bpf/<map_name>. From userspace, bpf_obj_get() opens a pinned map and returns a file descriptor you can use with all bpf_map_* functions:

#include <bpf/bpf.h>

int rules_fd = bpf_obj_get("/sys/fs/bpf/xdp_fwd_rules");
if (rules_fd < 0) {
  fprintf(stderr, "Cannot open rules map: %s\n", strerror(errno));
  return -1;
}

/* Use rules_fd with bpf_map_update_elem / bpf_map_lookup_elem / etc. */

close(rules_fd);  /* close when done */

This is the bridge between kernel and userspace — the same hash table the XDP program reads at packet speed is directly writable from your DPDK application.

Installing a Rule

To offload a flow, build the match key and rewrite value, then insert into the rules map. Here’s an example that offloads a bidirectional UDP flow between a client (10.0.1.50:20000) and a server (10.0.2.1:40000), with the proxy listening on 10.0.0.100:30000:

#include <arpa/inet.h>

/* Ingress rule: client → proxy, rewrite to proxy → server */
struct flow_key ingress_key = {
  .ifindex = 3,                             /* rx interface */
  .srcip   = inet_addr("10.0.1.50"),
  .dstip   = inet_addr("10.0.0.100"),
  .srcport = htons(20000),
  .dstport = htons(30000),
};

struct flow_value ingress_val = {
  .rewrite = {
    .ifindex = 4,                           /* tx interface */
    .srcip   = inet_addr("10.0.0.100"),
    .dstip   = inet_addr("10.0.2.1"),
    .srcport = htons(30000),
    .dstport = htons(40000),
  },
};

bpf_map_update_elem(rules_fd, &ingress_key, &ingress_val, BPF_ANY);

/* Egress rule: server → proxy, rewrite to proxy → client */
struct flow_key egress_key = {
  .ifindex = 4,
  .srcip   = inet_addr("10.0.2.1"),
  .dstip   = inet_addr("10.0.0.100"),
  .srcport = htons(40000),
  .dstport = htons(30000),
};

struct flow_value egress_val = {
  .rewrite = {
    .ifindex = 3,
    .srcip   = inet_addr("10.0.0.100"),
    .dstip   = inet_addr("10.0.1.50"),
    .srcport = htons(30000),
    .dstport = htons(20000),
  },
};

bpf_map_update_elem(rules_fd, &egress_key, &egress_val, BPF_ANY);

Each bidirectional flow needs two rules — one per direction. When the XDP program matches the ingress key, it rewrites the packet and forwards it out interface 4 toward the server. The egress rule handles the return path.

Lazy Offload

A subtlety: you often can’t install rules immediately when a flow is created. If the client is behind NAT, you don’t know its real source IP/port until the first packet arrives. A practical pattern is to let the first packet pass through DPDK (via AF_XDP), learn the NAT-translated address from the packet headers, then install the XDP rule so all subsequent packets bypass userspace:

void on_first_packet(struct flow_entry *flow, struct rte_mbuf *pkt)
{
  struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(pkt,
    struct rte_ipv4_hdr *, sizeof(struct rte_ether_hdr));
  struct rte_udp_hdr *udp = (struct rte_udp_hdr *)((char *)ip + sizeof(*ip));

  /* Now we know the real client address (NAT resolved) */
  flow->client_real_ip   = ip->src_addr;
  flow->client_real_port = udp->src_port;

  /* Install both XDP offload rules */
  install_xdp_rules(rules_fd, flow);
  flow->offloaded = 1;
}

After this point, the XDP program handles the flow at kernel speed. DPDK only sees the first packet.

Deleting Rules

When a flow ends (session teardown, timeout, etc.), delete both rules from the map:

void teardown_flow(int rules_fd, struct flow_entry *flow)
{
  bpf_map_delete_elem(rules_fd, &flow->ingress_key);
  bpf_map_delete_elem(rules_fd, &flow->egress_key);
  flow->offloaded = 0;
}

If NAT information changes mid-flow (the client’s address shifts), delete the old rules and install new ones with the updated addresses.

Reading Per-Flow Stats

Since the XDP program updates counters inside the rule values, userspace can read them back at any time with a simple lookup:

struct flow_value val;
if (bpf_map_lookup_elem(rules_fd, &ingress_key, &val) == 0)
{
  printf("packets: %llu, bytes: %llu\n", val.packets, val.bytes);
}

A background timer (every few seconds) can iterate all offloaded flows and sync their BPF-side stats into whatever reporting structures your application uses.

Reading Per-CPU Stats

The xdp_counters map is a PERCPU_ARRAY — each bpf_map_lookup_elem call returns an array of values, one per CPU. Sum them for the global total:

int stats_fd = bpf_obj_get("/sys/fs/bpf/xdp_counters");
int num_cpus = sysconf(_SC_NPROCESSORS_CONF);
uint64_t percpu[num_cpus];

for (uint32_t key = 0; key < XDP_STAT_MAX; key++)
{
  uint64_t total = 0;
  if (bpf_map_lookup_elem(stats_fd, &key, percpu) == 0)
    for (int i = 0; i < num_cpus; i++)
      total += percpu[i];

  printf("stat[%u] = %llu\n", key, total);
}

IP Allowlist Management

The xdp_local_ips map controls which destination IPs get steered to AF_XDP. Populate it with every IP your application listens on:

int ips_fd = bpf_obj_get("/sys/fs/bpf/xdp_local_ips");

/* Add an IP to the allowlist */
uint32_t ip = inet_addr("10.0.0.100");
struct ip_counter val = { .packets = 0 };
bpf_map_update_elem(ips_fd, &ip, &val, BPF_ANY);

/* Remove an IP */
bpf_map_delete_elem(ips_fd, &ip);

When your set of served IPs changes, iterate the existing map entries, delete any that are no longer needed, and add the new ones. This ensures packets to removed IPs fall back to the kernel stack immediately.

Monitoring

The per-CPU stats map and per-rule counters enable real-time monitoring. Userspace sums the per-CPU values and formats a dashboard:

XDP datapath stats:
  Rewrite hits         1284923
  Rewrite misses       42
  FIB drops            0
  Hairpin TX           641200
  Redirect out         643723
  AF_XDP -> DPDK       38
  Pass to kernel       157

Per-rule detail includes live PPS and BPS:

Rule Detail
----------------------------------------
Match (Key):
  Interface:   ifindex 3
  Source:      10.0.1.50:20000
  Destination: 10.0.0.100:30000

Rewrite (Value):
  Interface:   ifindex 4
  Source:      10.0.2.1:40000
  Destination: 10.0.1.50:20000

Counters:
  Packets:     641200
  Bytes:       51296000
  PPS:         50
  BPS:         40000

Summary

The hybrid XDP + AF_XDP architecture gives us the best of both worlds:

  1. Write a BPF program that does fast-path packet rewriting using hash-map lookups
  2. Define shared structures in a header compiled by both clang -target bpf and your C compiler
  3. Let DPDK load the program — the AF_XDP PMD handles XDP attachment, socket creation, and map pinning
  4. Pin BPF maps to /sys/fs/bpf/ so your userspace application can manage rules at runtime
  5. Use the IP allowlist to steer traffic: matched IPs go to AF_XDP (DPDK), everything else to the kernel
  6. Offload lazily — let the first packet through DPDK to resolve NAT, then install XDP rules

The key insight is that XDP and AF_XDP are not competing technologies — they compose naturally. AF_XDP gives your DPDK application a kernel-bypass receive path, and XDP gives you a programmable fast path in front of it. By combining both, the kernel handles the steady-state data plane while DPDK handles the exceptions.

Learning VPP: Building DPDK with debug symbols


Overview

The goal is to build DPDK — an external package for VPP — with debug symbols, so that it is possible to debug inside the DPDK source code.

Version

VPP version is 23.02

Configuration

We need to modify two files:

  • build/external/deb/debian/rules
  • build/external/packages/dpdk.mk

dpdk.mk

The following flag has to be enabled.

DPDK_DEBUG ?= y

rules

The following lines have to be added.

override_dh_strip:
	dh_strip --exclude=librte

Rebuild

sudo dpkg -r vpp-ext-deps
make install-ext-dep
make rebuild

gdb

When running gdb we need to specify the path to DPDK sources.

set substitute-path '../src-dpdk/' '/home/projects/vpp/build/external/downloads/dpdk-22.07'



Learning DPDK: Traffic generator TRex

✅ Updated January 2026 — This guide has been reviewed and updated for the latest DPDK/VPP versions.


Overview

TRex is a stateful and stateless traffic generator based on DPDK. Its TCP stack implementation leverages the original BSD 4.4 code.

Setup

Ubuntu 18.04 server installed on a VirtualBox VM with two interfaces connected in a loopback, 4 CPUs, and 4 GB RAM.

Install

Download and build the latest TRex on Ubuntu 18.04.

sudo apt -y install zlib1g-dev build-essential python python3-distutils
git clone https://github.com/cisco-system-traffic-generator/trex-core.git
cd trex-core/
cd linux_dpdk
./b configure
./b build
cd ..
sudo cp scripts/cfg/simple_cfg.yaml /etc/trex_cfg.yaml

Find out the PCI IDs of the interfaces to be used by TRex.

lspci | grep Eth
00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)
00:08.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)
00:09.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)

trex_cfg.yaml

Edit the TRex config file, changing the PCI IDs to match your interfaces.

- port_limit : 2
  version : 2
  #List of interfaces. Change to suit your setup. Use ./dpdk_setup_ports.py -s to see available options
  interfaces : ["00:08.0","00:09.0"]
  port_info : # Port IPs. Change to suit your needs. In case of loopback, you can leave as is.
    - ip : 1.1.1.1
      default_gw : 2.2.2.2
    - ip : 2.2.2.2
      default_gw : 1.1.1.1

Run server

Run TRex in advanced stateful (ASTF) mode.

cd scripts/
sudo ./t-rex-64 -i --astf

Run console

Generate HTTP flows.

cd scripts/
./trex-console
trex> start -f astf/http_simple.py -m 1000 -d 1000 -l 1000
trex> tui

Traffic profile (http_simple.py)

from trex_astf_lib.api import *
class Prof1():
    def get_profile(self):
        # ip generator
        ip_gen_c = ASTFIPGenDist(ip_range=["10.10.10.0", "10.10.10.255"],
                                 distribution="seq")
        ip_gen_s = ASTFIPGenDist(ip_range=["20.20.20.0", "20.20.20.255"],
                                  distribution="seq")
        ip_gen = ASTFIPGen(glob=ASTFIPGenGlobal(ip_offset="1.0.0.0"),
                           dist_client=ip_gen_c,
                           dist_server=ip_gen_s)

        return ASTFProfile(default_ip_gen=ip_gen,
                            cap_list=[ASTFCapInfo(
                                      file="../avl/delay_10_http_browsing_0.pcap",
                                      cps=1)
                                     ])

def register():
    return Prof1()

Results

Monitor flow statistics by pressing the “Esc” and “t” keys in “tui” mode.



Learning DPDK: Understanding the Benefits of Huge Pages



Intro

Modern CPUs support multiple page sizes, e.g. 4K, 2M, and 1GB. In Linux, all page sizes except 4K are called “huge pages”. The naming convention is historical and stems from the fact that Linux originally supported only the 4K page size.

Larger page sizes benefit performance because fewer translations between virtual and physical addresses are needed, and the Translation Lookaside Buffer (TLB) cache is a scarce resource.


To check the TLB sizes, the cpuid utility can be used.

cpuid | grep -i tlb
cache and TLB information (2):
0x63: data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xb6: instruction TLB: 4K, 8-way, 128 entries
0xc3: L2 TLB: 4K/2M pages, 6-way, 1536 entries

To check the number of allocated huge pages the following command can be used.

cat /proc/meminfo | grep Huge
AnonHugePages: 4409344 kB
HugePages_Total: 32
HugePages_Free: 32
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB

There are two types of huge pages available in Linux:

  • Transparent (Anonymous) huge pages
  • Persistent huge pages

Transparent huge pages

Transparent huge pages are an abstraction layer that automates most aspects of creating, managing, and using huge pages. Because of past performance and stability issues with this mechanism, DPDK does not rely on it and uses persistent huge pages instead.

Persistent huge pages

Persistent huge pages have to be configured manually. Persistent huge pages are never swapped by the Linux kernel.

The following management interfaces exist in Linux to allocate persistent huge pages:

  • Shared memory using shmget()
  • HugeTLBFS, a RAM-based filesystem; mmap(), read(), or memfd_create() can be used to access its files
  • Anonymous mmap() with the MAP_ANONYMOUS and MAP_HUGETLB flags
  • libhugetlbfs APIs
  • Automatic backing of memory regions

Persistent huge pages are used by DPDK by default: mount points are discovered automatically, and pages are released once the application exits. If manual tuning is needed, the following EAL command line parameters can be used:

  • --huge-dir — use the specified hugetlbfs directory instead of autodetected ones
  • --huge-unlink — unlink huge page files after creating them (implies no secondary process support)
  • --in-memory — recent DPDK versions added this option to not rely on hugetlbfs at all

There are multiple ways to set up persistent huge pages:

  • At boot time
  • At runtime

At boot time

Modify the Linux boot parameters in /etc/default/grub. Huge pages will be spread equally between all NUMA sockets.

GRUB_CMDLINE_LINUX="default_hugepagesz=1G hugepagesz=1G hugepages=32"

Update the grub configuration file and reboot.

grub2-mkconfig -o /boot/grub2/grub.cfg
reboot

Create a folder for a permanent mount point of hugetlbfs.

mkdir /mnt/huge

Add the following line to the /etc/fstab file:

nodev /mnt/huge hugetlbfs defaults 0 0

At runtime

Update the number of huge pages for each NUMA node. The default huge page size cannot be modified at runtime.

echo 16 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo 16 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

Create a mount point.

mkdir /mnt/huge

Mount hugetlbfs.

mount -t hugetlbfs nodev /mnt/huge

Memory allocation

While there are many ways to allocate persistent huge pages, DPDK uses the following:

  • mmap() call with hugetlbfs mount point
  • mmap() call with MAP_HUGETLB flag
  • memfd_create() call with MFD_HUGETLB flag


Learning DPDK : Capture to Kafka



Requirements

The requirements are as follows.

  • To capture 10Gbps of 128B packets into Apache Kafka
  • Implement basic filtering using IP addresses
  • Save traffic into one Kafka topic

Hardware

Software

Solution

The following key design ideas helped to achieve a 5 Gbps capture speed:

  • Use the XFS filesystem
  • Combine small packets into big Kafka messages, 500 KB each
  • Run 4 Kafka brokers on one physical server simultaneously
  • Allocate 20 partitions per topic

Conclusion

The decision was made to use two servers in order to capture the full 10 Gbps of traffic.


Learning DPDK: Understanding Symmetric RSS for Efficient Packet Processing



Overview

Receive side scaling (RSS) is a technology that distributes received packets between multiple RX queues using a predefined hash function, enabling a multicore CPU to process packets from different queues on different cores.

The promise of symmetric RSS is to deliver both directions of the same TCP connection to the same RX queue. As a result, per-connection statistics can be kept in per-queue data structures, avoiding any need for locking.

System

Recently I had a chance to test symmetric RSS on two Intel NICs: the XL710 (40G) and the 82599 (10G).

The approach to configuring symmetric RSS on the XL710 differs from the standard DPDK approach: the i40e driver offers a specific API for this purpose.

DPDK 18.05.1 was used for testing.

82599 solution

To make symmetric RSS work, the default hash key has to be replaced with a custom one. A key built from the repeating 16-bit pattern 0x6D5A hashes both directions of a flow to the same value.


#define RSS_HASH_KEY_LENGTH 40

static uint8_t hash_key[RSS_HASH_KEY_LENGTH] = {
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
};

struct rte_eth_conf port_conf = {
    .rxmode = {
        .mq_mode = ETH_MQ_RX_RSS,
    },
    .rx_adv_conf = {
        .rss_conf = {
            .rss_key = hash_key,
            .rss_key_len = RSS_HASH_KEY_LENGTH,
            .rss_hf = ETH_RSS_IP |
                      ETH_RSS_TCP |
                      ETH_RSS_UDP |
                      ETH_RSS_SCTP,
        }
    },
};

rte_eth_dev_configure(port_id, rx_queue_num, tx_queue_num, &port_conf);

XL710 solution

To enable symmetric RSS, the i40e driver provides an API to set up hardware registers.


struct rte_eth_conf port_conf = {
    .rxmode = {
        .mq_mode = ETH_MQ_RX_RSS,
    },
    .rx_adv_conf = {
        .rss_conf = {
            .rss_hf = ETH_RSS_IP |
                      ETH_RSS_TCP |
                      ETH_RSS_UDP |
                      ETH_RSS_SCTP,
        }
    },
};

rte_eth_dev_configure(port_id, rx_queue_num, tx_queue_num, &port_conf);

int sym_hash_enable(int port_id, uint32_t ftype, enum rte_eth_hash_function function)
{
    struct rte_eth_hash_filter_info info;
    int ret = 0;
    uint32_t idx = 0;
    uint32_t offset = 0;

    memset(&info, 0, sizeof(info));

    ret = rte_eth_dev_filter_supported(port_id, RTE_ETH_FILTER_HASH);
    if (ret < 0) {
        DPDK_ERROR("RTE_ETH_FILTER_HASH not supported on port: %d",
                   port_id);
        return ret;
    }

    info.info_type = RTE_ETH_HASH_FILTER_GLOBAL_CONFIG;
    info.info.global_conf.hash_func = function;

    idx = ftype / UINT64_BIT;
    offset = ftype % UINT64_BIT;
    info.info.global_conf.valid_bit_mask[idx] |= (1ULL << offset);
    info.info.global_conf.sym_hash_enable_mask[idx] |= (1ULL << offset);

    ret = rte_eth_dev_filter_ctrl(port_id, RTE_ETH_FILTER_HASH,
                                  RTE_ETH_FILTER_SET, &info);
    if (ret < 0) {
        DPDK_ERROR("Cannot set global hash configurations "
                   "on port %u", port_id);
        return ret;
    }

    return 0;
}

int sym_hash_set(int port_id, int enable)
{
    int ret = 0;
    struct rte_eth_hash_filter_info info;

    memset(&info, 0, sizeof(info));

    ret = rte_eth_dev_filter_supported(port_id, RTE_ETH_FILTER_HASH);
    if (ret < 0) {
        DPDK_ERROR("RTE_ETH_FILTER_HASH not supported on port: %d",
                   port_id);
        return ret;
    }

    info.info_type = RTE_ETH_HASH_FILTER_SYM_HASH_ENA_PER_PORT;
    info.info.enable = enable;

    ret = rte_eth_dev_filter_ctrl(port_id, RTE_ETH_FILTER_HASH,
                                  RTE_ETH_FILTER_SET, &info);
    if (ret < 0) {
        DPDK_ERROR("Cannot set symmetric hash enable per port "
                   "on port %u", port_id);
        return ret;
    }

    return 0;
}

sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_TCP, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_UDP, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_enable(port_id, RTE_ETH_FLOW_FRAG_IPV4, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_SCTP, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_OTHER, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_set(port_id, 1);


Learning DPDK: NUMA optimization


Overview

To get the maximum performance on a NUMA system, the underlying architecture has to be taken into account.

To spot problems in your data design, there is a handy tool called perf c2c, where C2C stands for Cache To Cache. The tool's output provides statistics about accesses to data on a remote NUMA socket.

Run

Record PMU counters.

perf c2c record -F 99 -g -- binary

Analyze in interactive mode.
perf c2c report

Analyze in text mode.
perf c2c report --stdio

For example, the summary in text mode could look as follows.
=================================================
Trace Event Information
=================================================
Total records : 5621889
Locked Load/Store Operations : 10032
Load Operations : 741529
Loads - uncacheable : 7
Loads - IO : 0
Loads - Miss : 8299
Loads - no mapping : 18
Load Fill Buffer Hit : 533018
Load L1D hit : 109495
Load L2D hit : 4337
Load LLC hit : 61245
Load Local HITM : 9673
Load Remote HITM : 12528
Load Remote HIT : 780
Load Local DRAM : 4593
Load Remote DRAM : 7209
Load MESI State Exclusive : 11802
Load MESI State Shared : 0
Load LLC Misses : 25110
LLC Misses to Local DRAM : 18.3%
LLC Misses to Remote DRAM : 28.7%
LLC Misses to Remote cache (HIT) : 3.1%
LLC Misses to Remote cache (HITM) : 49.9%
Store Operations : 4880360
Store - uncacheable : 0
Store - no mapping : 178126
Store L1D Hit : 4696772
Store L1D Miss : 5462
No Page Map Rejects : 1095
Unable to parse data source : 0
=================================================
Global Shared Cache Line Event Information
=================================================
Total Shared Cache Lines : 10898
Load HITs on shared lines : 88830
Fill Buffer Hits on shared lines : 39884
L1D hits on shared lines : 8717
L2D hits on shared lines : 86
LLC hits on shared lines : 25798
Locked Access on shared lines : 5336
Store HITs on shared lines : 5953
Store L1D hits on shared lines : 5633
Total Merged records : 28154


Learning DPDK: Understanding the Benefits of Inlining for Function Optimization


Overview

Inlining can help mitigate the following:

  1. Function call overhead;
  2. Pipeline stall.

It is advised to apply inlining to the following types of routines:

  1. Trivial and small functions used as accessors to data or wrappers around another function;
  2. Big functions called quite regularly but not from many places.

Solution

A modern compiler uses heuristics to decide which functions to inline, but it is always better to give it a hint using the following keywords.

static inline

To force inlining instead of leaving the decision to the gcc compiler, the following attribute should be used.

__attribute__((always_inline))


Learning DPDK: Cloud Support for AWS and VMware Environments


Overview

DPDK-based products fit perfectly into the NFV paradigm. DPDK provides drivers for cloud NICs, so applications can run in AWS and VMware environments.

Limitations

The following nuances were discovered when using DPDK on the VMware and Amazon platforms.

VMXNET 3 driver

Both RX and TX queues have to be configured on the device; otherwise, DPDK initialization crashes.

ENA driver

The maximum number of buffer descriptors for an RX queue is 512.


Learning DPDK: Branch Prediction


Overview

It is well known that modern CPUs are built around instruction pipelines that let them execute multiple instructions in parallel. Conditional branches in the program code interrupt this flow, so CPUs use speculative execution and branch prediction to guess which branch will be taken and execute it ahead of time. The problem is that on a wrong guess the speculative results have to be discarded, and the correct instructions have to be fetched and executed on the spot.

Solution

An application developer should use the likely and unlikely macros, which are shortcuts for gcc's __builtin_expect directive. These macros give the compiler a hint about which path is taken more often, decreasing the percentage of branch prediction misses.


Learning DPDK: Avoid False Sharing


Overview

It is convenient to store thread-specific data, for instance statistics, inside an array of structures whose size equals the number of threads.

The only thing you need to be careful about is avoiding so-called false sharing: a performance penalty paid when read-write data shares the same cache line and is accessed from multiple threads.

Solution

Align a structure accessed by each thread to the cache line size (64 bytes) using the macro __rte_cache_aligned, which is a shortcut for __attribute__((__aligned__(64))).

typedef struct counter_s
{
    uint64_t packets;
    uint64_t bytes;
    uint64_t failed_packets;
    uint64_t failed_bytes;
    uint64_t pad[4];
} counter_t __rte_cache_aligned;

Define an array of the structures with one element per thread.
counter_t stats[THREADS_NUM];

Note that if the structure size is smaller than the cache line size, padding is required; otherwise the gcc compiler complains with the following error.

error: alignment of array elements is greater than element size


Learning DPDK: make your data cache friendly with pahole tool


Overview

Given the orders of magnitude difference in access speed between the cache levels and RAM itself, it is advisable to analyze frequently used C data structures for cache friendliness. The idea is to keep the most often accessed ("hot") data in a higher-level cache as long as possible. The following techniques are used.

  1. Group "hot" members together at the beginning and push "cold" ones to the end;
  2. Minimize structure size by avoiding padding;
  3. Align data to the cache line size.

You can find a great description of why and how the data structures are laid out by compilers here.

Poke-a-hole (pahole) analyzes an object file and outputs a detailed description of each and every structure layout created by the compiler.

Run

Analyze the file.
pahole a.out
Analyze one structure.
pahole a.out -C structure
Get suggestion on improvements.
pahole --show_reorg_steps --reorganize -C structure a.out


Learning DPDK: Profiling with Flame Graphs


Overview

perf is a great tool for profiling an application. The problem is that it generates an enormous amount of text that is difficult to analyze. Brendan Gregg developed a set of handy scripts to visualize perf results.

These tools generate a graph that represents call stacks and a relative execution time of each function.

Generate a graph

git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -ag -- sleep 60
perf script | ./stackcollapse-perf.pl > out.perf-folded
cat out.perf-folded | ./flamegraph.pl > perf.svg

Analyze

  1. Open the graph in a browser;
  2. Point to a bar to see execution statistics;
  3. Click a bar to zoom;
  4. Use search (Ctrl-F).


Learning DPDK : DPI with Hyperscan


Why

To know which application generates the monitored traffic, it is not enough to know the IP address and TCP port; a look inside the HTTP header is required.

How

The HTTP header is analyzed against a collection of strings. Each string is associated with some protocol, like Facebook, Google Chat, etc.

Complications

String search is a slow operation; to be fast it has to leverage smart algorithms and hardware optimization techniques.

Solution

A regex library called Hyperscan. You can listen to an introduction to the library here. The speed of the library was evaluated here.

Integration

Install binary prerequisites

yum install ragel libstdc++-static

Download Hyperscan sources

wget https://github.com/intel/hyperscan/archive/v4.7.0.tar.gz
tar -xf v4.7.0.tar.gz

Download boost headers

wget https://dl.bintray.com/boostorg/release/1.67.0/source/boost_1_67_0.tar.gz
tar -xf boost_1_67_0.tar.gz
cp -r boost_1_67_0/boost hyperscan-4.7.0/include

Build and install Hyperscan shared library

Just follow the instructions from here.
cd hyperscan-4.7.0
mkdir build
cd build
cmake -DBUILD_SHARED_LIBS=true ..
make
make install

Link DPDK app against Hyperscan

Modify Makefile as follows.
CFLAGS += -I/usr/local/include/hs/
LDFLAGS += -lhs

Build a database from a list of strings

Use hs_compile_multi() with the array of strings that you need to grep. To escape a string, use the \Q and \E symbols from PCRE syntax.

Search

Use the hs_scan() API. Check the simplegrep example for more details.


Learning DPDK: Implementing Java Support for Improved Performance and Flexibility


Overview

The DPDK framework is written in C so that it is fast and can use hardware optimization techniques. But software is written in many languages, and one of the most popular is Java.

So we had a project whose goal was to develop a packet-capturing Java library. To marry DPDK with Java, we chose JNI.

Building blocks

We chose the following approach to create a library that can be linked into a Java application.

  1. Build DPDK as a set of dynamic libraries.
    You need to enable CONFIG_RTE_BUILD_SHARED_LIB in the configuration.
  2. Generate C headers using JNI.
  3. Build your own dynamic library using the DPDK build system.
    You need to include rte.extshared.mk in the library Makefile.

Communication between DPDK and Java

There are two directions of communication, i.e. from the Java application to DPDK and the opposite.

Here you need to follow the JNI guidelines, with the following exceptions.

  1. Do not use a DPDK native thread for communication with Java; create a dedicated thread using pthread instead. Otherwise, we observed a crash.
  2. Use Java static methods. It is not clear why, but we could not use regular Java methods.

 



Learning DPDK : Packet capturing


A new project has the goal of capturing 40G traffic on a specified schedule.

Why

To analyze

  • security breaches,
  • misbehaviours, or
  • faulty appliances,

it is extremely useful to have the traffic fully recorded.

What

  • You can record the whole Ethernet packet.
  • You can trim its payload in case only headers are important for later analysis.
  • You can filter the traffic based on IP address and TCP/UDP port.

How

  • First, capture the traffic into the RAM.
  • Second, store it on disk.

Complications

  • Average SSD disk speed is about 500 MB/s
  • SATA 3.0 speed is 6Gb/s

Solution

It looks like the solution could be one of the following, or both:

  • RAID
  • PCIe-attached high-speed SSD


Learning DPDK : Deep Packet Inspection 40/100G


Currently I am working on a project whose goal is to analyze application protocols, which results in functionality similar to a stateful firewall.

The main challenges:

  1. Parsing on line-rate
  2. Storing state

The following hardware parts were chosen:

  1. FPGA based smart NIC supported by DPDK
  2. Powerful multicore Intel server with lots of RAM

To sustain high speed and enable scalability, the main DPDK application design principles have to be respected:

  1. Avoid thread synchronization
  2. Avoid heavy operations such as memcpy, etc.
  3. Take NUMA into consideration

The design of the resulting application is presented in the diagram below. Its cornerstone is the ability of the NIC to distribute bidirectional flows between multiple CPUs, i.e. Receive Side Scaling.

As a result, packets from one flow are always handled by the same CPU core. Furthermore, each thread is using its own dedicated data structures, e.g. hash table, without a need for synchronization.

XDR design

The flow entry is allocated from a memory pool, and a hash table is used to quickly locate the flow entry by a key tuple consisting of the source/destination IP addresses and TCP/UDP ports.

In the next article, I am going to describe the specifics of the cards I plan to test, i.e. Napatech and Netcope.



Learning DPDK: Algorithms for Line Speed Switching and Routing


The DPDK library contains many features that help an application achieve line-speed switching/routing. To name a few:
1. Poll-mode drivers (recently enhanced with interrupt support)
2. Hardware optimizations (SSE, AVX2, etc.)
3. OS optimizations (huge pages, etc.)
4. Lookup algorithms (Hash, LPM, etc.)

The following slides give an incomplete overview of the packet lookup algorithms supported in the current version of DPDK.

Good books on the topic are:
High Performance Switches and Routers

Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices



Learning DPDK: Setting Up Vagrant VM for DPDK Development


It is very convenient to develop a DPDK-based solution using one or multiple VMs. My tools of choice for this task are Vagrant and VirtualBox; this pair makes VM deployment really simple and easily reproducible.

The following slides give a short overview of the vagrant usage scenario.

My vagrant configuration file looks as follows.


# -*- mode: ruby -*-
# vi: set ft=ruby :
# Vagrantfile API/syntax version. Don't touch unless you know what you're doing!
VAGRANTFILE_API_VERSION = "2"
Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "trusty64"
  config.vm.provision :shell, path: "../bootstrap.sh"
  config.vm.network "public_network", bridge: "br-1", ip: "192.168.56.1"
  config.vm.network "public_network", bridge: "br-2", ip: "192.168.57.1"
  config.vm.network "public_network", bridge: "br-3", ip: "192.168.58.1"
  config.vm.provider :virtualbox do |vb|
    # # Don't boot with headless mode
    # vb.gui = true
    #
    # # Use VBoxManage to customize the VM. For example to change memory:
    vb.customize ["modifyvm", :id, "--memory", "1024"]
    vb.customize ["modifyvm", :id, "--cpuexecutioncap", "50"]
    vb.customize ["modifyvm", :id, "--ioapic", "on"]
    vb.customize ["modifyvm", :id, "--cpus", "4"]
  end
end

My bootstrap.sh file contains the following.

#!/bin/bash
apt-get -y install git libtool autoconf g++ gdb


Learning DPDK: Understanding KNI Interface for Kernel Network Communication


KNI (Kernel Network Interface) is an approach that is used in DPDK to connect user space applications with the kernel network stack.

The following slides present the concept using a number of functional block diagrams.

The code we are interested in is located in the following places.

  • Sample KNI application
    • example/kni
  • KNI kernel module
    • lib/librte_eal/linuxapp/kni
  • KNI library
    • lib/librte_kni

To begin testing KNI we need to build DPDK libraries first

git clone git://dpdk.org/dpdk
export RTE_SDK=~/dpdk/
make config T=x86_64-native-linuxapp-gcc O=x86_64-native-linuxapp-gcc
cd x86_64-native-linuxapp-gcc
make

Then we need to compile KNI sample application

cd ${RTE_SDK}/examples/kni
export RTE_TARGET=x86_64-native-linuxapp-gcc
make

To run the above application we need to load KNI kernel module

insmod ${RTE_SDK}/${RTE_TARGET}/kmod/rte_kni.ko

The following kernel module options are available in case a loopback mode is required.

  • kthread_mode=single/multiple – number of kernel threads
  • lo_mode=lo_mode_fifo/lo_mode_fifo_skb – loopback mode

Enable enough huge pages

mkdir -p /mnt/huge
mount -t hugetlbfs nodev /mnt/huge
echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

Load the UIO kernel module and bind network interfaces to it. Note that you will not be able to bind an interface if there is any route associated with it.

modprobe uio_pci_generic
${RTE_SDK}/tools/dpdk_nic_bind.py --status
${RTE_SDK}/tools/dpdk_nic_bind.py --bind=uio_pci_generic eth1
${RTE_SDK}/tools/dpdk_nic_bind.py --bind=uio_pci_generic eth2
${RTE_SDK}/tools/dpdk_nic_bind.py --status

On a PC/VM with four cores we can run the KNI application using the following commands.

export LD_LIBRARY_PATH=${RTE_SDK}/${RTE_TARGET}/lib/
${RTE_SDK}/examples/kni/build/kni -c 0x0f -n 4 -- -P -p 0x3 --config="(0,0,1),(1,2,3)"

Where:

  • -c = core bitmask
  • -P = promiscuous mode
  • -p = port hex bitmask
  • --config="(port, lcore_rx, lcore_tx [,lcore_kthread, …]) …"

Note that each core can do either TX or RX for one port only.

You can use the following script to setup and run KNI test application.


#!/bin/sh
# setup path to DPDK
export RTE_SDK=/home/dpdk
export RTE_TARGET=x86_64-native-linuxapp-gcc
# setup 512 huge pages
mkdir -p /mnt/huge
umount -t hugetlbfs nodev /mnt/huge
mount -t hugetlbfs nodev /mnt/huge
echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
# bind eth1 and eth2 to Linux generic UIO
modprobe uio_pci_generic
${RTE_SDK}/tools/dpdk_nic_bind.py --bind=uio_pci_generic eth1
${RTE_SDK}/tools/dpdk_nic_bind.py --bind=uio_pci_generic eth2
# insert KNI kernel driver
insmod ${RTE_SDK}/${RTE_TARGET}/kmod/rte_kni.ko
# start KNI sample application
export LD_LIBRARY_PATH=${RTE_SDK}/${RTE_TARGET}/lib/
${RTE_SDK}/examples/kni/build/kni -c 0x0f -n 4 -- -P -p 0x3 --config="(0,0,1),(1,2,3)"

Let's assign IP addresses to the KNI interfaces

sudo ifconfig vEth0 192.168.56.100
sudo ifconfig vEth1 192.168.56.101

Now we are set to test the application. To see statistics, we need to send the SIGUSR1 signal.

watch -n 10 sudo pkill -10 kni


Learning DPDK: 10G traffic generator


DPDK was successfully used in a project to create an Ostinato-based traffic generator.

The following performance was achieved on 1G NIC.
Ostinato performance report for 1G link

The following performance was achieved on 10G NIC.
Ostinato performance report for 10G link

The code for this project is available on Github:
https://github.com/PLVision/ostinato-dpdk

The library responsible for DPDK is located on GitHub as well:
https://github.com/PLVision/dpdkadapter



Deep Dive into DPDK: Understanding Performance Boosting Techniques


Learning DPDK : Tips and Tricks


Overview

Here I will present some nuances of using the DPDK libraries that I learned while implementing a custom DPDK-based application.

These tips could save some time for a developer building an independent application on top of the DPDK libraries.

Linking DPDK library

By default, the DPDK sources are built into a set of libraries, but there is an option to pack all of them into one dynamic or static library. To achieve this, modify the file "config/common_linuxapp" by assigning "y" to the "CONFIG_RTE_BUILD_COMBINE_LIBS" define.

Besides that, the following options are required in the Makefile.

-include /.../rte_config.h
-D__STDC_LIMIT_MACROS
-DRTE_MAX_LCORE=64
-DRTE_PKTMBUF_HEADROOM=128
-DRTE_MAX_ETHPORTS=32
-DRTE_MACHINE_CPUFLAG_SSE
-DRTE_MACHINE_CPUFLAG_SSE2
-DRTE_MACHINE_CPUFLAG_SSE3
-DRTE_MACHINE_CPUFLAG_SSSE3
-DRTE_COMPILE_TIME_CPUFLAGS=RTE_CPUFLAG_SSE,RTE_CPUFLAG_SSE2,RTE_CPUFLAG_SSE3,RTE_CPUFLAG_SSSE3

Drop unprocessed traffic

As I discovered, it is important to set up the DPDK RX queues to drop received packets if they are not processed by the application; otherwise both RX and TX are compromised.

To enable this behavior, rte_eth_rx_queue_setup has to be provided with an rte_eth_rxconf structure whose rx_drop_en field is set to 1.

This issue may not be present on all NICs, only on some of them.

Mbuf reference count

If a specific packet has to be sent many times to the same interface, it is impractical and quite inefficient to copy its contents before each transmit. Such a copy operation dramatically hurts datapath throughput, not only because it costs memory cycles but, more importantly, because it prevents efficient cache use.

To implement a zero-copy mechanism, DPDK provides a reference counter for each memory buffer (mbuf) used to store a packet.

The function rte_pktmbuf_refcnt_update can be used to increment the reference counter before each send invocation. In this scenario, the memory buffer is not released to the memory pool after the packet is sent out the port, so the same buffer can be used again later.

Jumbo frames

Jumbo frames are Ethernet frames larger than the regular 1500-byte frame; usually their size does not exceed 9000 bytes.

To let a DPDK application receive such big frames, the rte_eth_conf.rxmode.jumbo_frame field has to be set to 1 and rte_eth_conf.rxmode.max_rx_pkt_len has to be set to the maximum supported frame size.

The rte_eth_conf structure is then passed to rte_eth_dev_configure, which configures a specific interface.

Besides that, it has to be noted that such frames are handled using multi-segment memory buffers (mbufs). The following mbuf fields support segmentation:

  • nb_segs – the first segment has to specify the overall number of segments in the chain
  • data_len – the payload length of a particular segment
  • pkt_len – the total payload length across all segments in the chain
  • next – the pointer to the next segment, or null otherwise

Barriers for synchronization

When working on multiple cores, it is important to be able to send a synchronization signal from one core to another. For this purpose, a simple flag (or array of flags) can be used, accompanied by a read/write memory barrier, i.e. rte_mb, which ensures that all threads see an up-to-date view of memory.
