Conquering the Linux Network Stack: Decapsulating Packets with eBPF at 6 Mpps+
When Selectel's transition to VXLAN threatened to break their traffic analysis system, they turned to eBPF to decapsulate packets at line rate. This article walks through the Linux network stack internals, eBPF hook points, and a production solution handling 6 million packets per second.
Introduction
When the transition to VXLAN in cloud networks threatened to disrupt the traffic analysis system, a solution was needed that would preserve accurate statistics collection under extreme loads with a changed packet header structure. This is a case study from Alexander Shishebarov, a senior developer on Selectel's cloud network functions team, about how eBPF helped solve the problem without significant architectural changes.
Selectel's Network Architecture
The cloud runs servers with hypervisors that launch client virtual machines. To provide internet access, a network is organized between hypervisors and physical routers using a Clos topology. Initially, connectivity was arranged via VLAN technology.
Problems with VLAN
VLAN is an extension of the Ethernet protocol that adds a special tag to headers. While the technology was simple, as the cloud expanded to hundreds and thousands of compute nodes, problems intensified.
At this scale, switches had to learn the MAC addresses of every virtual machine, which slowed them down and made it difficult to maintain a stretched VLAN across the entire physical network.
Transition to VXLAN
VXLAN creates overlay networks on top of physical ones using UDP tunnels. Outer headers are prepended to the client frame: a new L2 (Ethernet), L3 (IP), and L4 (UDP, destination port 4789) header, plus the VXLAN header itself. Instead of flooding packets across the entire network, VXLAN builds tunnels directly from source to destination.
The Traffic Analysis Problem
After deploying VXLAN in staging, a serious problem was discovered: the traffic reported by the collection system dropped from 30 to 15 megabytes per second, even though the actual traffic volume had not changed.
Collection System Architecture
Service nodes have four dual-port 40-gigabit network cards in promiscuous mode. The Linux kernel intercepts packets via netfilter (iptables) and passes them to the ipt_NETFLOW module, which collects statistics using the NetFlow protocol developed by Cisco.
The module intercepts packets in the raw table of the PREROUTING chain and exports statistics to the analysis system.
Root Cause of the Performance Drop
After the addition of external VXLAN headers, the ipt_NETFLOW module began missing packets because it could not recognize client traffic underneath the new encapsulation. The module was looking at the outer headers — which belonged to the VXLAN tunnel itself — rather than the inner headers containing the actual client traffic information.
A solution was required that would:
- Support the new traffic format
- Be developed quickly (work was already scheduled)
- Preserve the current analysis stack without adding hardware
Choosing the Solution: eBPF
Several approaches were considered:
- Kernel bypass (DPDK, Netmap) — fast, but complex to implement and requires returning packets to the kernel for further processing by ipt_NETFLOW
- Modifying ipt_NETFLOW — the code was an unreadable 6,000 lines that caused kernel panics; it would require forking and constant maintenance
- eBPF — built into the kernel, protected from errors by the verifier, high-performance, and the author had prior experience with the technology
How eBPF Works
eBPF is a virtual machine inside the Linux kernel with its own instruction set resembling x86 assembly and virtual registers. It allows running user code from userspace in the kernel context with safety verification.
Network Hooks
Three main options for packet interception:
- XDP (eXpress Data Path) — intercepts ingress packets; available in three modes:
  - Offload — runs on the network card itself
  - Native — runs inside the driver
  - Generic — runs in the Linux stack
- Traffic Control (TC) — intercepts via the TC subsystem, both ingress and egress
- Socket-level handlers — intercept at the socket level (not suitable for this use case)
How the Linux Kernel Processes Packets
Hardware Interrupts
The network card copies packets via DMA into RAM and creates descriptors. An interrupt is raised to the processor, and the driver's interrupt handler is activated.
NAPI (Deferred Interrupts)
NAPI disables hardware interrupts and runs the driver's polling function under a packet budget and time limit. The function collects as many packets as possible, creates sk_buff structures, and passes them into the kernel stack.
At this stage, interception via XDP in driver mode is possible.
Processing in the Kernel Stack
The function netif_receive_skb_core handles the following stages:
- Packet taps — handlers (tcpdump, ipt_NETFLOW) at the Ethernet level
- Linux TC — L3 analysis, handler lookup
- Netfilter (iptables) — hook traversal (PREROUTING, INPUT, etc.) around the routing decision
- L4 handlers — TCP/UDP analysis, socket lookup
Key insight: XDP Generic intercepts packets BEFORE packet taps, where ipt_NETFLOW operates. TC intercepts AFTER packet taps, so it would not solve the problem. This made XDP Generic the correct hook point — the eBPF program would strip the VXLAN headers before the packet reached ipt_NETFLOW.
eBPF Development
Kernel-Space Limitations (C Limited)
- No infinite loops
- No dynamic memory allocation (stack only, limited to 512 bytes)
- No access outside the context bounds
- All code branches must be reachable
- Maximum 1 million eBPF instructions
- Program must always terminate
Development Cycle
- Write code in a subset of C
- Compile with Clang to ELF format
- Load into the kernel via the bpf system call
- Verifier checks for safety
- JIT compilation (optional)
- Attach to a network interface via BPF_LINK_CREATE
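The cycle above maps onto a handful of standard toolchain invocations. A sketch with assumed file and interface names (`decap.c`, `eth0`); the attach step requires root:

```shell
# Compile the restricted-C source to an eBPF ELF object
clang -O2 -g -target bpf -c decap.c -o decap.o

# Attach in generic (skb) mode via iproute2; section name assumed to be "xdp"
ip link set dev eth0 xdpgeneric obj decap.o sec xdp

# Inspect the loaded program and verify it is attached
bpftool prog show
```

In production one would normally drive these steps from a loader built on one of the libraries below rather than from the shell.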
Libraries
- libbpf — the primary library (C)
- BCC — Python bindings
- Cilium ebpf — Go bindings
- libbpfgo — Go wrapper around libbpf
- Aya — Rust library
Implementing the Decapsulation
Packet Analysis Process
- Find the L2 offset — parse dynamic VLAN headers
- Check L3 — filter service traffic by IP address
- Check L4 — look for UDP on port 4789
- Decapsulate — use the bpf_xdp_adjust_head helper to shift the packet start past the outer headers
After stripping the outer headers, the original client packet is passed into the kernel stack, where ipt_NETFLOW intercepts it and correctly reads the inner headers to collect accurate traffic statistics.
Results
Code optimization: 205 lines of kernel-space code (down from an initial 900+). All complex logic was moved to userspace.
Implementation: The Cilium eBPF library for Go was used for program management and state monitoring.
Performance:
- Synthetic tests (TRex): a 7–10% increase in CPU time spent on IRQ handling
- Production: increase of approximately 3%
- A single server handles 6 Mpps (35–38 Gbit/s) at roughly 20% CPU load
Stability: One year in production without failures.
Conclusions and Recommendations
"Move as much complex logic as possible to userspace, leaving only basic functionality in kernel-space" — for high-load systems, additional computation in the kernel significantly increases CPU load.
XDP in driver mode "takes away" CPU time from the polling function, reducing the number of collected packets and increasing IRQ processing overhead.
eBPF is not a "silver bullet": there are numerous limitations, especially the requirement to stay within 1 million instructions. Program logic must be as concise as possible. But for the right use case — fast, safe, in-kernel packet manipulation — it is an exceptionally powerful tool.