Conquering the Linux Network Stack: Decapsulating Packets with eBPF at 6 Mpps+
When Selectel's transition to VXLAN threatened to break their traffic analysis system, they turned to eBPF to decapsulate packets at line rate. This article walks through the Linux network stack internals, eBPF hook points, and a production solution handling 6 million packets per second.
Introduction
When the transition to VXLAN in cloud networks threatened to disrupt the traffic analysis system, a solution was needed that would preserve accurate statistics collection under extreme loads with a changed packet header structure. This is a case study from Alexander Shishebarov, a senior developer on Selectel's cloud network functions team, about how eBPF helped solve the problem without significant architectural changes.
Selectel's Network Architecture
The cloud runs servers with hypervisors that launch client virtual machines. To provide internet access, a network is organized between hypervisors and physical routers using a Clos topology. Initially, connectivity was arranged via VLAN technology.
Problems with VLAN
VLAN is an extension of the Ethernet protocol that adds a special tag to headers. While the technology was simple, as the cloud expanded to hundreds and thousands of compute nodes, problems intensified.
At this scale, switches had to learn the MAC addresses of every virtual machine, which slowed them down and made it difficult to maintain a stretched VLAN across the entire physical network.
Transition to VXLAN
VXLAN creates overlay networks on top of physical ones using UDP tunnels. Outer headers are prepended to the client frame: a new L2 (Ethernet), L3 (IP), and L4 (UDP, destination port 4789) header, plus the VXLAN header itself. Instead of flooding packets across the entire network, VXLAN builds tunnels directly from source to destination.
The Traffic Analysis Problem
After deploying VXLAN in staging, a serious problem was discovered: the traffic reported by the collection system dropped from 30 to 15 megabytes per second, even though the actual traffic volume had not changed.
Collection System Architecture
Service nodes have four dual-port 40-gigabit network cards in promiscuous mode. The Linux kernel intercepts packets via netfilter (iptables) and passes them to the ipt_NETFLOW module, which collects statistics using the NetFlow protocol developed by Cisco.
The module intercepts packets in the raw table of the PREROUTING chain and exports statistics to the analysis system.
Root Cause of the Performance Drop
After the addition of external VXLAN headers, the ipt_NETFLOW module began missing packets because it could not recognize client traffic underneath the new encapsulation. The module was looking at the outer headers — which belonged to the VXLAN tunnel itself — rather than the inner headers containing the actual client traffic information.
A solution was required that would:
- Support the new traffic format
- Be developed quickly (work was already scheduled)
- Preserve the current analysis stack without adding hardware
Choosing the Solution: eBPF
Several approaches were considered:
- Kernel bypass (DPDK, Netmap) — fast, but complex to implement and requires returning packets to the kernel for further processing by ipt_NETFLOW
- Modifying ipt_NETFLOW — the code was an unreadable 6,000 lines that caused kernel panics; it would require forking and constant maintenance
- eBPF — built into the kernel, protected from errors by the verifier, high-performance, and the author had prior experience with the technology
How eBPF Works
eBPF is a virtual machine inside the Linux kernel with its own instruction set resembling x86 assembly and virtual registers. It allows running user code from userspace in the kernel context with safety verification.
Network Hooks
Three main options for packet interception:
- XDP (eXpress Data Path) — intercepts ingress packets; available in three modes:
  - Offload — runs on the network card itself
  - Native — runs inside the driver
  - Generic — runs in the Linux stack
- Traffic Control (TC) — intercepts via the TC subsystem, both ingress and egress
- Socket-level handlers — intercept at the socket level (not suitable for this use case)
How the Linux Kernel Processes Packets
Hardware Interrupts
The network card copies packets via DMA into RAM and creates descriptors. An interrupt is raised to the processor, and the driver's interrupt handler is activated.
NAPI (Deferred Interrupts)
NAPI disables hardware interrupts and runs the driver's polling function under a packet budget and time limit. The function collects as many packets as possible, creates sk_buff structures, and passes them into the kernel stack.
At this stage, interception via XDP in driver mode is possible.
Processing in the Kernel Stack
The function netif_receive_skb_core handles the following stages:
- Packet taps — handlers (tcpdump, ipt_NETFLOW) at the Ethernet level
- Linux TC — L3 analysis, handler lookup
- Netfilter (iptables) — hook traversal (PREROUTING, INPUT, etc.) around the routing decision
- L4 handlers — TCP/UDP analysis, socket lookup
Key insight: XDP Generic intercepts packets BEFORE packet taps, where ipt_NETFLOW operates. TC intercepts AFTER packet taps, so it would not solve the problem. This made XDP Generic the correct hook point — the eBPF program would strip the VXLAN headers before the packet reached ipt_NETFLOW.
eBPF Development
Kernel-Space Limitations (C Limited)
- No infinite loops
- No dynamic memory allocation (stack only, limited to 512 bytes)
- No access outside the context bounds
- All code branches must be reachable
- Maximum 1 million eBPF instructions
- Program must always terminate
Development Cycle
- Write code in a subset of C
- Compile with Clang to ELF format
- Load into the kernel via the bpf system call
- Verifier checks for safety
- JIT compilation (optional)
- Attach to a network interface via BPF_LINK_CREATE
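The cycle above maps onto a handful of standard toolchain invocations. A sketch with assumed file and interface names (`decap.c`, `eth0`); the attach step requires root:

```shell
# Compile the restricted-C source to an eBPF ELF object
clang -O2 -g -target bpf -c decap.c -o decap.o

# Attach in generic (skb) mode via iproute2; section name assumed to be "xdp"
ip link set dev eth0 xdpgeneric obj decap.o sec xdp

# Inspect the loaded program and verify it is attached
bpftool prog show
```

In production one would normally drive these steps from a loader built on one of the libraries below rather than from the shell.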
Libraries
- libbpf — the primary library (C)
- BCC — Python bindings
- Cilium ebpf — Go bindings
- libbpfgo — Go wrapper around libbpf
- Aya — Rust library
Implementing the Decapsulation
Packet Analysis Process
- Find the L2 offset — parse dynamic VLAN headers
- Check L3 — filter service traffic by IP address
- Check L4 — look for UDP on port 4789
- Decapsulate — use the bpf_xdp_adjust_head helper to shift the packet start past the outer headers
After stripping the outer headers, the original client packet is passed into the kernel stack, where ipt_NETFLOW intercepts it and correctly reads the inner headers to collect accurate traffic statistics.
Results
Code optimization: 205 lines of kernel-space code (down from an initial 900+). All complex logic was moved to userspace.
Implementation: The Cilium eBPF library for Go was used for program management and state monitoring.
Performance:
- Synthetic tests (TRex): a 7–10% increase in CPU time spent on IRQ handling
- Production: increase of approximately 3%
- A single server handles 6 Mpps (35–38 Gbit/s) at roughly 20% CPU load
Stability: One year in production without failures.
Conclusions and Recommendations
"Move as much complex logic as possible to userspace, leaving only basic functionality in kernel-space" — for high-load systems, additional computation in the kernel significantly increases CPU load.
XDP in driver mode "takes away" CPU time from the polling function, reducing the number of collected packets and increasing IRQ processing overhead.
eBPF is not a "silver bullet": there are numerous limitations, especially the requirement to stay within 1 million instructions. Program logic must be as concise as possible. But for the right use case — fast, safe, in-kernel packet manipulation — it is an exceptionally powerful tool.