Data Structures in Practice. Chapter 1: The Performance Gap
A practical exploration of why theoretical algorithmic complexity (Big O notation) fails to predict real-world performance, demonstrating how CPU cache behavior makes arrays 2.55x faster than linked lists despite identical O(n) complexity.
Introduction
When we talk about data structure performance, we almost always think in terms of Big O notation. Hash table? O(1). Binary search? O(log n). Case closed, right?
Not quite. In this article, I'll show you why theoretical complexity is only part of the story — and why hardware behavior, particularly CPU cache, can completely overturn our algorithmic expectations.
The Central Problem
While working on a RISC-V bootloader for device configuration lookup, I replaced a hash table (O(1) lookup) with binary search on a sorted array (O(log n)). Counter-intuitively, the theoretically slower approach ran 40% faster.
Profiling revealed the reason: cache behavior trumped algorithmic complexity.
- Hash table: 71.5% cache miss rate
- Binary search: 21.1% cache miss rate
Each cache miss cost approximately 100 CPU cycles, making memory access patterns far more critical than operation count.
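The sorted-array approach can be sketched in a few lines of C. This is a minimal illustration, not the actual bootloader code; the entry layout, field names, and function name are assumptions:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device-config entry; the real bootloader's layout
 * may differ. The table must be sorted by device_id. */
struct config_entry {
    uint32_t device_id;   /* sort key */
    uint32_t config;      /* payload */
};

/* Binary search over a sorted, contiguous array: O(log n) comparisons,
 * but every probe lands inside one small, cache-friendly region. */
const struct config_entry *
lookup_config(const struct config_entry *table, size_t n, uint32_t id)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].device_id < id)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < n && table[lo].device_id == id)
        return &table[lo];
    return NULL;   /* not found */
}
```

Because the whole table sits in a handful of consecutive cache lines, repeated lookups keep it hot in L1, whereas a hash table's probes scatter across buckets.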
The Memory Hierarchy
To understand why this happens, we need to look at how modern processors access memory. Here are typical latencies:
| Operation | Relative Cost |
|---|---|
| Register access | 1x |
| L1 cache hit | 3-4x |
| L2 cache hit | 12x |
| DRAM access | 100-200x |
A single DRAM access costs as much as roughly 100 register operations. This means that a theoretically "faster" algorithm that causes frequent cache misses can be dramatically slower than a "slower" algorithm with good cache locality.
Practical Benchmark: Array vs. Linked List
To illustrate this concretely, let's compare summing 100,000 integers stored in an array versus a linked list. Both operations have identical O(n) complexity.
The results are striking:
- Array: ~70,147 ns average
- Linked list: ~179,169 ns average
- Arrays are 2.55x faster
Despite identical algorithmic complexity, the array version is over 2.5 times faster. Why?
Spatial Locality
Arrays store elements contiguously in memory. When the CPU accesses array[0], it loads an entire 64-byte cache line, which holds 16 consecutive 4-byte integers. The next 15 accesses hit the already-loaded line at L1 speed.
Since 15 of every 16 sequential accesses hit the cache (93.75%), arrays achieve roughly 94% cache hit rates for sequential access.
Cache Lines: The Fundamental Unit
The CPU doesn't fetch individual bytes from memory — it fetches 64-byte cache lines. This is the fundamental unit of data transfer between memory levels. When your data is scattered across memory (as with linked list nodes), each access potentially requires a full cache line fetch for just one useful value. The rest of the 64 bytes are wasted bandwidth.
Linked List Locality Problems
Linked list nodes are allocated individually on the heap, scattered across memory. Each node->next pointer dereference can point to a completely different memory region, causing approximately 70% cache misses. The CPU's prefetcher, which excels at predicting sequential access patterns, is helpless against random pointer chasing.
Embedded Systems Constraints
This matters even more in embedded systems. Microcontrollers typically feature:
- 16-32 KB L1 caches (versus 32-64 KB in desktop processors)
- No L3 cache at all
- Strict real-time requirements
Every byte counts when your cache is measured in kilobytes rather than megabytes. Choosing the wrong data structure in an embedded context doesn't just mean slower code — it can mean missed deadlines and system failures.
The Series Ahead
This article is the first in a five-part series:
- Part I: Foundational concepts — memory hierarchy and benchmarking (this article)
- Part II: Core data structures — arrays, lists, hash tables, heaps
- Part III: Trees — BSTs, B-trees, tries
- Part IV: Advanced topics — lock-free structures, strings, graphs
- Part V: Real-world examples — bootloaders, device drivers, firmware
Conclusion
"Hardware beat algorithm." Theoretical superiority means nothing without considering actual machine constraints. Big O notation tells you how an algorithm scales, but it says nothing about the constant factors — and when those constant factors differ by 100x due to cache behavior, they dominate everything.
The takeaway is simple: measure, don't assume. Profile your code on actual hardware. Understand your memory access patterns. And never blindly trust Big O notation to predict real-world performance.