Data Structures in Practice. Chapter 1: The Performance Gap
A practical exploration of why theoretical algorithmic complexity (Big O notation) fails to predict real-world performance, demonstrating how CPU cache behavior makes arrays 2.55x faster than linked lists despite identical O(n) complexity.
Introduction
When we talk about data structure performance, we almost always think in terms of Big O notation. Hash table? O(1). Binary search? O(log n). Case closed, right?
Not quite. In this article, I'll show you why theoretical complexity is only part of the story — and why hardware behavior, particularly CPU cache, can completely overturn our algorithmic expectations.
The Central Problem
While working on a RISC-V bootloader for device configuration lookup, I replaced a hash table (O(1) lookup) with binary search on a sorted array (O(log n)). Counter-intuitively, the theoretically slower approach ran 40% faster.
Profiling revealed the reason: cache behavior trumped algorithmic complexity.
- Hash table: 71.5% cache miss rate
- Binary search: 21.1% cache miss rate
Each cache miss cost approximately 100 CPU cycles, making memory access patterns far more critical than operation count.
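The sorted-array approach can be sketched in a few lines of C. This is a minimal illustration, not the actual bootloader code; the entry layout, field names, and function name are assumptions:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device-config entry; the real bootloader's layout
 * may differ. The table must be sorted by device_id. */
struct config_entry {
    uint32_t device_id;   /* sort key */
    uint32_t config;      /* payload */
};

/* Binary search over a sorted, contiguous array: O(log n) comparisons,
 * but every probe lands inside one small, cache-friendly region. */
const struct config_entry *
lookup_config(const struct config_entry *table, size_t n, uint32_t id)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].device_id < id)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < n && table[lo].device_id == id)
        return &table[lo];
    return NULL;   /* not found */
}
```

Because the whole table sits in a handful of consecutive cache lines, repeated lookups keep it hot in L1, whereas a hash table's probes scatter across buckets.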
The Memory Hierarchy
To understand why this happens, we need to look at how modern processors access memory. Here are typical latencies:
| Operation | Relative Cost |
|---|---|
| Register access | 1x |
| L1 cache hit | 3-4x |
| L2 cache hit | 12x |
| DRAM access | 100-200x |
A single DRAM access costs as much as roughly 100 register operations. This means that a theoretically "faster" algorithm that causes frequent cache misses can be dramatically slower than a "slower" algorithm with good cache locality.
Practical Benchmark: Array vs. Linked List
To illustrate this concretely, let's compare summing 100,000 integers stored in an array versus a linked list. Both operations have identical O(n) complexity.
The results are striking:
- Array: ~70,147 ns average
- Linked list: ~179,169 ns average
- Arrays are 2.55x faster
Despite identical algorithmic complexity, the array version is over 2.5 times faster. Why?
Spatial Locality
Arrays store elements contiguously in memory. When the CPU accesses array[0], it loads an entire 64-byte cache line, which holds 16 consecutive 4-byte integers. The next 15 accesses hit the already-loaded line at L1 speed.
Since 15 of every 16 sequential accesses hit the cache (93.75%), arrays achieve roughly 94% cache hit rates for sequential access.
Cache Lines: The Fundamental Unit
The CPU doesn't fetch individual bytes from memory — it fetches 64-byte cache lines. This is the fundamental unit of data transfer between memory levels. When your data is scattered across memory (as with linked list nodes), each access potentially requires a full cache line fetch for just one useful value. The rest of the 64 bytes are wasted bandwidth.
Linked List Locality Problems
Linked list nodes are allocated individually on the heap, scattered across memory. Each node->next pointer dereference can point to a completely different memory region, causing approximately 70% cache misses. The CPU's prefetcher, which excels at predicting sequential access patterns, is helpless against random pointer chasing.
Embedded Systems Constraints
This matters even more in embedded systems. Microcontrollers typically feature:
- 16-32 KB L1 caches (versus 32-64 KB in desktop processors)
- No L3 cache at all
- Strict real-time requirements
Every byte counts when your cache is measured in kilobytes rather than megabytes. Choosing the wrong data structure in an embedded context doesn't just mean slower code — it can mean missed deadlines and system failures.
The Series Ahead
This article is the first in a five-part series:
- Part I: Foundational concepts — memory hierarchy and benchmarking (this article)
- Part II: Core data structures — arrays, lists, hash tables, heaps
- Part III: Trees — BSTs, B-trees, tries
- Part IV: Advanced topics — lock-free structures, strings, graphs
- Part V: Real-world examples — bootloaders, device drivers, firmware
Conclusion
"Hardware beat algorithm." Theoretical superiority means nothing without considering actual machine constraints. Big O notation tells you how an algorithm scales, but it says nothing about the constant factors — and when those constant factors differ by 100x due to cache behavior, they dominate everything.
The takeaway is simple: measure, don't assume. Profile your code on actual hardware. Understand your memory access patterns. And never blindly trust Big O notation to predict real-world performance.