Don't Look at CPU Usage % with Hyper-Threading

CPU usage percentage is a misleading metric when hyper-threading is involved. Through controlled assembly-level benchmarks, this article demonstrates why Instructions Per Cycle (IPC) is the proper way to measure processor utilization.

We all know that doubling the number of logical cores through hyper-threading doesn't generally double performance. But why exactly, and what should we look at instead? This article dives deep into processor microarchitecture with controlled assembly-language experiments to show why the CPU usage percentage in your task manager is essentially lying to you.

Test Environment

All experiments were conducted on a Xeon processor with 4 physical cores and hyper-threading enabled (8 logical cores). The base clock is 3.5 GHz with turbo boost up to 3.6 GHz.

Synthetic Multiplication Test

The core test uses Euro Assembler to perform 1.2 billion multiplication operations:

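mov r8, 1_200_000_000   ; loop counter: 1.2 billion iterations
.loop:
    imul r10, r10       ; each multiply depends on the previous result
    dec r8              ; count down
    jnz .loop           ; repeat until r8 reaches zero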

Results:

One copy on a single core: ~3.5 billion ticks (1 second)
Four copies on four physical cores: each runs independently with no degradation
Eight copies on eight logical cores: performance does not degrade, IPC stays at ~1.0

Why Does This Work? Pipeline Utilization

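The key insight is understanding pipeline utilization. The imul instruction has a latency of 3 clock cycles, and each imul r10, r10 needs the result of the previous one, so a loop iteration cannot complete in fewer than 3 cycles. The loop contains 3 instructions (multiplication, decrement, and conditional jump, though dec and jnz are fused into one micro-op via macro-fusion), so the core retires 3 instructions every 3 cycles, which is the IPC of ~1.0 seen in the results. This means the pipeline is only ~33% utilized during multiplication.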

Hyper-threading allows a second logical thread to fill the remaining ~67% of the pipeline that would otherwise sit idle. This is why running two threads on the same physical core doesn't degrade either thread's performance in this particular case.
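
The timing of the single-multiplication loop can be checked with a little arithmetic. The sketch below assumes the loop is bound by the 3-cycle imul dependency chain and that the core runs at the 3.6 GHz turbo clock while RDTSC ticks at the 3.5 GHz base clock. The constant and function names are illustrative and do not come from the original test code.

```python
# Back-of-the-envelope model of the single-imul loop (assumed values, not measurements).
ITERATIONS = 1_200_000_000   # loop count from the test
IMUL_LATENCY = 3             # cycles; each imul waits for the previous result
INSTRUCTIONS_PER_ITER = 3    # imul, dec, jnz
CORE_HZ = 3.6e9              # assumed turbo clock while the loop runs
TSC_HZ = 3.5e9               # RDTSC ticks at the base clock


def single_imul_loop():
    core_cycles = ITERATIONS * IMUL_LATENCY
    seconds = core_cycles / CORE_HZ
    tsc_ticks = seconds * TSC_HZ
    ipc = INSTRUCTIONS_PER_ITER / IMUL_LATENCY
    return core_cycles, seconds, tsc_ticks, ipc


if __name__ == "__main__":
    cycles, seconds, ticks, ipc = single_imul_loop()
    print(f"core cycles: {cycles:.2e}")
    print(f"seconds:     {seconds:.2f}")
    print(f"TSC ticks:   {ticks:.2e}")
    print(f"IPC:         {ipc:.2f}")
```

Under these assumptions the model gives 3.6 billion core cycles, one second of wall time, and 3.5 billion TSC ticks, which agrees with the single-core result reported above.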

Impact of Instruction Count

Testing with 2-4 multiplications per loop iteration reveals the saturation point:

2 multiplications: IPC = 1.33 (4 instructions / 3 cycles)
3 multiplications: IPC = 1.67 (5 instructions / 3 cycles)
4 multiplications: 4.6-4.7 billion ticks (up from ~3.5 billion); the pipeline is fully saturated

When the pipeline is fully loaded, hyper-threading provides no benefit — in fact, the second thread actively competes for resources and slows down the first.
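
The saturation point follows from a simple model, sketched here under two assumptions the text does not state outright: the multiplications within one iteration are independent of each other (each is its own 3-cycle chain), and the core can start at most one imul per cycle. The function names and the 3.6 GHz core clock are likewise assumptions made for illustration.

```python
# Simplified loop model: an iteration costs the larger of the imul latency
# and the time the multiplier needs to start all k multiplications.
IMUL_LATENCY = 3             # cycles
IMULS_PER_CYCLE = 1          # assumed multiplier throughput
ITERATIONS = 1_200_000_000
CORE_HZ = 3.6e9              # assumed turbo clock
TSC_HZ = 3.5e9               # RDTSC base clock


def cycles_per_iteration(k):
    return max(IMUL_LATENCY, k / IMULS_PER_CYCLE)


def ipc(k):
    # k multiplications plus dec and jnz
    return (k + 2) / cycles_per_iteration(k)


def tsc_ticks(k):
    core_cycles = ITERATIONS * cycles_per_iteration(k)
    return core_cycles / CORE_HZ * TSC_HZ


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        print(f"{k} imul: IPC {ipc(k):.2f}, {tsc_ticks(k):.2e} ticks")
```

With four multiplications the model predicts 4 cycles per iteration and roughly 4.67 billion ticks, in line with the measured 4.6-4.7 billion. Real cores have more limits than this model captures, so treat it as a sanity check rather than a simulator.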

Real-World Test: SHA-256

Moving from synthetic benchmarks to a real cryptographic function implemented in pure assembly:

Single core: 155-156 MB/s, IPC = 2.78
Two logical cores on one physical core: 83 MB/s each (166 MB/s total, only a ~7% gain from hyper-threading)

The practical gain from hyper-threading in a real-world, compute-intensive task is just a few percent. The SHA-256 algorithm already keeps the pipeline busy with an IPC of 2.78, leaving little room for a second thread.
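
The gain can be computed directly from the two measurements. This is a minimal sketch; the helper name ht_gain is made up for illustration, and 155.5 MB/s is simply the midpoint of the reported 155-156 MB/s range.

```python
def ht_gain(single_thread_rate, per_thread_rate_with_ht, threads=2):
    """Relative throughput gain from running `threads` siblings on one core."""
    combined = per_thread_rate_with_ht * threads
    return combined / single_thread_rate - 1.0


if __name__ == "__main__":
    gain = ht_gain(155.5, 83.0)
    print(f"combined: {83.0 * 2:.0f} MB/s, gain: {gain:.1%}")
```

For the SHA-256 numbers above this works out to a little under 7%.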

Cache Effects

Two memory access tests reveal cache contention problems:

Sequential read of 32K: ~40 thousand increments (good cache behavior)
Strided read of 32K from a 1GB array: significantly slower due to cache misses

When both tests run simultaneously on the same physical core, both slow down significantly because the two logical cores share that core's L1 and L2 caches (as well as the chip-wide L3). Threads accessing different memory regions on the same physical core evict each other's data, and this cache contention degrades both threads' performance.

Turbo Boost Effects

On a 6-core processor (not the 4-core test machine described above), turbo boost frequency decreases as more cores become active:

1 thread: 160 MB/s (maximum turbo)
2 threads: 145 MB/s
4 threads: 138 MB/s
6 threads: 130 MB/s

The clock frequency drops by several hundred MHz as load increases across cores. This means that even without hyper-threading, adding more threads doesn't scale linearly because each thread runs at a lower frequency.
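
To see how far this falls short of linear scaling, the figures can be turned into totals and an efficiency ratio. The sketch assumes each figure in the list is the rate of a single thread rather than the combined rate; the names are illustrative.

```python
# Throughput figures from the list above, in MB/s.
# Assumption: each figure is the rate of one thread, not the total.
PER_THREAD_MBPS = {1: 160, 2: 145, 4: 138, 6: 130}


def scaling_table(per_thread):
    base = per_thread[1]
    rows = []
    for threads, rate in sorted(per_thread.items()):
        total = threads * rate
        efficiency = total / (threads * base)   # 1.0 would be linear scaling
        rows.append((threads, total, efficiency))
    return rows


if __name__ == "__main__":
    for threads, total, efficiency in scaling_table(PER_THREAD_MBPS):
        print(f"{threads} threads: {total} MB/s total, {efficiency:.0%} of linear")
```

Under that assumption, six threads deliver 780 MB/s in total, about 81% of what six threads at the single-thread rate would achieve.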

Latency vs. Throughput

Understanding the distinction between latency and throughput is critical:

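The imul instruction has a latency of 3 cycles, but its throughput is one per cycle: a new imul can be issued every cycle as long as it does not depend on a result that is still in flight
The processor uses macro-fusion to combine dec + jnz into a single micro-op, so the loop overhead costs one micro-op instead of two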
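
The difference shows up clearly when counting cycles for a run of multiplications. The sketch below is an idealized pipeline model using the 3-cycle latency and one-per-cycle issue rate quoted above; the function names are hypothetical, and it ignores every other limit of a real core.

```python
def dependent_chain_cycles(n, latency=3):
    # Each operation needs the previous result, so the latencies add up.
    return n * latency


def independent_ops_cycles(n, latency=3, issue_interval=1):
    # One operation starts every `issue_interval` cycles; the last one
    # still needs `latency` cycles to produce its result.
    if n == 0:
        return 0
    return (n - 1) * issue_interval + latency


if __name__ == "__main__":
    n = 100
    print("dependent:  ", dependent_chain_cycles(n), "cycles")
    print("independent:", independent_ops_cycles(n), "cycles")
```

In this model a chain of dependent multiplications is limited by latency, while independent ones approach the throughput limit of one per cycle.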

Measurement Methodology

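All timing uses RDTSC (Time Stamp Counter), which on modern Intel processors ticks at the base frequency of 3.5 GHz regardless of the current turbo clock, so one tick is not necessarily one core cycle. RDTSCP additionally returns the IA32_TSC_AUX value in ECX, which the operating system normally sets to identify the logical processor; this helps verify thread affinity.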

Power Consumption

Power consumption is not a reliable corrective metric either. Energy usage barely changes when adding hyper-threaded load to a core — the transistors are switching regardless. This means you can't use power draw as a proxy for "real" utilization.

Critical Conclusions

CPU usage percentage does not reflect actual utilization. At 50% load, pipelines may only be 33% filled.
The correct metric is IPC (Instructions Per Cycle). You can measure this using Intel Performance Counter Monitor.
Hyper-threading only helps when pipelines are underutilized. If pipelines are fully loaded, the second logical thread slows down the first.
Power draw is not a good corrective metric. Energy consumption barely changes when adding hyper-threaded load.
In real applications, the gain is typically 5-10%, not the doubling of performance that the doubled core count might suggest.
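
However the counters are read (Intel Performance Counter Monitor or another tool), IPC itself is just a ratio of two counter deltas. The helper below is a hypothetical sketch that takes two (instructions retired, core cycles) readings as plain numbers; it does not call any real counter API.

```python
def ipc_from_samples(before, after):
    """Compute IPC from two (instructions_retired, core_cycles) readings."""
    d_instructions = after[0] - before[0]
    d_cycles = after[1] - before[1]
    if d_cycles <= 0:
        raise ValueError("cycle counter did not advance")
    return d_instructions / d_cycles


if __name__ == "__main__":
    # Hypothetical readings shaped like the single-imul loop:
    # 3 instructions per iteration, 3 cycles per iteration.
    before = (0, 0)
    after = (3_600_000_000, 3_600_000_000)
    print(f"IPC: {ipc_from_samples(before, after):.2f}")
```

A value near 1.0 on a core that Task Manager reports as fully busy is the situation described above: the logical core is occupied, but most of the pipeline is idle.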

Practical Recommendations

Distribute threads that read different memory regions across different physical cores
Evaluate real utilization through IPC, not percentages in Task Manager
Account for turbo boost reduction under full load across all cores
Remember that hyper-threading is a compromise for underutilized pipelines
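
To place threads on different physical cores, you first need to know which logical cores are siblings. The sketch below assumes a Linux system, where each logical CPU exposes its siblings in the sysfs file topology/thread_siblings_list; on Windows the same information has to come from the operating system's own topology APIs, which are not shown here. The helper names are made up for illustration.

```python
import glob


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,4' or '0-1' into a sorted tuple."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return tuple(sorted(cpus))


def physical_cores(sibling_lists):
    """Group logical CPUs into physical cores from their sibling-list strings."""
    cores = {parse_cpu_list(text) for text in sibling_lists}
    cores.discard(())            # ignore empty entries
    return sorted(cores)


def one_cpu_per_core(sibling_lists):
    """Pick the first logical CPU of every physical core."""
    return [core[0] for core in physical_cores(sibling_lists)]


def read_sibling_lists():
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    lists = []
    for path in glob.glob(pattern):
        with open(path) as handle:
            lists.append(handle.read())
    return lists


if __name__ == "__main__":
    print("one logical CPU per physical core:", one_cpu_per_core(read_sibling_lists()))
```

The resulting list can then be used as an affinity mask, for example with os.sched_setaffinity on Linux, so that memory-heavy threads do not end up sharing one core's caches.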