# Leap Year Check in Three CPU Instructions
Through clever use of bit manipulation and the Z3 theorem prover, a standard leap year check can be reduced from a branchy sequence of modulo operations to a single branchless expression executing in just three CPU instructions. The approach achieves a 3.8× speedup over the conventional implementation on random input.
The standard leap year check is a familiar piece of code that appears in almost every programming language's standard library. It involves three divisibility tests — by 4, by 100, and by 400 — typically implemented as a short chain of conditionals. It is correct, readable, and completely unoptimized. This article explores how far you can push this calculation, ultimately arriving at a three-instruction branchless solution found through automated constraint solving.
## The Standard Implementation
The proleptic Gregorian calendar rules for leap years are:
- A year divisible by 4 is a leap year...
- ...unless it is also divisible by 100...
- ...unless it is also divisible by 400, in which case it is a leap year again
```c
bool is_leap_year(uint32_t y) {
    if ((y % 4) != 0) return false;
    if ((y % 100) != 0) return true;
    if ((y % 400) == 0) return true;
    return false;
}
```
This compiles to a chain of branches and, absent compiler strength reduction, three division operations. On modern hardware, division is expensive, and branches are expensive when they are unpredictable. Let's do better.
## Step 1: Simplifying the Modulo Operations
The first insight is that certain divisibility relationships allow us to replace expensive modulo-100 and modulo-400 checks with cheaper alternatives:
- `y % 100 == 0` is equivalent to `y % 25 == 0` when we already know `y % 4 == 0`, because 100 = 4 × 25 and a number divisible by both 4 and 25 is divisible by 100
- `y % 400 == 0` is equivalent to `y % 16 == 0` once we additionally know `y % 25 == 0`, since 400 = 16 × 25
Checking divisibility by 25 can be done with a multiplication trick. The standard compiler technique for turning `x % n == 0` into a multiply-and-compare (for odd n) uses the modular inverse of n: x is divisible by n if and only if `x * n⁻¹`, computed with 32-bit wraparound, is at most ⌊(2³² − 1)/n⌋. Multiplication by the inverse is a bijection mod 2³² that sends the multiples of n to 0, 1, 2, …, so every non-multiple must land above that threshold. For n = 25, the inverse is
⌊2³² × 19/25⌋ + 1 = 3264175145
and the comparison threshold is ⌊(2³² − 1)/25⌋ = 171798691.
Divisibility by 16 (and by 4) reduces to simple bitmask checks since 16 is a power of two.
This gives us a second version:
```c
bool is_leap_year2(uint32_t y) {
    if ((y & 3) != 0) return false;                 // not divisible by 4
    if (y * 3264175145u > 171798691u) return true;  // not divisible by 25, so not by 100
    if ((y & 15) == 0) return true;                 // divisible by 16, so by 400
    return false;
}
```
## Step 2: A Branchless Variant
We can also write a branch-free single-expression version:
```c
bool is_leap_year3(uint32_t y) {
    return !(y & ((y % 25) ? 3 : 15));
}
```
This selects between masking with 3 (checking divisibility by 4) and masking with 15 (checking divisibility by 16) based on whether the year is a multiple of 25. It is compact, and compilers typically lower the ternary to a conditional move rather than a branch; but the select still has to wait for the modulo-by-25 result, so the compiled code remains several instructions longer than the three we are aiming for.
## Step 3: The Three-Instruction Solution
The quest for a true three-instruction solution means finding constants f, m, and t such that:
```c
bool is_leap_year_fast(uint32_t y) {
    return ((y * f) & m) <= t;
}
```
which compiles to: multiply, bitwise AND, compare-and-set. Finding those constants by hand is infeasible. Instead, the author fed the problem to the Z3 theorem prover using bit-vector constraints — essentially asking Z3 to find values of f, m, t such that the expression is true for all leap years and false for all non-leap years in the range 0–102,499. After approximately 30 minutes of computation, Z3 returned:
- f = 1073750999
- m = 3221352463
- t = 126976
Yielding the final form:
```c
bool is_leap_year_fast(uint32_t y) {
    return ((y * 1073750999u) & 3221352463u) <= 126976u;
}
```
## How Does It Actually Work?
The constants are not arbitrary magic numbers — they encode all three divisibility tests into the bit structure of a single 32-bit integer. Examining the binary representations of f and m reveals four distinct bit regions:
- Region A (low 2 bits): Detects non-divisibility by 4 — if any bit here is set after multiplication, the year fails the first test
- Region B (bits 2–16): Identifies years not divisible by 100 — these bits carry the fractional part of the modulo-100 test
- Region C (bits 17–19): Checks divisibility by 16, which combined with divisibility by 25 implies divisibility by 400
- Region D (remaining bits): the constant satisfies f & D ≈ 9175, which approximates 2¹⁷ × 7/100 = 9175.04, enabling detection of multiples of 100
The threshold t = 126976 acts as the combined boundary condition for all four regions simultaneously.
## Performance Benchmarks
Tested on an Intel Core i7-8700K (Coffee Lake, 4.7 GHz) with g++ -O3:
| Function | Fixed year (ns) | Random years (ns) |
|---|---|---|
| is_leap_year | 0.65 | 2.61 |
| is_leap_year2 | 0.65 | 2.75 |
| is_leap_year3 | 0.67 | 0.88 |
| is_leap_year_fast | 0.69 | 0.69 |
On a fixed (predictable) input, branch prediction eliminates most of the cost of the branchy versions and all implementations are roughly equivalent. On random input — which is the realistic case when checking years from a database or user input — the three-instruction version achieves a 3.8× speedup, dropping from 2.61 ns to 0.69 ns per call.
## Extending to 64-bit
The same approach works for 64-bit integers: equivalent constants exist that correctly handle all years from 0 to 5,965,232,499, found and verified with Z3 as discussed in a Code Golf Stack Exchange thread. The 64-bit version follows the same three-instruction structure.
## When Does This Actually Matter?
The author is candid about practical applicability. If you are checking the current year to format a date string, branch prediction will have the conditional version running at maximum speed anyway. The benefit materializes in high-throughput scenarios with unpredictable data: batch processing of date fields across millions of database records, time-series analysis, or any inner loop where the year is an unstructured input. In those cases, eliminating branch mispredictions at 3.8× the throughput is meaningful.
The deeper lesson is about methodology: Z3 and similar SMT solvers can find optimal bit-manipulation constants that no human would derive by hand, opening up a class of micro-optimizations that are correct by construction and verified exhaustively over the full input range.