# Leap Year Check in Three CPU Instructions
Through clever use of bit manipulation and the Z3 theorem prover, a standard leap year check can be reduced from a branchy sequence of modulo operations to a single branchless expression executing in just three CPU instructions. The approach achieves a 3.8× speedup over the conventional implementation on random input.
The standard leap year check is a familiar piece of code that appears in almost every programming language's standard library. It involves three divisibility tests — by 4, by 100, and by 400 — typically implemented as a short chain of conditionals. It is correct, readable, and completely unoptimized. This article explores how far you can push this calculation, ultimately arriving at a three-instruction branchless solution found through automated constraint solving.
## The Standard Implementation
The proleptic Gregorian calendar rules for leap years are:
- A year divisible by 4 is a leap year...
- ...unless it is also divisible by 100...
- ...unless it is also divisible by 400, in which case it is a leap year again
```c
bool is_leap_year(uint32_t y) {
    if ((y % 4) != 0) return false;
    if ((y % 100) != 0) return true;
    if ((y % 400) == 0) return true;
    return false;
}
```
This compiles to a chain of branches and, absent compiler strength reduction, three division operations. On modern hardware, division is expensive, and branches are expensive when they are unpredictable. Let's do better.
## Step 1: Simplifying the Modulo Operations
The first insight is that certain divisibility relationships allow us to replace expensive modulo-100 and modulo-400 checks with cheaper alternatives:
- `y % 100 == 0` is equivalent to `y % 25 == 0` when we already know `y % 4 == 0`, because 100 = 4 × 25 and a number divisible by both 4 and 25 is divisible by 100
- `y % 400 == 0` is equivalent to `y % 16 == 0` once we additionally know `y % 25 == 0`, since 400 = 16 × 25
Checking divisibility by 25 can be done with a multiplication trick. The standard compiler technique for turning `x % n == 0` into a multiply-and-compare (for odd n) uses the modular inverse of n: x is divisible by n if and only if `x * n⁻¹`, computed with 32-bit wraparound, is at most ⌊(2³² − 1)/n⌋. Multiplication by the inverse is a bijection mod 2³² that sends the multiples of n to 0, 1, 2, …, so every non-multiple must land above that threshold. For n = 25, the inverse is
⌊2³² × 19/25⌋ + 1 = 3264175145
and the comparison threshold is ⌊(2³² − 1)/25⌋ = 171798691.
Divisibility by 16 (and by 4) reduces to simple bitmask checks since 16 is a power of two.
This gives us a second version:
```c
bool is_leap_year2(uint32_t y) {
    if ((y & 3) != 0) return false;                 // not divisible by 4
    if (y * 3264175145u > 171798691u) return true;  // not divisible by 25, so not by 100
    if ((y & 15) == 0) return true;                 // divisible by 16, so by 400
    return false;
}
```
## Step 2: A Branchless Variant
We can also write a branch-free single-expression version:
```c
bool is_leap_year3(uint32_t y) {
    return !(y & ((y % 25) ? 3 : 15));
}
```
This selects between masking with 3 (checking divisibility by 4) and masking with 15 (checking divisibility by 16) based on whether the year is a multiple of 25. It is compact, and compilers typically lower the ternary to a conditional move rather than a branch; but the select still has to wait for the modulo-by-25 result, so the compiled code remains several instructions longer than the three we are aiming for.
## Step 3: The Three-Instruction Solution
The quest for a true three-instruction solution means finding constants f, m, and t such that:
```c
bool is_leap_year_fast(uint32_t y) {
    return ((y * f) & m) <= t;
}
```
which compiles to: multiply, bitwise AND, compare-and-set. Finding those constants by hand is infeasible. Instead, the author fed the problem to the Z3 theorem prover using bit-vector constraints — essentially asking Z3 to find values of f, m, t such that the expression is true for all leap years and false for all non-leap years in the range 0–102,499. After approximately 30 minutes of computation, Z3 returned:
- f = 1073750999
- m = 3221352463
- t = 126976
Yielding the final form:
```c
bool is_leap_year_fast(uint32_t y) {
    return ((y * 1073750999u) & 3221352463u) <= 126976u;
}
```
## How Does It Actually Work?
The constants are not arbitrary magic numbers — they encode all three divisibility tests into the bit structure of a single 32-bit integer. Examining the binary representations of f and m reveals four distinct bit regions:
- Region A (low 2 bits): Detects non-divisibility by 4 — if any bit here is set after multiplication, the year fails the first test
- Region B (bits 2–16): Identifies years not divisible by 100 — these bits carry the fractional part of the modulo-100 test
- Region C (bits 17–19): Checks divisibility by 16, which combined with divisibility by 25 implies divisibility by 400
- Region D (remaining bits): the constant satisfies f & D ≈ 9175, which approximates 2¹⁷ × 7/100 = 9175.04, enabling detection of multiples of 100
The threshold t = 126976 acts as the combined boundary condition for all four regions simultaneously.
## Performance Benchmarks
Tested on an Intel Core i7-8700K (Coffee Lake, 4.7 GHz) with g++ -O3:
| Function | Fixed year (ns) | Random years (ns) |
|---|---|---|
| is_leap_year | 0.65 | 2.61 |
| is_leap_year2 | 0.65 | 2.75 |
| is_leap_year3 | 0.67 | 0.88 |
| is_leap_year_fast | 0.69 | 0.69 |
On a fixed (predictable) input, branch prediction eliminates most of the cost of the branchy versions and all implementations are roughly equivalent. On random input — which is the realistic case when checking years from a database or user input — the three-instruction version achieves a 3.8× speedup, dropping from 2.61 ns to 0.69 ns per call.
## Extending to 64-bit
The same approach works for 64-bit integers: equivalent constants exist that correctly handle all years from 0 to 5,965,232,499, found and verified with Z3 as discussed in a Code Golf Stack Exchange thread. The 64-bit version follows the same three-instruction structure.
## When Does This Actually Matter?
The author is candid about practical applicability. If you are checking the current year to format a date string, branch prediction will have the conditional version running at maximum speed anyway. The benefit materializes in high-throughput scenarios with unpredictable data: batch processing of date fields across millions of database records, time-series analysis, or any inner loop where the year is an unstructured input. In those cases, eliminating branch mispredictions at 3.8× the throughput is meaningful.
The deeper lesson is about methodology: Z3 and similar SMT solvers can find optimal bit-manipulation constants that no human would derive by hand, opening up a class of micro-optimizations that are correct by construction and verified exhaustively over the full input range.