FizzBuzz, Senior Style

A senior developer interview spirals into an obsessive optimization journey, taking a simple FizzBuzz from 39 seconds down to just over one second through loop unrolling, custom number printing, SIMD intrinsics, and multithreading.

— Good afternoon, I'm here for a senior developer position interview.

— Hello, let's start with a small test while I review your CV. Write a program that prints numbers from 1 to, let's say, a billion, but if a number is divisible by three, print Fizz instead, if divisible by five — Buzz, and if by both three and five — FizzBuzz.

Seriously, FizzBuzz? An elementary school problem, for a senior position? Well, okay.


I pull out my trusty laptop and write the following code:

#include <stdio.h>

#define LIMIT 1000000000

int main(void) {
    for (int i = 1; i <= LIMIT; i++) {
        if (0 == i % 3) {
            if (0 == i % 5) {
                printf("FizzBuzz\n");
            } else {
                printf("Fizz\n");
            }
        } else if (0 == i % 5) {
            printf("Buzz\n");
        } else {
            printf("%d\n", i);
        }
    }

    return 0;
}

I run the program — it chugs along, but not exactly fast. After 3-something minutes, past the first million, I interrupt it and quickly calculate that the whole process would take over two days. Yes, I probably should have enabled buffering — that would speed things up somewhat — but it won't save me. Better to just redirect the output to a file, which is what I do, and 41.5 seconds later I have a beautiful 7.5 GB file.

— Don't you think it could be faster? — asks the interviewer.

— Come on, most of the time is spent on I/O. Writing 7.5 GB is no joke, even on an SSD.

— How about we redirect output to /dev/null?

— No problem.

A minute later:

— How is it 39.5 seconds? So all the I/O takes only 2 seconds, and everything else is my code?

— Yes, that's what we're seeing. This isn't the slowest implementation — each iteration has two comparisons and one printf. I often see versions with three comparisons and two printfs. For a junior, I'd say this is even decent. But for a senior...

That hurt, but I suppose it was deserved. Alright, I'll show you who the junior is here.

— I'll make it faster now.

— Give it a try. Just explain what you're doing.

— You see, there's a pattern here — every 3×5, that is, every 15 loop iterations the logic repeats entirely. So we can restructure the loop:

for (i = 1; i < LIMIT - 15; i += 15) {
    printf( "%d\n"          // 1
            "%d\n"          // 2
            "Fizz\n"        // 3
            "%d\n"          // 4
            "Buzz\n"        // 5
            "Fizz\n"        // 6
            "%d\n"          // 7
            "%d\n"          // 8
            "Fizz\n"        // 9
            "Buzz\n"        // 10
            "%d\n"          // 11
            "Fizz\n"        // 12
            "%d\n"          // 13
            "%d\n"          // 14
            "FizzBuzz\n",   // 15
            i, i+1, i+3, i+6, i+7, i+10, i+12, i+13);
}

— Before, every 15 numbers cost 15 loop-variable comparisons plus two if comparisons per number, 45 comparisons in total, and each comparison is a potential branch misprediction; now there's just one. The number of printf calls also drops by 15×. The only wrinkle: the loop no longer reaches one billion exactly, only 999,999,990 (the largest multiple of 15 below a billion), so we keep the old loop just to handle the tail, the last 10 values, which has practically no effect on performance.

After all the changes, the code looks like this.

— And what do we get for time?

— Output to file: 22.5 seconds. To /dev/null: 20.2.

The interviewer is pleased — apparently, something like this is what he expected from me. But... he shouldn't have said that thing about juniors.

— I think this isn't the limit.

— Really? What else can you optimize here?

— I reduced the number of printf calls by 15×, but each individual call became heavier. And printf itself is heavy due to its power — it's essentially a virtual machine with its own Turing-complete language; people have even written tic-tac-toe with it. In our case, only a small fraction of printf's capabilities are used, so we can replace it with something lighter:

#define NUM cur += myitoa(num++, cur)
#define FIZZ do { memcpy(cur, "Fizz\n", 5); cur += 5; num++; } while (0)
#define BUZZ do { memcpy(cur, "Buzz\n", 5); cur += 5; num++; } while (0)
#define FIZZBUZZ do { memcpy(cur, "FizzBuzz\n", 9); cur += 9; } while (0)

void print(int num) {
    static char wrkbuf[CHUNK_SIZE];

    char *cur = wrkbuf;
    NUM;
    NUM;
    FIZZ;
    NUM;
    BUZZ;
    FIZZ;
    NUM;
    NUM;
    FIZZ;
    BUZZ;
    NUM;
    FIZZ;
    NUM;
    NUM;
    FIZZBUZZ;
    fwrite(wrkbuf, cur - wrkbuf, 1, stdout);
}

— We could use an existing itoa, but it's a non-standard function, not available everywhere, and it's overly general: it supports arbitrary numeral bases while we only need decimal. Simplify wherever possible. And of course, in the main loop we now just call print(i) instead of that long printf.

The resulting code looks like this.

I walk up to the whiteboard and draw a table with the run results:

Variant           | Output to file                       | Output to /dev/null
                  | Time (sec) | vs. naive | vs. previous | Time (sec) | vs. naive | vs. previous
------------------|------------|-----------|--------------|------------|-----------|-------------
naive             | 41.429     | 1x        | -            | 39.650     | 1x        | -
loop optimization | 22.546     | 1.83x     | 1.83x        | 20.151     | 1.97x     | 1.97x
ditching printf   | 12.563     | 3.30x     | 1.80x        | 8.771      | 4.52x     | 2.30x

— We can mostly ignore the file output — some time gets eaten by I/O there, and it fluctuates, so it's better to focus on the I/O-free time.

I erase the file output part.

— Total speedup of 4.5×.

The interviewer looks at me with respect. Seems like I passed the interview, time to wrap up — the interviewer is already starting to shuffle papers. But I'm not done yet.

— I know how to make it even faster.

— Seriously?

— Absolutely. So far I've used purely technical speedup methods, but we can also improve algorithmically. Look at what gets printed when we call, say, print(150000001) and the subsequent print(150000016):

150000001\n150000002\nFizz\n150000004\nBuzz\nFizz\n150000007\n150000008\nFizz\nBuzz\n150000011\nFizz\n150000013\n150000014\nFizzBuzz\n
150000016\n150000017\nFizz\n150000019\nBuzz\nFizz\n150000022\n150000023\nFizz\nBuzz\n150000026\nFizz\n150000028\n150000029\nFizzBuzz\n
       ^^         ^^               ^^                     ^^         ^^                     ^^               ^^         ^^

— The differences are in only 16 bytes, yet the program rebuilds the entire string from scratch. We can just change bytes in place. Granted, we don't know in advance how many decimal digits need changing, so we'll need to compute that — compare two buffers and determine where they differ. That's a somewhat heavy task, but we have — I pause dramatically — vector instructions and intrinsics for them!

I don't say it out loud, but I'm implying that a junior wouldn't have come up with this. At this moment I realize the interviewer thinks so too.

I open Intel's intrinsics page and find the vector functions I need for working with 16-byte vectors. My values are 10 bytes max, but they can be padded with zeros to 16 — not a problem. And yes, 16-byte vectors mean SSE instructions — no AVX-512 needed here, my 4-year-old mobile processor can definitely handle this.

I get this chunk with rich and delicious intrinsics:

unsigned int diff = 0xFFFF & ~_mm_movemask_epi8(_mm_cmpeq_epi8(
                                  _mm_load_si128((__m128i const *)prev_first_number),
                                  _mm_load_si128((__m128i const *)last_number)));
unsigned int diff_pos = 16 - _tzcnt_u32(diff);   // number of changed digits

A quick check of flags in /proc/cpuinfo — the SSE2 (dating back to Pentium 4) and BMI1 (introduced in Haswell) instruction sets needed for my chosen intrinsics are present in the CPU — everything should work.

I run the resulting code, now only looking at /dev/null output, and update the table:

Variant           | Time (sec) | vs. naive | vs. previous
------------------|------------|-----------|-------------
naive             | 39.650     | 1x        | -
loop optimization | 20.151     | 1.97x     | 1.97x
ditching printf   | 8.771      | 4.52x     | 2.30x
buffer reuse      | 4.490      | 8.83x     | 1.95x

Almost another 2× speedup! And compared to the initial version — nearly 9×. Too bad I didn't quite reach 10×.

— Well, that should probably be enough. This is quite senior-level already.

The interviewer's gaze shows relief.

— Most likely we've reached the limit of what can be squeezed from a single-threaded program, — I say, slowly trailing off toward the end of the sentence. How did I not think of this earlier! — I haven't tried multithreading yet!

The interviewer appears to be getting worried, trying to inconspicuously slide to the adjacent chair closer to the conference room exit, but the armrests are seriously getting in his way. Whatever — he shouldn't have provoked me!

— I'll allocate each worker thread a chunk of the number space, and that thread will return a ready buffer for its chunk, while the main thread prints those buffers in order:

for (int j = 0; j < THREAD_COUNT; j++) {
    thread_pool[j].start_num = i;
    thread_pool[j].count = NUMS_PER_THREAD;
    thread_pool[j].buf = malloc(BUFFER_SIZE);
    pthread_create(&thread_pool[j].id, NULL, worker, (void *)&thread_pool[j]);
    i += NUMS_PER_THREAD;
}
int active_threads = THREAD_COUNT;
int max = LIMIT / 15 * 15;
for (int j = 0; active_threads; j = (j+1) % THREAD_COUNT) {
    pthread_join(thread_pool[j].id, NULL);
    fwrite(thread_pool[j].buf, thread_pool[j].buflen, 1, stdout);
    if (max - i > NUMS_PER_THREAD) {
        thread_pool[j].start_num = i;
        pthread_create(&thread_pool[j].id, NULL, worker, (void *)&thread_pool[j]);
        i += NUMS_PER_THREAD;
    } else if (max > i) {
        thread_pool[j].start_num = i;
        thread_pool[j].count = max - i + 1;
        pthread_create(&thread_pool[j].id, NULL, worker, (void *)&thread_pool[j]);
        i += max - i + 1;
    } else {
        free(thread_pool[j].buf);
        active_threads--;
    }
}

— I have a dual-core processor with hyper-threading, so four worker threads running at once is optimal. While the main thread is busy writing out a finished buffer, the worker that produced it sits idle, so at any given moment at most four threads are actually doing work. It's also worth experimenting with the chunk size: ideally a worker should process its chunk in exactly the time the main thread needs to write one out, so nobody waits and everything runs at full speed.

After several measurements, I settled on chunks of 3 million numbers — a convenient number divisible by 15, and the results are good.

The resulting code looks like this.

I run it and update the table:

Variant           | Time (sec) | vs. naive | vs. previous
------------------|------------|-----------|-------------
naive             | 39.650     | 1x        | -
loop optimization | 20.151     | 1.97x     | 1.97x
ditching printf   | 8.771      | 4.52x     | 2.30x
buffer reuse      | 4.490      | 8.83x     | 1.95x
multithreading    | 1.748      | 22.68x    | 2.57x

— There you have it — I reduced processing time by over 22×. And the code is quite senior-level: clever algorithm, multithreading, intrinsics no less. What do you think?

I turned away from the whiteboard and saw that I was alone in the conference room. Through the glass wall I could make out the interviewer, rapidly saying something to the receptionist while pointing in my direction. Is he... calling security? Looks like I've overstayed my welcome.

I quickly closed my laptop and left the office. For some reason, they never called me back.


All tests were performed on a Dell Latitude 7480 with an i7-7600U at 2.8 GHz, 16 GB memory, SSD, running OpenSUSE Leap 15.1 with kernel 4.12.14. Each test was run at least 10 times, and the lowest value was selected. Compilation flags used: -O3 -march=native -pthread

All code variants produce identical output when directed to a file, meaning they work correctly. Code available in this repo.


P.S. I received some very interesting PRs from a Japanese developer named kariya-mitsuru, and there were a couple of fascinating ideas in them. The most important one was filling the buffer from right to left instead of left to right — this eliminated several memcpy() calls, yielding noticeable speedups. For example, the custom print variant (without printf) became 31% faster, and the buffer reuse variant — 59% faster.

Based on this approach, I also reworked the multithreaded version, making it 66% faster. I really wanted to break into sub-1-second territory, but fell just barely short — a mere 50 milliseconds.

The resulting table looks like this (with links to corresponding implementations):

Variant                                | Time (sec) | vs. naive | vs. previous
---------------------------------------|------------|-----------|-------------
naive                                  | 39.650     | 1x        | -
loop optimization                      | 20.151     | 1.97x     | 1.97x
ditching printf                        | 8.771      | 4.52x     | 2.30x
ditching printf v2 (right-to-left fill)| 6.695      | 5.92x     | 1.31x
buffer reuse                           | 4.490      | 8.83x     | 1.49x
buffer reuse v2 (right-to-left fill)   | 2.818      | 14.07x    | 1.59x
multithreading                         | 1.051      | 37.73x    | 2.68x