Fastest Node.js Fizzbuzz

I recently encountered an interesting thread on CodeGolf that asked for the fastest implementation of everybody's favorite interview question, FizzBuzz. Unsurprisingly, the fastest entries were written in assembly and C, carefully optimizing their use of cache lines and syscalls.

In that thread, some people were also curious about how fast other languages can be when pushed to their limits. Surprisingly, nobody had attempted it in JavaScript yet, at least not with a valid solution.

Testing Methodology

All tests were run with the command node main.js | pv > /dev/null on my machine with a Ryzen 2600 and Node.js 18.4.0.

Basic Implementation

This is the basic FizzBuzz that everybody knows and loves.

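A minimal sketch of the baseline (the loop bound is truncated here for illustration; the actual challenge counts toward 2^63-1):

```javascript
// Canonical FizzBuzz using plain numbers and console.log.
function fizzbuzz(i) {
  if (i % 15 === 0) return "FizzBuzz";
  if (i % 3 === 0) return "Fizz";
  if (i % 5 === 0) return "Buzz";
  return String(i);
}

// Bounded loop for illustration; the real loop would run much longer.
for (let i = 1; i <= 100; i++) {
  console.log(fizzbuzz(i));
}
```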

However, this is not a valid solution for this particular challenge, since the author wanted solutions that can count all the way up to 2^63-1. In JavaScript, the maximum safe integer that can be stored in a native number (which is essentially a double-precision float) is 2^53-1. To fix this, we have to use the BigInt datatype introduced in Node.js 10.4.0.

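The same logic with BigInt looks roughly like this (note the `n` suffix on every literal; loop bound again truncated for illustration):

```javascript
// FizzBuzz over BigInt so we can count past 2^53 - 1.
function fizzbuzz(i) {
  if (i % 15n === 0n) return "FizzBuzz";
  if (i % 3n === 0n) return "Fizz";
  if (i % 5n === 0n) return "Buzz";
  return i.toString();
}

// Bounded loop for illustration; the real loop targets 2^63 - 1.
for (let i = 1n; i <= 100n; i++) {
  console.log(fizzbuzz(i));
}
```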

Running this baseline implementation yields around 1 MiB/s.

Remove Branches

One property of FizzBuzz is that it is cyclic with a period of 15. The first optimization we can make is to simply remove all the branches (if-statements) and unroll the loop into blocks of 15 iterations.

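One way to sketch the unrolled version is a template string that renders a whole 15-number cycle at once (helper name and loop bound are illustrative):

```javascript
// Render one branch-free FizzBuzz cycle of 15 lines starting after i.
function block(i) {
  return `${i + 1n}\n${i + 2n}\nFizz\n${i + 4n}\nBuzz\nFizz\n${i + 7n}\n${i + 8n}\nFizz\nBuzz\n${i + 11n}\nFizz\n${i + 13n}\n${i + 14n}\nFizzBuzz`;
}

// Bounded loop for illustration; the real loop targets 2^63 - 1.
for (let i = 0n; i < 90n; i += 15n) {
  console.log(block(i));
}
```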

Running this implementation yields around 10 MiB/s.

Remove console.log

Next, we can use process.stdout.write to directly write to the output stream and bypass the formatter logic inside console.log.

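The change is a one-line swap in the unrolled loop (same illustrative helper as before, with a trailing newline since `process.stdout.write` does not add one):

```javascript
// One unrolled 15-number cycle, now with an explicit trailing newline.
function block(i) {
  return `${i + 1n}\n${i + 2n}\nFizz\n${i + 4n}\nBuzz\nFizz\n${i + 7n}\n${i + 8n}\nFizz\nBuzz\n${i + 11n}\nFizz\n${i + 13n}\n${i + 14n}\nFizzBuzz\n`;
}

// Write directly to stdout, bypassing console.log's formatting layer.
// Bounded loop for illustration.
for (let i = 0n; i < 90n; i += 15n) {
  process.stdout.write(block(i));
}
```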

Running this implementation yields around 30 MiB/s.

Reduce process.stdout.write

Another caveat of Node.js is that whenever we write to stdout, the main thread blocks until the write finishes. Internally, each write is an expensive system call. To minimize these write calls, we can batch what we want to write until our buffer is full.

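A sketch of the batching approach, accumulating output in a string and flushing only past a threshold (the 64 KiB threshold and loop bound are illustrative assumptions):

```javascript
// One unrolled 15-number cycle, reused from the previous step.
function block(i) {
  return `${i + 1n}\n${i + 2n}\nFizz\n${i + 4n}\nBuzz\nFizz\n${i + 7n}\n${i + 8n}\nFizz\nBuzz\n${i + 11n}\nFizz\n${i + 13n}\n${i + 14n}\nFizzBuzz\n`;
}

const THRESHOLD = 1 << 16; // flush roughly every 64 KiB (assumed value)
let batch = "";

// Bounded loop for illustration; the real loop targets 2^63 - 1.
for (let i = 0n; i < 15000n; i += 15n) {
  batch += block(i);
  if (batch.length >= THRESHOLD) {
    process.stdout.write(batch); // one syscall for many cycles
    batch = "";
  }
}
process.stdout.write(batch); // flush the remainder
```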

Running this implementation yields around 100 MiB/s.

Out of Memory Errors

One thing I noticed while testing the previous implementations is that after a few minutes they run out of memory and crash.

<--- Last few GCs --->

[13497:0x5f97850]    40767 ms: Mark-sweep 4045.2 (4131.6) -> 4040.7 (4142.8) MB, 2416.9 / 0.0 ms  (average mu = 0.518, current mu = 0.069) allocation failure; scavenge might not succeed
[13497:0x5f97850]    44305 ms: Mark-sweep 4056.5 (4142.8) -> 4051.6 (4153.6) MB, 3522.2 / 0.0 ms  (average mu = 0.269, current mu = 0.004) allocation failure; scavenge might not succeed


<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb652d0 node::Abort() [node]
 2: 0xa761b5 node::FatalError(char const*, char const*) [node]
 3: 0xd55b6e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xd55ee7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xf32f15  [node]
 6: 0xf453fd v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 7: 0xf1fafe v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
 8: 0xf20ec7 v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
 9: 0xf01410 v8::internal::Factory::AllocateRaw(int, v8::internal::AllocationType, v8::internal::AllocationAlignment) [node]
10: 0xef8e84 v8::internal::FactoryBase<v8::internal::Factory>::AllocateRawWithImmortalMap(int, v8::internal::AllocationType, v8::internal::Map, v8::internal::AllocationAlignment) [node]
11: 0xefb138 v8::internal::FactoryBase<v8::internal::Factory>::NewRawOneByteString(int, v8::internal::AllocationType) [node]
12: 0x1070231 v8::internal::BigInt::ToString(v8::internal::Isolate*, v8::internal::Handle<v8::internal::BigInt>, int, v8::internal::ShouldThrow) [node]
13: 0x12d340e v8::internal::Runtime_ToString(int, unsigned long*, v8::internal::Isolate*) [node]
14: 0x16f6939  [node]

Since we clearly do not have any dangling references or memory leaks, we can only conclude that V8's garbage collector fails to free all the temporary strings. The error log shows the garbage collector running roughly every 4 seconds but freeing no more than about 5 MB each time. Since these implementations generate upwards of 100 MiB/s worth of strings, it's possible that the garbage collector simply does not have enough time budgeted to free all the unused strings.

Memory Safe Implementation

Since the challenge is to implement something that can, in theory, execute on all 2^63-1 integers given enough time, we need a solution that doesn't run out of memory after only a few minutes.

Since strings are too problematic, I decided to use a fixed binary buffer and avoid heap allocations as much as possible. This buffer is explicitly allocated once and is simply reused by resetting an offset counter. The only heap object allocated is the result of calling subarray, which returns a new Buffer object. Since this new Buffer internally points to the same memory location as the original buffer, the overall heap impact should be marginal and is unlikely to exceed our hypothesized budget of 5 MB per 4-second garbage-collection cycle.

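A sketch of the buffer-reuse idea (buffer size, helper names, and the bounded loop are illustrative assumptions; branches are kept here for brevity, while the real version would also stay unrolled):

```javascript
// One preallocated Buffer, reused by resetting an offset counter.
const BUF_SIZE = 1 << 16;
const buf = Buffer.alloc(BUF_SIZE);
let offset = 0;

// Write a BigInt's decimal digits into buf: this repeated BigInt
// division/modulo is the software-division hot spot discussed below.
function writeBigInt(n) {
  const digits = [];
  do {
    digits.push(Number(n % 10n));
    n /= 10n; // BigInt division truncates
  } while (n > 0n);
  for (let d = digits.length - 1; d >= 0; d--) {
    buf[offset++] = 48 + digits[d]; // '0' is ASCII 48
  }
}

const FIZZ = Buffer.from("Fizz\n");
const BUZZ = Buffer.from("Buzz\n");
const FIZZBUZZ = Buffer.from("FizzBuzz\n");

function flush() {
  // subarray points into the same memory: the only heap object per flush
  process.stdout.write(buf.subarray(0, offset));
  offset = 0;
}

// Bounded loop for illustration; the real loop targets 2^63 - 1.
for (let i = 1n; i <= 100n; i++) {
  if (i % 15n === 0n) offset += FIZZBUZZ.copy(buf, offset);
  else if (i % 3n === 0n) offset += FIZZ.copy(buf, offset);
  else if (i % 5n === 0n) offset += BUZZ.copy(buf, offset);
  else { writeBigInt(i); buf[offset++] = 10; } // 10 is '\n'
  if (offset > BUF_SIZE - 32) flush();
}
flush();
```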

This implementation runs at an astounding 8 MiB/s. Yes, this is not a typo. We went from 100 back down to 8!

This is a huge regression but, at the same time, it's not that surprising. Running a profiler on this code reveals the biggest bottleneck is the BigInt division and modulo operations used to calculate the i-th decimal digit for printing non-fizzbuzz numbers. Since BigInt is designed to work on arbitrarily large numbers, it must use software-based division algorithms instead of hardware-based 64-bit division instructions.

Worker Threads

Another property of FizzBuzz is that it is trivially parallelizable, since the result for one number is independent of every other number. Well... not quite. This challenge also stipulates that we cannot have null bytes in our output. Since different numbers have different numbers of digits, their locations in our buffer depend on every previous number.

We can still make use of the data-independence of FizzBuzz by delegating each thread to work on a different batch of numbers. For example, thread 1 works on [0, 3000), thread 2 works on [3000, 6000), thread 3 works on [6000, 9000), and so on. Inside each thread, we execute sequentially to pack a thread-specific output buffer. Finally, we round-robin print each thread's buffer in the main thread.


Putting everything we have learned together, we arrive at an implementation that runs at around 370 MiB/s. Not bad considering our baseline implementation ran at 1 MiB/s!

Final Thoughts

  • Node.js' garbage collector sucks.
  • Buffer system calls in memory as long as possible.
  • BigInt is insanely expensive. If you need to process integers larger than 2^53-1, you should consider using WebAssembly or native modules with access to hardware 64-bit integers. I used pure JavaScript for this challenge because I felt loading modules written in other languages defeats the spirit of the contest for language-specific implementations. Otherwise, every language's fastest implementation would just be a thin wrapper around C modules.