In the previous entry, I got a huge speedup in the path tracing implementation using SIMD instructions plus some memory-usage optimizations, and I closed by mentioning that one of the next steps was multi-threading.
Multi-threading means spinning up several threads to execute operations in parallel, using as many cores as the CPU makes available. In theory, this promises good speedups with low effort; in reality, things are not as straightforward.
JavaScript is a single-threaded language, but it offers a way to escape that through WebWorkers. A WebWorker is an interface supported by web browsers that allows you to run background tasks and communicate back and forth over a messaging interface.
Using this facility, I spent some time moving the path tracer's rendering code into a WebWorker and adding a synchronization mechanism that triggers the rendering of sections of an image, which then get stitched together to form the final product. The end result is a faster rendering time, but not quite what I expected.
This is the code
Implementation details
Tiles but not exactly
Initially I implemented this as square tiles, but the overhead of the message passing interface made the implementation much slower than the single-threaded version. To alleviate this, I switched to full-width sections 8 pixels tall, which require less synchronization and fewer messages. Later developments may let me go back to square tiles, but I left it like this for the time being.
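As a sketch of how the image might be split into these row sections (the function name and exact scheme are my assumption, not the post's actual code):

```javascript
// Split an image into full-width sections of `rowsPerSection` pixels of
// height; the last section may be shorter if the height does not divide
// evenly. Illustrative only.
function rowSections(imageHeight, rowsPerSection = 8) {
  const sections = [];
  for (let y = 0; y < imageHeight; y += rowsPerSection) {
    sections.push({ y, height: Math.min(rowsPerSection, imageHeight - y) });
  }
  return sections;
}
```

Each worker then only needs a section index to know which rows to render, which keeps the per-job metadata tiny.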
WorkersPool
The available WebWorkers are managed through a `WorkersPool` class that handles the initialization of the WebWorkers and triggers the start of the rendering workflow. When the `WorkersPool` is instantiated, two parameters control the size of the pool: `poolSize` and `maxPoolSize`. Neither value is used right away; they only come into play when the pool is initialized or processing starts.
On initialization, the `WorkersPool` will create `maxPoolSize` WebWorkers and initialize them; they remain in memory, waiting for the signal to start a rendering job. When the `process` function is called, the `WorkersPool` signals `poolSize` threads to pick the next available section to render and work on it until no more sections are left.
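A minimal sketch of this pool idea (the class shape, the injected worker factory, and the `start` method are assumptions for illustration, not the post's actual code):

```javascript
// Sketch of a WorkersPool: maxPoolSize workers are created once and kept
// alive; process() only signals the first poolSize of them. A factory is
// injected so the class can be exercised outside the browser.
class WorkersPool {
  constructor(poolSize, maxPoolSize, workerFactory) {
    this.poolSize = poolSize;       // workers used for a given frame
    this.maxPoolSize = maxPoolSize; // workers created up front
    this.workerFactory = workerFactory;
    this.workers = [];
  }

  // Create all maxPoolSize workers; they idle until process() is called.
  init() {
    for (let i = 0; i < this.maxPoolSize; i++) {
      this.workers.push(this.workerFactory(i));
    }
  }

  // Signal only poolSize workers to start pulling sections to render.
  process() {
    return this.workers.slice(0, this.poolSize).map((w) => w.start());
  }
}
```

In the browser the factory would wrap `new Worker(...)`; keeping all workers alive makes changing `poolSize` between frames essentially free.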
Since all the WebWorkers are created on initialization, it is easy to change the number of workers used on a particular frame, so I implemented a method to find the best `poolSize` configuration. The code simply goes from `1` thread up to `maxPoolSize` and records the rendering time at each step (signaled by `(?)`); after all the configurations have been tested, it scans for the number of threads that gave the best performance and tells the `WorkersPool` to use that configuration from that point onwards (indicated by `(!)`).
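The search could be sketched like this (assumed logic; `renderWithThreads` is a hypothetical timing hook standing in for a real timed render pass):

```javascript
// Try every thread count from 1 to maxPoolSize, record the render time of
// each, and keep the fastest configuration. Illustrative sketch only.
function findBestPoolSize(maxPoolSize, renderWithThreads) {
  let best = 1;
  let bestTime = Infinity;
  for (let n = 1; n <= maxPoolSize; n++) {
    const elapsed = renderWithThreads(n); // (?) record timing for n threads
    if (elapsed < bestTime) {
      bestTime = elapsed;
      best = n;
    }
  }
  return best; // (!) use this configuration from now on
}
```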
Message Passing
The default mechanism for communicating between the main JavaScript thread and the spawned WebWorkers is a message passing interface that offers a `postMessage` function and an `onmessage` event handler (docs).
This mechanism works well to isolate both threads and communicate between them without data races. The problem is that, to do so, parameters are either copied or moved, which makes for slow transactions; for a fast rendering job, this incurs a hefty performance penalty.
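The copying can be seen with `structuredClone`, the same algorithm `postMessage` uses to serialize its payload (a sketch; the real cost shows up when large pixel buffers cross the boundary every frame):

```javascript
// postMessage serializes its payload with the structured clone algorithm,
// so the receiver gets a deep copy, not a reference. For a big pixel
// buffer sent per section, that copy is where the time goes.
const payload = { section: 0, pixels: new Float64Array(4) };
const received = structuredClone(payload); // what the worker would see

// The clone has equal content but is a completely separate object:
const isCopy = received !== payload && received.pixels !== payload.pixels;
```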
SharedArrayBuffer
Fortunately, JavaScript offers a lower-overhead communication mechanism: SharedArrayBuffer, a buffer that can be read and written by multiple WebWorkers and the main thread simultaneously. This also means that data races can occur if you are not careful.
With this facility, I was able to drop the message passing interface for synchronizing rendering jobs and their results; now, all WebWorkers write to separate sections of the shared memory.
But there is one big caveat to be aware of when using `SharedArrayBuffer`: you can only use it inside a secure context. To achieve this, your webserver needs to emit two headers:
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
And then, if you are not testing locally (either through `localhost` or `127.0.0.1`), you need to serve your resources over `https` or `wss` (docs).
Atomics
Atomics are a facility offered by JavaScript to synchronize access to shared resources in a multi-threaded environment. They read or write a memory location with the guarantee that no other update will happen while the operation is being executed.
Using them, I was able to share a memory object containing the next available section for processing, enabling a sort of job queue where WebWorkers take the next unprocessed portion of the image to work on.
This sort of work stealing strategy has two advantages:
- Threads working on difficult parts of the scene will not block other threads from other sections.
- In the case of heterogeneous CPU architectures, like big.LITTLE, slower cores can still help with rendering at their own pace without slowing down the faster cores' work.
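The job queue can be sketched with a single shared counter (shown single-threaded here for brevity; names are illustrative, not the post's actual code):

```javascript
// A single Int32 counter in shared memory holds the index of the next
// unrendered section. Each worker atomically fetch-and-increments it;
// Atomics.add returns the previous value, so no two workers can ever
// claim the same section.
const TOTAL_SECTIONS = 8;
const shared = new SharedArrayBuffer(4);   // would be handed to each worker
const nextSection = new Int32Array(shared);

function claimSection() {
  const i = Atomics.add(nextSection, 0, 1); // atomic fetch-and-add
  return i < TOTAL_SECTIONS ? i : null;     // null → no work left
}
```

Each worker just loops on `claimSection()` until it returns `null`, which is what lets faster cores naturally pick up more sections.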
What went wrong
Sadly, not everything went as smoothly as I expected. The main problem this implementation faces is the JavaScript garbage collector, which on certain browsers (Firefox and Chrome) causes big performance drops when executing major sweeps. Interestingly, this is not a problem in Safari, so Apple devices are unaffected.
Also, different browsers do not advertise the same value for `navigator.hardwareConcurrency` on the same CPU: on a Ryzen 3950X, Chrome returns `32` while Firefox gives `16`, which changes the program's behavior if that value guides the thread configuration.
In any case, the speedup usually drops off sharply after two or three threads, indicating a bottleneck somewhere that I have not identified yet. My guess is that there is too much pressure on the memory subsystem, and a fix will require removing as many allocations as possible from the code's hot path.
Summary
I started writing this entry in a defeated tone, since the initial implementation was very slow, until I refactored it to use `SharedArrayBuffer` and `Atomics`, which improved things dramatically.
Still, the bottleneck from the memory allocator and garbage collector stops the implementation from using all the available computational resources for rendering. I might spend some time on this later, or move all the code to WebAssembly via Rust, where memory management is more deterministic.
The next step will be implementing a good acceleration structure to speed up the intersection tests. But to really experience its advantages, the scene needs several objects, and there is no easier way to increase scene complexity than building everything out of triangles.
So the next chapter in this series will bring back triangles, using WebAssembly SIMD instructions.