Eccentric Developments

Packed Triangles

Introduction

It's been a hot minute since my last entry, as I have spent too much time playing around with different changes to the rendered, but not sitting down and writing about it. But finally, that changes now.

In a previous article on WASM and SIMD, I implemented some path tracing operations using SIMD to improve the performance of the JavaScript implementation. Since then I have made a couple of interesting changes: instead of supporting two versions (JavaScript and Rust), I am now working only on the Light Path Tracer (Rust) which I then compile into WebAssembly: code. Afterwards, I implemented a surface area heuristic to improve the quality of the bounding volume hierarchy. Those two changes improved the performance quite a bit.

Packed Triangles

After the surface area heuristic implementation, I started playing with a data-oriented design, where instead of storing the triangles as-is, I convert them to a packet of triangles with the values organized in arrays of elements, take a look at the code. This way, the information required to check if there is a ray intersection is organized in a more cache-friendly way.

This packed triangle approach, aggregates the values for the edges of a triangle in groups of four; this number was not selected at random, and I suspect you know why, but keep reading. Right now, the floating point numbers used in the path tracer are single precision, in other (double) words: 32 bits. The SIMD registers used by WebAssembly SIMD and the NEON (AARCH64) extensions are 128 bits, which can hold four single precision float (32 bits) values and operate on them using a single instruction, hence, packets of four triangles.

Implementation

Implementing the packed triangles data-oriented design was a bit tricky, because I wanted to no longer use indexes to reference individual triangles and store them as part of the BVH nodes, while tyring not to increase the size of the structure dramatically. To achieve this, I used Arc (although Rc could work too), to only store a pointer to the triangles.

Using the same approach, I extended the BVH Nodes to store pointers to a PackedTriangles structure, which stores the values used for ray-triangle intersection checks. This data is duplicated in the triangles internal structure, something I might change in the future.

Implementing a SIMD algorithm is a bit different than using regular scalar operations. While in a regular algorithm using scalar values, branching comes easy: when a condition is checked, the algorithm can execute diferent instructions or exit prematurely. This doesn't work the same way in the case of SIMD algorithms.

The following pseudocode might help explain this a bit better:

Regular scalar code:

let distance = ray.intersect(triangle) // 1 triangle
if distance < EPSILON {
    return 0
} else {
    return distance;
}

Vector code:

let distances = ray.intersect(triangle_pack) // 4 triangles, disances is a vector with 4 values
if (distances[0] < EPSILON &&
    distances[1] < EPSILON &&
    distances[2] < EPSILON &&
    distances[3] < EPSILON) {
    return [0, 0, 0, 0]
}
return distances

In this example, only the case of all the distances being smaller than EPSILON are handled, partial matches get more complicated.

The usual way to handle this when working with SIMD (vector) instructions is to use the output of SIMD conditional instuctions. Those isntructions output is kept as a mask, which is updated every time a condition needs to be checked. The algorithm doesn't stop, and uses the mask to eliminate invalid results before returning the output.

SIMD code:

let distances = ray.intersect(triangle_pack) // 4 triangles
let mask = simd_lt(distances, EPSILON)
return simd_and(mask, distances) // Bitwise AND of the mask and the distances

This is a bit disorienting initially but starts to make sense after a couple of implementations, writting an algorithm this way allows you to extend the number of values you operate simultaneusly very easily, like for instance eight floating point values for AVX (256 bits), 16 for AVX512 (512 bits) or even more by leveraign the GPU or other CPU coprocesors, like the Apple Silicon SME.

Performance

With all these changes, I was able to squeeze a not too shaby 18% performance increase. Feel free to play with the new imlpementation below.

Comparing with the numbers from the version published in the previous article, the table looks like this:

Version	Render Time (ms)
Previous Non-Optimized Version	229
Data-oriented design + SIMD	200

Numbers from an MBP M4Pro running macOS Sequoia 15.3 and Safary 18.3.

Using data-oriented design plus SIMD instructions gives a performance uplift of 1.14x, not bad.

Why not a 4x performance increase?

Being honest, I was expecting a much bigger performance uplift (4x-ish), but after the previous updates to use the surface area heuristic to create the BVH, the performance bottleneck is no longer the ray-triangle intersection test, as seen in this flame graph.

Where there is an almost 4x performance increase though, is when using the brute force algorithm, pushing me to think that the BVH traversal operation has now become the bottleneck. Look at the numbers obtained using the native version, runing on the same MBP, for a single frame of a 720 render:

Algorithm	Render Time (ms)
Brute Force Scalar	127273
Brute Force DO-D + SIMD (Neon)	29058
BVH + SAH + SIMD (Neon)	688

The numbers indicate that the SIMD implementation does improve the performance quite a bit on the ray-triangle interesction calculations, achieving a very good 4.37x improvement. But this definitely pales in comparison to the bounding volume hierarchy + surface area heuristic implementation, which gives an incredible 184.98x performance improvement over the naive brute force approach.

Conclusions

As previously discussed, implementing a data-oriented design algorithm that uses SIMD for ray-triangle intersections is quite interesting. I had a lot of fun and frustrating problems to solve, but as you previously read, the performance increase is very good when using data-oriented development techniques and SIMD vectorization. This will allso help extend the code to use AVX or Apple's SME in the (hopefully) not too distant future.

Now, the next target is identify the new bottleneck and work on improving it's performance.

Update

After publishing this article, I played around with the renderer and tought: wouldn't it be great to be able to rotate the camera?. So I spent some time implementing a very simple orbit camera.

The changes were easier to implement than I expected, although the result is a little bit wonky. To control it, drag the mouse to rotate and use the mouse wheel (or two fingers drag) to change the zoom, see it in action below.

Enrique CR - 2025-02-21