Eccentric Developments

WebAssembly & SIMD

Since the previous article, I have kept experimenting with different optimizations to speed up the data-oriented design implementation. The idea behind that approach was to make it easy to use SIMD instructions in the rendering algorithm.

After lots of experimentation I found that, while the SIMD instructions made the path tracer faster, this change alone was not enough to render a frame in under a second. In the end, I decided to move the sphere intersection code to the Wasm library, which allowed the SIMD code to finally shine.

On top of that, this update includes plenty of other changes. The following is a list of the modifications I made to the code:

Added a custom memory manager to allocate a block of space only once and just use offsets to the data for array references
Used SIMD instructions to calculate the intersections of four spheres at the same time
Did a small amount of loop unrolling
Implemented some other small optimizations, such as not using the unit vector function in randomDirection and removing the scale operation from the trace and shade functions. Both unit and scale create intermediary arrays, which causes memory pressure.
Removed the triangle intersection function for the time being, as only sphere support is implemented in Wasm. This has nothing to do with speed; it is code cleanup.
Since this is already very fast, I changed the code a bit to accumulate all the generated frames and visualize how the scene lighting converges.
Updated the pipeline to not execute all the steps for every frame but return a render function.
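
To make the first item in the list more concrete, here is a minimal sketch of a memory manager that allocates one block up front and hands out offsets into it. It is only an illustration of the idea, not the code used in the path tracer: the Arena type and its methods are hypothetical names, and it is written in plain Rust so it can be tried outside of Wasm.

```rust
/// A bump-style arena: one allocation up front, arrays handed out as offsets.
/// Hypothetical illustration, not the memory manager used by the path tracer.
pub struct Arena {
    data: Vec<f32>,
    next: usize, // index of the first unused element
}

impl Arena {
    /// Allocate the whole block once, zero-initialized.
    pub fn new(capacity: usize) -> Self {
        Arena { data: vec![0.0; capacity], next: 0 }
    }

    /// Reserve `len` floats and return the offset of the first one,
    /// or None if the block has no room left.
    pub fn alloc(&mut self, len: usize) -> Option<usize> {
        let start = self.next;
        let end = start.checked_add(len)?;
        if end > self.data.len() {
            return None;
        }
        self.next = end;
        Some(start)
    }

    /// Read-only view of `len` floats starting at `offset`.
    pub fn slice(&self, offset: usize, len: usize) -> &[f32] {
        &self.data[offset..offset + len]
    }

    /// Mutable view of `len` floats starting at `offset`.
    pub fn slice_mut(&mut self, offset: usize, len: usize) -> &mut [f32] {
        &mut self.data[offset..offset + len]
    }

    /// Forget every reservation at once, for example between frames.
    pub fn reset(&mut self) {
        self.next = 0;
    }
}
```

Because an array is just an offset plus a length, handing one out costs a single addition and nothing is ever freed individually; calling reset makes the whole block reusable.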

All these changes brought the rendering time for the same 8-sphere scene down from 2273ms to 409ms, which makes it about 5.5x as fast (!!!). Now I wonder how much further the performance can be increased if I implement more changes to the code. Some other optimization opportunities available are:

Accelerator structures
Multi-threading
GPU shaders

Code

I encourage you to play around with the code if your browser allows you; some do not work well with my code editor.

You can experiment with different performance characteristics by changing the scene composition. One easy way is to set sceneSelector to 2, which renders a scene with only 4 spheres; then remove one and see the performance difference.

Note: there is no code to stop the rendering loop after you hit Run; to make it stop you will have to reload the page.

Summary

In the end, the biggest performance increase was achieved by moving the critical code to the Wasm library and using SIMD instructions to calculate the sphere intersections.

One thing I want you to know is that the current SIMD support in WebAssembly only allows for vectors of 128 bits. This vector width can be used in different ways; in this case, I used it as four single precision (f32) values, which allowed me to calculate the intersection of a single ray against four different spheres at the same time.
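
The four-lane idea can be sketched in plain Rust, using a [f32; 4] array as a stand-in for one 128-bit register. This is an illustration only: the helper names (Lanes, splat, map2, intersect4) are made up for this sketch, and each helper does lane by lane what a single f32x4 instruction would do in the Wasm build. The math follows the same quadratic that the listing in the Extras section solves.

```rust
/// Four f32 lanes standing in for one 128-bit SIMD register.
type Lanes = [f32; 4];

/// Copy one scalar into all four lanes.
fn splat(v: f32) -> Lanes {
    [v; 4]
}

/// Apply `f` lane by lane, as one f32x4 instruction would.
fn map2(a: Lanes, b: Lanes, f: fn(f32, f32) -> f32) -> Lanes {
    [f(a[0], b[0]), f(a[1], b[1]), f(a[2], b[2]), f(a[3], b[3])]
}

fn add(a: Lanes, b: Lanes) -> Lanes { map2(a, b, |x, y| x + y) }
fn sub(a: Lanes, b: Lanes) -> Lanes { map2(a, b, |x, y| x - y) }
fn mul(a: Lanes, b: Lanes) -> Lanes { map2(a, b, |x, y| x * y) }
fn div(a: Lanes, b: Lanes) -> Lanes { map2(a, b, |x, y| x / y) }
fn min(a: Lanes, b: Lanes) -> Lanes { map2(a, b, f32::min) }

/// Intersect one ray (origin `ro`, direction `rd`) with four spheres whose
/// centres and radii are stored component-wise. Each lane holds the smaller
/// root t for its sphere, or 0.0 when the ray misses it.
pub fn intersect4(ro: [f32; 3], rd: [f32; 3], cx: Lanes, cy: Lanes, cz: Lanes, r: Lanes) -> Lanes {
    // Vector from each sphere centre to the ray origin.
    let ocx = sub(splat(ro[0]), cx);
    let ocy = sub(splat(ro[1]), cy);
    let ocz = sub(splat(ro[2]), cz);

    // Coefficients of a*t^2 - 2*b*t + c = 0, one equation per lane.
    let a = splat(rd[0] * rd[0] + rd[1] * rd[1] + rd[2] * rd[2]);
    let oc_dot_d = add(
        add(mul(ocx, splat(rd[0])), mul(ocy, splat(rd[1]))),
        mul(ocz, splat(rd[2])),
    );
    let b = mul(oc_dot_d, splat(-1.0));
    let c = sub(add(add(mul(ocx, ocx), mul(ocy, ocy)), mul(ocz, ocz)), mul(r, r));

    // Discriminant, plus a 1.0 / 0.0 mask marking the lanes that hit.
    let dis = sub(mul(b, b), mul(a, c));
    let mask: Lanes = dis.map(|d| if d > 0.0 { 1.0 } else { 0.0 });

    // Zero the discriminant on missed lanes so the square root stays finite.
    let e = mul(dis, mask).map(f32::sqrt);
    let t1 = div(sub(b, e), a);
    let t2 = div(add(b, e), a);
    mul(min(t1, t2), mask)
}
```

For a ray starting at the origin and pointing along +z, spheres centred at z = 5 (radius 1), z = 10 (radius 2) and z = 3 (radius 0.5) give 4.0, 8.0 and 2.5, while a sphere off to the side at (5, 0, 5) gives 0.0.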

Interestingly, outside of WebAssembly, there are several SIMD implementations with varying levels of CPU support. One of the newest is AVX-512, which can work with up to sixteen single precision values at once. It will be interesting to see if WebAssembly starts supporting wider SIMD vectors in the future.

Nonetheless, WebAssembly does not need to support them directly as long as the virtual machine it runs on performs the optimization itself, potentially by fusing similar adjacent SIMD operations into wider ones; if this ends up being the case, then loop unrolling will be key.
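
To show what "adjacent SIMD operations" means here, below is a small sketch of an unrolled loop in plain Rust. It is an assumption-laden illustration rather than the code from this article: add_unrolled and add4 are hypothetical names, and add4 stands in for one four-wide SIMD addition. Whether a given virtual machine actually fuses the two neighbouring operations into a wider one is not something this sketch can demonstrate.

```rust
/// Stand-in for one four-wide SIMD addition.
fn add4(a: &[f32], b: &[f32], out: &mut [f32]) {
    for lane in 0..4 {
        out[lane] = a[lane] + b[lane];
    }
}

/// Element-wise sum of `a` and `b` into `out`, unrolled so that each loop
/// body holds two adjacent four-wide additions (eight floats per iteration).
pub fn add_unrolled(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert!(a.len() == b.len() && a.len() == out.len());
    let n = a.len();
    let mut i = 0;

    // Unrolled part: two neighbouring four-wide operations per iteration.
    while n - i >= 8 {
        add4(&a[i..i + 4], &b[i..i + 4], &mut out[i..i + 4]);
        add4(&a[i + 4..i + 8], &b[i + 4..i + 8], &mut out[i + 4..i + 8]);
        i += 8;
    }

    // At most one leftover group of four.
    while n - i >= 4 {
        add4(&a[i..i + 4], &b[i..i + 4], &mut out[i..i + 4]);
        i += 4;
    }

    // Scalar tail for the last one to three elements.
    while i < n {
        out[i] = a[i] + b[i];
        i += 1;
    }
}
```

The two add4 calls in the first loop touch consecutive memory and perform the same operation, which is the kind of pattern that could be merged into a single eight-wide operation on hardware that supports it.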

Another interesting tidbit: Safari shows the best performance of the three major web browsers when working with this path tracer implementation:

Safari (v17.2.1): 409ms
Firefox (v122.0): 567ms
Chrome (v121.0.6167.139): 587ms

Time taken to render a single frame on a MacBook Air M1 with 16GB RAM.

Extras

Below is the code behind the vector_simd.wasm library. If you want to build it, make sure to use the following Cargo.toml configuration:

[lib]
crate-type = ["cdylib", "rlib"]

And use these commands for building:

rustup target add wasm32-unknown-unknown
RUSTFLAGS="-C target-feature=+simd128" cargo build --target wasm32-unknown-unknown --release

Finally, the code:

#![cfg(target_feature = "simd128")]
use core::arch::wasm32::*;
use std::ptr::copy;

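// Runs $block on groups of four floats, two groups per iteration while at
// least eight remain; $block must not change $i. Fewer than four leftover
// elements are left for the caller's scalar loop.
macro_rules! unroll {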
    ($i: ident, $size: ident, $block: block) => {
        while $i < $size {
            if $size - $i >= 8 {
                $block
                $i += 4;
                $block
                $i += 4;
            } else if $size - $i >= 4 {
                $block
                $i += 4;
            } else {
                break;
            }
        }
    }
}

#[no_mangle]
pub unsafe fn alloc(capacity: usize) -> *mut u8 {
    let mut memory = Vec::with_capacity(capacity);
    let ptr = memory.as_mut_ptr();
    std::mem::forget(memory);
    ptr
}

#[no_mangle]
pub unsafe fn dealloc(n: usize, ptr: *mut u8) {
    let _bytes: Vec<u8> = Vec::from_raw_parts(ptr, n, n);
}

#[no_mangle]
pub unsafe fn get(a: *const f32) -> f32 {
    *a
}

#[no_mangle]
pub unsafe fn set(a: *mut f32, value: f32) {
    copy(&value, a, 1);
}

#[no_mangle]
pub unsafe fn add(size: usize, a: *const f32, b: *const f32, c: *mut f32) {
    let mut i = 0;
    unroll!(i, size, {
        v128_store(
            c.add(i) as *mut v128,
            f32x4_add(
                v128_load(a.add(i) as *const v128),
                v128_load(b.add(i) as *const v128),
            ),
        );
    });

    while i < size {
        let x = *a.add(i);
        let y = *b.add(i);
        let r = x + y;
        set(c.add(i), r);
        i += 1;
    }
}

#[no_mangle]
pub unsafe fn fill(size: usize, v: f32, a: *mut f32) {
    let values = f32x4_splat(v);
    let mut i = 0;
    unroll!(i, size, {
        v128_store(a.add(i) as *mut v128, values);
    });

    while i < size {
        set(a.add(i), v);
        i += 1;
    }
}

#[no_mangle]
pub unsafe fn sub(size: usize, a: *const f32, b: *const f32, c: *mut f32) {
    let mut i = 0;
    unroll!(i, size, {
        v128_store(
            c.add(i) as *mut v128,
            f32x4_sub(
                v128_load(a.add(i) as *const v128),
                v128_load(b.add(i) as *const v128),
            ),
        );
    });

    while i < size {
        let x = *a.add(i);
        let y = *b.add(i);
        let r = x - y;
        set(c.add(i), r);
        i += 1;
    }
}

#[no_mangle]
pub unsafe fn mul_add(size: usize, a: *const f32, b: *const f32, c: *mut f32) {
    let mut i = 0;

    unroll!(i, size, {
        let va = v128_load(a.add(i) as *const v128);
        let vb = v128_load(b.add(i) as *const v128);
        let vc = v128_load(c.add(i) as *const v128);
        v128_store(c.add(i) as *mut v128, f32x4_add(vc, f32x4_mul(va, vb)));
    });

    while i < size {
        let x = *a.add(i);
        let y = *b.add(i);
        let z = *c.add(i);
        let r = x * y + z;
        set(c.add(i), r);
        i += 1;
    }
}

#[no_mangle]
pub unsafe fn sqrt(size: usize, a: *const f32, b: *mut f32) {
    let mut i = 0;

    unroll!(i, size, {
        v128_store(
            b.add(i) as *mut v128,
            f32x4_sqrt(v128_load(a.add(i) as *const v128)),
        );
    });

    while i < size {
        let x = *a.add(i);
        let r = x.sqrt();
        set(b.add(i), r);
        i += 1;
    }
}

#[no_mangle]
pub unsafe fn scale(size: usize, v: f32, a: *mut f32, b: *mut f32) {
    let values = f32x4_splat(v);
    let mut i = 0;

    unroll!(i, size, {
        let va = v128_load(a.add(i) as *const v128);
        v128_store(b.add(i) as *mut v128, f32x4_mul(va, values));
    });

    while i < size {
        let x = *a.add(i);
        let r = x * v;
        set(b.add(i), r);
        i += 1;
    }
}

#[no_mangle]
pub unsafe fn div(size: usize, a: *const f32, b: *const f32, c: *mut f32) {
    let mut i = 0;

    unroll!(i, size, {
        v128_store(
            c.add(i) as *mut v128,
            f32x4_div(
                v128_load(a.add(i) as *const v128),
                v128_load(b.add(i) as *const v128),
            ),
        );
    });

    while i < size {
        let x = *a.add(i);
        let y = *b.add(i);
        let r = x / y;
        set(c.add(i), r);
        i += 1;
    }
}

#[no_mangle]
pub unsafe fn mul(size: usize, a: *const f32, b: *const f32, c: *mut f32) {
    let mut i = 0;

    unroll!(i, size, {
        v128_store(
            c.add(i) as *mut v128,
            f32x4_mul(
                v128_load(a.add(i) as *const v128),
                v128_load(b.add(i) as *const v128),
            ),
        );
    });

    while i < size {
        let x = *a.add(i);
        let y = *b.add(i);
        let r = x * y;
        set(c.add(i), r);
        i += 1;
    }
}

#[no_mangle]
pub unsafe fn min(size: usize, a: *const f32, b: *const f32, c: *mut f32) {
    let mut i = 0;

    unroll!(i, size, {
        v128_store(
            c.add(i) as *mut v128,
            f32x4_min(
                v128_load(a.add(i) as *const v128),
                v128_load(b.add(i) as *const v128),
            ),
        );
    });

    while i < size {
        let x = *a.add(i);
        let y = *b.add(i);
        let r = x.min(y);
        set(c.add(i), r);
        i += 1;
    }
}

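// Scratch space for spheres_intersect, which carves eight buffers of `size`
// floats out of it, so `size` must not exceed 128 (8 * 128 = 1024).
static mut MEMORY: [f32; 1024] = [0.0; 1024];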

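// Intersects one ray (origin ro*, direction d*) with `size` spheres given as
// separate arrays of centres (spcx/spcy/spcz) and radii (spcr). Writes the
// smaller root t per sphere to `out`, or 0.0 where the ray misses.
#[no_mangle]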
pub unsafe fn spheres_intersect(
    size: usize,
    rox: f32,
    roy: f32,
    roz: f32,
    dx: f32,
    dy: f32,
    dz: f32,
    spcx: *mut f32,
    spcy: *mut f32,
    spcz: *mut f32,
    spcr: *mut f32,
    out: *mut f32,
) {
    let mem = MEMORY.as_mut_ptr() as *mut f32;
    let mut offset = 0;

    let ocx = mem.add(offset);
    offset += size;
    fill(size, rox, ocx);

    let ocy = mem.add(offset);
    offset += size;
    fill(size, roy, ocy);

    let ocz = mem.add(offset);
    offset += size;
    fill(size, roz, ocz);

    sub(size, ocx, spcx, ocx);
    sub(size, ocy, spcy, ocy);
    sub(size, ocz, spcz, ocz);

    let rdx = mem.add(offset);
    offset += size;
    fill(size, dx, rdx);

    let rdy = mem.add(offset);
    offset += size;
    fill(size, dy, rdy);

    let rdz = mem.add(offset);
    offset += size;
    fill(size, dz, rdz);

    let a = mem.add(offset);
    offset += size;
    fill(size, dx * dx + dy * dy + dz * dz, a);

    let b = mem.add(offset);

    fill(size, 0.0, b);
    mul_add(size, ocx, rdx, b);
    mul_add(size, ocy, rdy, b);
    mul_add(size, ocz, rdz, b);
    let m0 = rdy;
    fill(size, -1.0, m0);
    mul(size, b, m0, b);
    let c = rdz;
    fill(size, 0.0, c);
    mul_add(size, ocx, ocx, c);
    mul_add(size, ocy, ocy, c);
    mul_add(size, ocz, ocz, c);
    let mr = m0;
    mul(size, spcr, spcr, mr);
    sub(size, c, mr, c);
    let bb = ocx;
    mul(size, b, b, bb);
    let ac = ocy;
    mul(size, a, c, ac);
    let dis = ocz;
    sub(size, bb, ac, dis);
    let mask = mr;
    for i in 0..size {
        let v = if get(dis.add(i)) > 0.0 { 1.0 } else { 0.0 };
        set(mask.add(i), v);
    }

    mul(size, dis, mask, dis);
    let e = dis;
    sqrt(size, dis, e);
    let t1 = bb;
    sub(size, b, e, t1);
    let t2 = ac;
    add(size, b, e, t2);
    div(size, t1, a, t1);
    div(size, t2, a, t2);
    min(size, t1, t2, t1);
    mul(size, t1, mask, out);
}
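
Note that spheres_intersect takes the sphere data as one array per component (spcx, spcy, spcz, spcr) rather than as an array of sphere objects. A minimal sketch of that conversion is shown below, assuming a hypothetical Sphere struct; the names Sphere, SphereArrays and to_arrays are not part of the library above and are only here for illustration.

```rust
/// Hypothetical "array of structures" sphere, as a scene might describe it.
#[derive(Clone, Copy)]
pub struct Sphere {
    pub center: [f32; 3],
    pub radius: f32,
}

/// "Structure of arrays" layout: one contiguous array per component, so four
/// consecutive x values (or y, z, radius) can be loaded into one vector.
#[derive(Default)]
pub struct SphereArrays {
    pub cx: Vec<f32>,
    pub cy: Vec<f32>,
    pub cz: Vec<f32>,
    pub r: Vec<f32>,
}

/// Split a list of spheres into per-component arrays, keeping their order.
pub fn to_arrays(spheres: &[Sphere]) -> SphereArrays {
    let mut out = SphereArrays::default();
    for s in spheres {
        out.cx.push(s.center[0]);
        out.cy.push(s.center[1]);
        out.cz.push(s.center[2]);
        out.r.push(s.radius);
    }
    out
}
```

With this layout, element i of every array belongs to sphere i, and the result written to element i of the output array refers to that same sphere.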

Enrique CR - 2024-02-04
