Optimizing a WebGPU Matmul Kernel for 1TFLOP+ Performance


I work at Nomic, where many of my colleagues work on making large TSNE-like visualizations run in the browser. Showing tens of millions of data points in the browser without rendering your computer an oven is no easy challenge. I overhear many of the scaling problems solved by Deepscatter, first developed by Ben Schmidt.

However, many of the conversations I overhear tend to revolve around Typescript and how awesome WebGPU is. At the time of writing, I couldn't find any autograd libraries built with WebGPU. So, as an educational exercise to learn WebGPU and Typescript, I decided to build Surfgrad, a high-performance, WebGPU-powered autograd library that enables browser-based tensor operations.

In this post, I'll cover how I optimized a naive WebGPU matrix multiplication (matmul) kernel to over 1 TFLOPS of throughput. The goal isn't to build the fastest autograd library, but to show the nuances of WebGPU and how it might differ from CUDA.

Perhaps in the future, we can even use Surfgrad for running the next Llama models.

WebGPU is an API designed to let people write GPU code that runs on any phone or computer with a web browser. Previously, people hacked around WebGL to run machine learning workloads, like rendering to an invisible canvas and reading numbers back as colors. Now people can take advantage of the increasing power of GPUs in laptops and run compute kernels (i.e. data in, data out, without any funny business).

WebGPU was created to give the "compute" shader first-class support and open the doors for in-browser, private machine learning development.

The compute (and vertex and fragment) shaders are written in WGSL. WGSL is designed so that developers write a single shader that gets compiled to lower-level languages like SPIR-V for Vulkan and MSL for Metal.

Ben's also written some wonderful articles on what WebGPU is and why it's important.

NVIDIA is the most well-known choice for hardware, and CUDA, its API, is one of the reasons why, but CUDA only works on NVIDIA hardware.

WebGPU and CUDA share similar terminology but don't have the exact same functionality. WebGPU just introduced support for subgroups, which let threads within a group share data efficiently, a big win for things like matrix multiplies where you may recompute similar values.

WebGPU also sits a half step above CUDA in that it compiles to other GPU backends like Vulkan and Metal. It's kind of like React Native for GPU compute shaders.

The smallest unit is a thread, which executes the compute shader.

Workgroups are groups of threads: they are grouped together and run in parallel (they're called thread blocks in CUDA). They can access the same shared memory.

WebGPU can dispatch many of these workgroups at once; CUDA calls this a grid (which is made of thread blocks).

Similarly to CUDA, workgroups and workgroup dispatches are defined in 3D. The size of a workgroup is defined by @workgroup_size(x, y, z), where the number of threads per workgroup is x * y * z.
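
As a small sketch of that arithmetic (the sizes here are purely illustrative, not anything Surfgrad uses), the total number of shader invocations is the product of the workgroup size and the dispatch grid:

const workgroupSize = [4, 4, 2]; // @workgroup_size(x, y, z) in the WGSL shader
const dispatchGrid = [2, 2, 1];  // pass.dispatchWorkgroups(x, y, z) on the host

const threadsPerWorkgroup = workgroupSize.reduce((a, b) => a * b, 1); // 32
const totalInvocations =
  threadsPerWorkgroup * dispatchGrid.reduce((a, b) => a * b, 1);      // 128

console.log({ threadsPerWorkgroup, totalInvocations });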

Matrix multiplications make up most of the floating point operations (FLOPs) in Large Language Models like GPT-4 and Llama. They are the fundamental primitive for most training and inference workloads.

Native WebGPU support for matrix multiply is limited to small matrices, which isn't useful for contemporary deep learning workloads where matrices can be large.

A few quick notes on notation.

First, a matrix multiply is defined by three matrices: A of shape (M, K), B of shape (K, N), and the output C of shape (M, N).

The total FLOPs required for a matrix multiply are 2 * M * K * N, as each output element requires both a multiply and an add (hence the 2).

Following the example from Siboehm's wonderful article, we multiply two 4092×4092 matrices, followed by the addition of a 4092×4092 matrix. That gives us:

  1. Total FLOPs: 137 GFLOP

  2. Total data to read: 201 MB

  3. Total data to store: 67 MB

However, I am developing on a Mac M2 Pro, which has ~6 TFLOPS of compute and 200 GB/s of memory bandwidth.

So, the fastest the compute can run is

(137 GFLOP) / (6 TFLOPS) = 22ms

and memory access takes

(268 MB) / (200 GB/s) = 1.34ms

so we should be compute bound (by ~16x, too!).
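
Here's a quick sketch of that back-of-the-envelope arithmetic in TypeScript (the constants simply mirror the numbers above):

const M = 4092, K = 4092, N = 4092;
const bytesPerFloat = 4;

const flops = 2 * M * K * N;                      // ~137 GFLOP for the matmul
const bytesRead = 3 * M * N * bytesPerFloat;      // A, B, and the matrix we add: ~201 MB
const bytesStored = M * N * bytesPerFloat;        // the output C: ~67 MB

const computeMs = (flops / 6e12) * 1e3;                        // ~22.8 ms at 6 TFLOPS
const memoryMs = ((bytesRead + bytesStored) / 200e9) * 1e3;    // ~1.34 ms at 200 GB/s

console.log({ computeMs, memoryMs, ratio: computeMs / memoryMs }); // compute bound by ~16x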

The simplest way to compute a dot product between matrices A and B and write to matrix C is: for each row of A (of size M), iterate over the columns of A (of size K) and multiply by the corresponding value of B. In Python, this looks like:

def matmul(a, b, c):
    """
    Perform naive matrix multiplication: C = A * B
    
    :param a: Input matrix A of shape (m, k)
    :param b: Input matrix B of shape (k, n)
    :param c: Output matrix C of shape (m, n) to store the result
    """
    m = len(a)
    k = len(a[0])
    n = len(b[0])
    
    # Perform the matrix multiplication
    for i in range(m):
        for j in range(n):
            c[i][j] = 0
            for l in range(k):
                c[i][j] += a[i][l] * b[l][j]

Similar to the Python code above, we define our inputs:

struct Dimensions {
  M: u32,
  K: u32,
  N: u32,
}

@group(0) @binding(0) var<uniform> dimensions: Dimensions;
@group(0) @binding(1) var<storage, read> a: array<f32>;
@group(0) @binding(2) var<storage, read> b: array<f32>;
@group(0) @binding(3) var<storage, read_write> result: array<f32>;

and our compute kernel:

@compute @workgroup_size(1)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
  let index = global_id.x;
  let row = index / dimensions.N;
  let col = index % dimensions.N;

  if (index < dimensions.M * dimensions.N) {
    var sum = 0.0;
    for (var i: u32 = 0u; i < dimensions.K; i = i + 1u) {
      sum = sum + a[row * dimensions.K + i] * b[i * dimensions.N + col];
    }
    result[row * dimensions.N + col] = sum;
  }
}

The code is functionally equivalent to the Python code above! We define how big our workgroup is with workgroup_size(1) (remember, this is represented in 3D).

So each workgroup, since it's only one thread, processes one result[i, j].

To compute the full matrix, we need to launch as many workgroups as there are entries in the matrix by calling dispatchWorkgroups:

pass.dispatchWorkgroups(a.shape[0] * b.shape[1]) 

where a.shape[0] == M and b.shape[1] == N for an M×N output matrix.
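
For context, here is a minimal sketch of the host-side setup that launches a kernel like this (buffer creation and data upload are omitted, and the function and buffer names are mine, not Surfgrad's):

// Assumes a GPUDevice, a WGSL string `shaderCode`, and already-populated GPUBuffers.
async function runNaiveMatmul(
  device: GPUDevice, shaderCode: string,
  dimsBuffer: GPUBuffer, aBuffer: GPUBuffer, bBuffer: GPUBuffer, resultBuffer: GPUBuffer,
  M: number, N: number,
) {
  const module = device.createShaderModule({ code: shaderCode });
  const pipeline = device.createComputePipeline({
    layout: "auto",
    compute: { module, entryPoint: "main" },
  });
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: dimsBuffer } },
      { binding: 1, resource: { buffer: aBuffer } },
      { binding: 2, resource: { buffer: bBuffer } },
      { binding: 3, resource: { buffer: resultBuffer } },
    ],
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(M * N); // one single-thread workgroup per output element
  pass.end();
  device.queue.submit([encoder.finish()]);
  await device.queue.onSubmittedWorkDone();
}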

Now, as we'll see below, we have lots of room for improvement!

The largest square matrix multiply we can compute is 128x128 due to limits in WebGPU (more on this later). We only achieve 1.64 GFLOPS, a far cry from the theoretical max of ~6 TFLOPS.

Why is this kernel so slow? In effect, each workgroup computes a single entry of the 16,384 total elements (128^2). Although we are running in parallel, each workgroup loads its own copy of the matrices. The overhead of launching more workgroups is likely greater than if each workgroup had more threads and computed more results, and each workgroup isn't able to take advantage of any caching of the inputs.

With the first kernel, we're only able to compute small square matrices due to the limit on the number of workgroups (maxComputeWorkgroupsPerDimension) you can dispatch at once.

Since we're launching one workgroup per entry, a 256x256 matrix is larger than our limit!
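
You can query this limit at runtime. A small sketch (assuming you already have a GPUDevice):

// Sketch: will a one-workgroup-per-element dispatch fit within the limit?
function fitsNaiveDispatch(device: GPUDevice, M: number, N: number): boolean {
  const maxPerDim = device.limits.maxComputeWorkgroupsPerDimension; // 65,535 by default
  return M * N <= maxPerDim; // 256 * 256 = 65,536 already exceeds the default
}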

Remember this line?

@compute @workgroup_size(1)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {

We can reduce the number of dispatched workgroups by increasing the number of threads per workgroup!

If we update our code

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {

we can reduce the total number of dispatched workgroups per dimension:

const WORKGROUP_SIZE = 256;
pass.dispatchWorkgroups((a.shape[0] * b.shape[1]) / WORKGROUP_SIZE);

Why 256? Well, there's another limit 🙂
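
That other limit is maxComputeInvocationsPerWorkgroup, which caps the number of threads in a single workgroup at 256 by default. A tiny sketch of how to query it:

// Sketch: x * y * z in @workgroup_size must not exceed this value (256 by default).
function maxThreadsPerWorkgroup(device: GPUDevice): number {
  return device.limits.maxComputeInvocationsPerWorkgroup;
}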

By increasing the workgroup size, we're able to improve our kernel by 200x!

However, doing all the computation in one dimension limits the matrix size we can compute.

Although we don't change much about our code, if we spread our work across two dimensions, we're able to get around these limits and launch more, larger workgroups. This lets us compute a 4096x4096 matmul.

We update our @workgroup_size(8, 8), check our bounds,

@compute @workgroup_size(8, 8)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
  let row = global_id.x;
  let col = global_id.y;

  if (row < dimensions.M && col < dimensions.N) {
    var sum : f32 = 0.0;
    for (var i: u32 = 0u; i < dimensions.K; i = i + 1u) {
      sum = sum + a[row * dimensions.K + i] * b[i * dimensions.N + col];
    }
    result[row * dimensions.N + col] = sum;
  }
}

and dispatch workgroups in 2D:

const WORKGROUP_SIZE = 8; // matches @workgroup_size(8, 8)
pass.dispatchWorkgroups(
    Math.ceil(a.shape[0] / WORKGROUP_SIZE),
    Math.ceil(b.shape[1] / WORKGROUP_SIZE),
);

But this is slower than our original kernel! What's going on?

If we make a small change to the code

@compute @workgroup_size(8, 8)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
  let row = global_id.y;
  let col = global_id.x;

we get much better kernel performance.

Why is this? We're able to take more advantage of cached inputs. The x dimension is incremented before the y dimension in the global_invocation_id, and therefore more threads in each workgroup use the same row of matrix A. Otherwise, the row variable changes at each invocation within the workgroup, and each thread has to spend a few extra cycles reading from global memory rather than from the cache.
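
To make that ordering concrete, here is a tiny, purely illustrative sketch (plain TypeScript, no GPU needed) of which row of A consecutive invocations in an 8x8 workgroup touch under the two mappings:

// Enumerate invocation ids in x-fastest order, as WebGPU generates them.
function rowsTouched(rowFrom: "x" | "y"): number[] {
  const rows: number[] = [];
  for (let y = 0; y < 8; y++) {
    for (let x = 0; x < 8; x++) {
      rows.push(rowFrom === "x" ? x : y); // row = global_id.x vs. row = global_id.y
    }
  }
  return rows;
}

console.log(rowsTouched("x").slice(0, 8)); // 0,1,2,...,7 -> consecutive threads hit different rows of A
console.log(rowsTouched("y").slice(0, 8)); // 0,0,0,...,0 -> consecutive threads reuse the same row of A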

Another thing to consider is how much work each thread does.

Up to now, each thread computes only one entry. But there is some overhead to launching each workgroup versus computing more than one element per thread!

If calculating more elements per thread is faster than the overhead of launching extra workgroups, we should see a big speedup.

To do so, we compute 4 results per thread (i.e. a 1x4 tile).

const BLOCKSIZE: u32 = 16;
const TILESIZE: u32 = 4;
@compute @workgroup_size(BLOCKSIZE, BLOCKSIZE)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let row = global_id.y;
    let col = global_id.x * TILESIZE;

    if (row >= dimensions.M || col >= dimensions.N) {
        return;
    }

    var sum00: f32 = 0.0;
    var sum01: f32 = 0.0;
    var sum02: f32 = 0.0;
    var sum03: f32 = 0.0;

    for (var i: u32 = 0u; i < dimensions.K; i = i + 1u) {
        let a_elem = a[row * dimensions.K + i];
        sum00 = sum00 + a_elem * b[i * dimensions.N + col];
        sum01 = sum01 + a_elem * b[i * dimensions.N + col + 1u];
        sum02 = sum02 + a_elem * b[i * dimensions.N + col + 2u];
        sum03 = sum03 + a_elem * b[i * dimensions.N + col + 3u];
    }

    result[row * dimensions.N + col] = sum00;
    result[row * dimensions.N + col + 1u] = sum01;
    result[row * dimensions.N + col + 2u] = sum02;
    result[row * dimensions.N + col + 3u] = sum03;
}

The kernel looks roughly the same as before, except we've unrolled the computation and are calculating TILESIZE results per thread.
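
Because each thread now covers TILESIZE columns, the dispatch geometry has to shrink accordingly. Here's a sketch of what the host side might look like under that assumption (not necessarily Surfgrad's exact code):

const BLOCKSIZE = 16;
const TILESIZE = 4;

// Each workgroup covers BLOCKSIZE rows and BLOCKSIZE * TILESIZE columns of the output.
function dispatchRowTiled(pass: GPUComputePassEncoder, M: number, N: number) {
  pass.dispatchWorkgroups(
    Math.ceil(N / (BLOCKSIZE * TILESIZE)), // x: columns, TILESIZE handled per thread
    Math.ceil(M / BLOCKSIZE),              // y: rows, one handled per thread
  );
}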

We can take this a step further and compute a 2D tile of results per thread! Instead of calculating 4 elements in a single row, we can calculate 4 elements across 4 rows (i.e. a 2D tile).

const BLOCKSIZE: u32 = 16;
const TILE_M: u32 = 4;  // Tile size in the M dimension
const TILE_N: u32 = 4;  // Tile size in the N dimension

@compute @workgroup_size(BLOCKSIZE, BLOCKSIZE)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let row = global_id.y * TILE_M;
    let col = global_id.x * TILE_N;

    // initialize the accumulator array with all 0s
    var sums: array<array<f32, TILE_N>, TILE_M>;
    for (var i = 0u; i < TILE_M; i++) {
        for (var j = 0u; j < TILE_N; j++) {
            sums[i][j] = 0.0;
        }
    }

    // Compute the 2D tile
    for (var k = 0u; k < dimensions.K; k++) {
        // for each row
        for (var i = 0u; i < TILE_M; i++) {
            let a_element = a[(row + i) * dimensions.K + k];
            // compute the dot product
            for (var j = 0u; j < TILE_N; j++) {
                let b_element = b[k * dimensions.N + (col + j)];
                sums[i][j] += a_element * b_element;
            }
        }
    }

    // Write results
    for (var i = 0u; i < TILE_M; i++) {
        for (var j = 0u; j < TILE_N; j++) {
            let output_row = row + i;
            let output_col = col + j;
            if (output_row < dimensions.M && output_col < dimensions.N) {
                result[output_row * dimensions.N + output_col] = sums[i][j];
            }
        }
    }
}

Each thread now computes a 4x4 tile of the output matrix, and we see a slight improvement over the last kernel.

Surprisingly, 2D tiling is quite slow. Why haven't we amortized the time it takes to launch workgroups by doing more work? And why are we slower than doing one item of work per thread?

To answer the last question, we need to dig into the compiled WebGPU kernels.

Some compilers will automatically unroll loops if the bounds of the loop are known at compile time. However, we've been writing a generic kernel for variably shaped inputs!

Also, when writing WGSL, we don't have any control over the compiler's directives.

Looking at the bitcode compiled for Metal, we can see that the generated instructions still include the for loop!

%51 = phi i32 [ 0, %41 ], [ %61, %50 ]
%52 = add i32 %37, %51
%53 = zext i32 %52 to i64
%54 = getelementptr inbounds [1 x float], ptr addrspace(1) %3, i64 0, i64 %53
%55 = load float, ptr addrspace(1) %54, align 4, !tbaa !27, !alias.scope !43, !noalias !44
%56 = zext i32 %51 to i64
%57 = getelementptr inbounds %struct.type_5, ptr %7, i64 0, i32 0, i64 %49, i32 0, i64 %56
%58 = load float, ptr %57, align 4, !tbaa !27
%59 = fmul fast float %55, %48
%60 = fadd fast float %58, %59
store float %60, ptr %57, align 4, !tbaa !27
%61 = add nuw nsw i32 %51, 1
%62 = icmp eq i32 %61, 4
br i1 %62, label %38, label %50 ; branching for loop

Whereas the unrolled WGSL code gets compiled to

...
%141 = fmul fast float %112, %103
%142 = fadd fast float %141, %82
%143 = fmul fast float %116, %103
%144 = fadd fast float %143, %81
%145 = fmul fast float %120, %103
%146 = fadd fast float %145, %80
%147 = fmul fast float %124, %103
%148 = fadd fast float %147, %79
%149 = fmul fast float %112, %107
%150 = fadd fast float %149, %78
%151 = fmul fast float %116, %107
%152 = fadd fast float %151, %77
%153 = fmul fast float %120, %107
%154 = fadd fast float %153, %76
%155 = fmul fast float %124, %107
%156 = fadd fast float %155, %75
%157 = add nuw i32 %91, 1
%158 = icmp eq i32 %157, %27
br i1 %158, label %159, label %74

Because of the manual unrolling, the GPU is able to reduce overhead by not having to initialize and increment the inner loop, take advantage of instruction-level parallelism, and amortize the cost of launching fewer workgroups by doing more work per thread. With the loop in place, the kernel (#4) wasn't able to take advantage of these optimizations and was slower than just launching more workgroups (#3).
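
For illustration, here is a sketch of what manually unrolling the inner TILE_N loop of the 2D-tiled kernel could look like, embedded as a WGSL string in TypeScript. This is my reconstruction of the idea, not necessarily the exact kernel Surfgrad ships:

// Replaces the innermost `for j` loop of the 2D-tiled kernel with straight-line code.
const unrolledTileBody = /* wgsl */ `
    for (var k = 0u; k < dimensions.K; k++) {
        for (var i = 0u; i < TILE_M; i++) {
            let a_element = a[(row + i) * dimensions.K + k];
            // TILE_N = 4, written out by hand instead of looping:
            sums[i][0] += a_element * b[k * dimensions.N + col];
            sums[i][1] += a_element * b[k * dimensions.N + col + 1u];
            sums[i][2] += a_element * b[k * dimensions.N + col + 2u];
            sums[i][3] += a_element * b[k * dimensions.N + col + 3u];
        }
    }
`;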

And if we make our tile 8x8, we get a 3x boost over the 4x4 loop and surpass 1 TFLOP!

Through these efforts, we were able to write a performant matmul kernel that is 1000x faster than the naive kernel and approaches the Apple M2 Pro's theoretical peak.

And with ongoing updates to WebGPU, there are still optimizations to be made! For example, we didn't take advantage of subgroups, a feature that is new as of Chrome 125 and should allow for faster memory access and sharing across a subgroup to reduce repeated computations.

And a big thank you to Abhishaike Mahajan (who writes an incredible blog) and Elman Mansimov for feedback and encouragement to write this article!
