almostgotcaught
> Furthermore, performing custom ISA optimizations makes these changes RDNA3-specific

this is overblown at least wrt forward compatibility - all of the instructions used are in RDNA4 and most of them are even in CDNA3 (CDNA4 isn't public yet?) and the ones that aren't exactly there are only slightly renamed (ds_load -> ds_read). Sure it's annoying but it's not the end of the world to have some `#ifdef`s in your code (that's not very much different from what the compiler itself is going to do anyway).
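To put a size on "some #ifdefs", here's a minimal sketch of the RDNA3-vs-CDNA3 case mentioned above. The per-target macros are the ones the HIP/clang toolchain defines for those GPUs; the exact inline-asm constraints and the waitcnt are assumptions for illustration, not code from the article.

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

// Pick the LDS-load mnemonic per target: RDNA3 (gfx1100-gfx1102) calls it
// ds_load_b32, CDNA/GCN targets call it ds_read_b32.
#if defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__)
  #define LDS_LOAD_B32 "ds_load_b32"   // RDNA3 naming
#else
  #define LDS_LOAD_B32 "ds_read_b32"   // CDNA/GCN naming
#endif

// Load one dword from LDS at a byte offset held in a VGPR, then wait on it.
__device__ inline float lds_load_dword(uint32_t lds_byte_offset) {
  float value;
  asm volatile(LDS_LOAD_B32 " %0, %1\n\t"
               "s_waitcnt lgkmcnt(0)"
               : "=v"(value)
               : "v"(lds_byte_offset));
  return value;
}
```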

imtringued
You're making the assumption that every kernel developer has enough AMD GPUs from different eras that they can test their ifdefs on all the possible ISAs.
tgtweak
CUDA has similar inefficiencies, and many use cases can get comparable uplifts by going lower level with the code.

I think this is what DeepSeek did to get their speedups on older hardware.

Even way back in the days of GPU crypto mining, hand-built custom kernels (mostly just unrolling loops) would yield 20% improvements over just running OpenCL and letting the drivers compile it down.
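For anyone who never saw those kernels, the "unrolling" was rarely fancier than the following sketch; the function name and the factor of 4 are purely illustrative, not from any particular miner.

```cpp
// Hand-unrolled accumulation: four independent accumulators keep more work
// in flight per iteration than whatever the OpenCL driver decided to emit.
// Assumes n is a multiple of 4.
float sum_unrolled(const float* data, int n) {
  float a0 = 0.f, a1 = 0.f, a2 = 0.f, a3 = 0.f;
  for (int i = 0; i < n; i += 4) {
    a0 += data[i];
    a1 += data[i + 1];
    a2 += data[i + 2];
    a3 += data[i + 3];
  }
  return (a0 + a1) + (a2 + a3);
}
```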

touisteur
People have been trying to bypass CUDA and even PTX for a long time. One long rundown of optimizing gemm on NVIDIA hardware (https://salykova.github.io/sgemm-gpu) mentions 'maxas' (https://github.com/NervanaSystems/maxas/wiki/Introduction) - which was really a step forward in this space. I still blame Intel (buying NervanaSystems) for killing it...
almostgotcaught
> People have been trying to bypass CUDA and even PTX for a long time

i swear it's so funny when people talk about this stuff like it's all weird/surprising. y'all realize that there are hundreds (thousands?) of engineers across FAANG whose full time job is optimizing CUDA/ROCm/whatever code for their team/org/company's specific workloads? like do y'all think that serious shops really just go with whatever the vendor gives you? ie none of this is in the least surprising - it's completely expected that whatever the vendor designs generically for the entire market segment will fail to achieve peak perf for your use case.

>it's completely expected that whatever the vendor designs generically for the entire market segment will fail to achieve peak perf for your use case.

When Carmack left Meta, I believe he claimed they were only getting around 20% utilization on their even-then-enormous GPU fleet. So I could see them also leaving a lot of perf headroom on the table.

touisteur
Not saying it's surprising. My day job is doing exactly this, not in any FAANG.

Working on a platform that hides so many low-level details is a challenge, and the fact that people have to go to such lengths to get access to them is noteworthy. 'maxas' was noteworthy, and something like it would have been unneeded on many (most?) other platforms.

Not saying Intel stuff or Arm stuff is 'easier', but at least you get access to, and tooling for, the actual low-level asm.

almostgotcaught
> and the fact people have to go to such length to get access to it is noteworthy

I'll repeat myself: no it's not. There's nothing noteworthy about it at all. In fact I literally cannot fathom why anyone ever expects or expected otherwise. Is it because of the oft-repeated notion of "abstraction"? I guess I must be the sole programmer who has always known/understood, even from the first intro class, that abstractions are just assumptions, and when those assumptions don't hold I will need to remove the abstraction.

SavageNoble
This is really cool. 60% is no joke and as a 7900XTX owner I would love the performance boost.

Well done!

delusional
I find it quite interesting that, while vector instructions are present, every other sort of "hardware-level grouping" (wave, SIMD, thread) is hidden from the programmer. Why would vector instructions be the only thing the programmer ought to care about?

I wonder if there's untapped potential in a GPU language which made all of those implicit classes explicit in code, now that we've sort of stabilized on them. It wouldn't allow you to do anything that you can't already do with clever optimizations and a profiler, but it could have the potential to make the optimizations clearer.

In general I'm very curious as to why we don't have any new languages that are better aligned with current hardware. For some reason we collectively decided that it was more fun to make everything general, which is especially unfortunate considering the real world has become increasingly homogeneous. Compiling to some intermediate language makes no sense when you're only ever going to run on x86 anyway.
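For a sense of what stays implicit today: the wave only shows up through intrinsics, and nothing in the type system marks a value as per-wave or per-SIMD. A HIP-style sketch, assuming wave32 (the RDNA3 default; an assumption here, not something from the post):

```cpp
#include <hip/hip_runtime.h>

// The wave is only visible through intrinsics like __shfl_xor; there is no
// type or scope in the language that says "this lives once per wave" or
// "this loop is one SIMD issue". Butterfly sum across a 32-wide wave:
__device__ float wave_sum(float v) {
  for (int lane_mask = 16; lane_mask > 0; lane_mask >>= 1)
    v += __shfl_xor(v, lane_mask, 32);
  return v;  // every lane now holds the wave-wide sum
}
```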

randomNumber7
Is the author a genius, or does AMD have questionable software?
kimixa
Many of the optimizations here rely heavily on the size of the matrix and its relationship to hardware-specific details, like LDS size, how it's banked, and register count.

It's probably not surprising that you can grind out a decent improvement over a general solution, and many of the improvements shown here will need to be re-balanced, or will simply not work, for kernels working on different matrix layouts. Similarly for trying to work on different hardware: even within the same architecture and generation, these sorts of details often change.
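To make that concrete, the budget arithmetic a tiled GEMM is built around looks roughly like this; the 64 KiB per-workgroup LDS figure and the tile shape are illustrative assumptions, not the article's numbers.

```cpp
#include <cstddef>

// Illustrative LDS budget for a tiled FP32 GEMM.
constexpr int kLdsBytes = 64 * 1024;          // assumed LDS available per workgroup
constexpr int kBM = 128, kBN = 128, kBK = 16; // assumed tile shape

// One A tile (BM x BK) plus one B tile (BK x BN), single-buffered:
constexpr int kTileBytes = (kBM * kBK + kBK * kBN) * sizeof(float); // 16 KiB
// Double-buffering (so loads overlap math) doubles that:
constexpr int kDoubleBufBytes = 2 * kTileBytes;                     // 32 KiB
static_assert(kDoubleBufBytes <= kLdsBytes, "tile shape must fit the LDS budget");

// Change BM/BN/BK, the LDS size, the banking, or the register budget, and the
// occupancy-vs-reuse trade-off has to be re-derived.
```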

And all that required going down to the ISA level, which is a lot harder (and certainly less documented) for Nvidia. For example, the "inspiration" post linked below [0], which didn't beat cuBLAS on CUDA, also didn't try modifying the SASS directly, so there might be similar gains left unrealized there.

[0] https://siboehm.com/articles/22/CUDA-MMM

almostgotcaught
> like LDS size, how they're banked and register count.

but you're acting like they pick these numbers using a random number generator for each generation when it's just reasonable/rational stuff like "here's 2x more LDS or more registers for free because the new process node is 2x smaller". like you must realize that they're not throwing everything away and starting completely from scratch for every new gen right? incidentally, while LDS will grow and # of registers will grow, there's absolutely no way they'd change the banking - e.g., CUDA hasn't changed it since 2.0.
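for reference, the banking arithmetic in question, under the usual 32-banks-of-4-bytes layout (a long-standing convention on both vendors; treat the constants as an assumption rather than a spec quote):

```cpp
// Which bank a 4-byte LDS/shared-memory access lands in, assuming 32 banks
// of 4 bytes. Two lanes of the same wave hitting the same bank at different
// addresses serialize; that's the conflict that padding a tile's leading
// dimension by one element is meant to break up.
constexpr int kNumBanks = 32;
constexpr int kBankBytes = 4;

constexpr unsigned bank_of(unsigned byte_addr) {
  return (byte_addr / kBankBytes) % kNumBanks;
}

// A 128-float-wide row-major tile: column 0 of every row maps to bank 0
// (128 % 32 == 0), so a column read conflicts; padding each row to 129
// floats spreads consecutive rows across banks.
static_assert(bank_of(0 * 128 * 4) == bank_of(1 * 128 * 4), "unpadded rows collide");
static_assert(bank_of(0 * 129 * 4) != bank_of(1 * 129 * 4), "padding breaks the collision");
```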

kimixa
No, but it's not obviously clear that other kernel sizes will hit the same bottlenecks seen in the post. It's not really shown one way or the other: are the ROCm kernels just inefficient, or did the author simply identify one that wasn't particularly well optimized? And do these opportunities for improvement really mean that the software is "questionable", or just that you cannot do an equivalent comparison at the ISA level on other vendors' software stacks?

I'm not trying to minimize the work here. It's interesting and a good example of the lengths you can go to in order to squeeze that last little bit of performance out (and again, it shows the advantages of public ISA documentation and support for users working at that level). I just took issue with the parent comment seeming to use this work as evidence of a poor baseline.

roenxi
ROCm multiplies in 4.5ms and the author multiplies in 2.8ms. The naive algorithm is 136ms. I don't think anyone at AMD is losing sleep over this; for a general-purpose library this isn't horrible performance. It could be better; hand-optimising for specific conditions often is. But as this blog post shows, optimising kernels is the sort of thing that people can do for fun and post blogs about if they care. They don't need AMD to be involved.
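For scale, the back-of-the-envelope arithmetic on those timings (the ~61 TFLOP/s dual-issue FP32 peak commonly quoted for the 7900 XTX is an assumption here, not a number from the post):

```cpp
#include <cstdio>

int main() {
  // A 4096^3 FP32 GEMM does 2*N^3 flops (one multiply and one add per term).
  const double flops = 2.0 * 4096.0 * 4096.0 * 4096.0;  // ~1.37e11

  printf("naive   136 ms -> %4.1f TFLOP/s\n", flops / 136e-3 / 1e12);  // ~1.0
  printf("rocBLAS 4.5 ms -> %4.1f TFLOP/s\n", flops / 4.5e-3 / 1e12);  // ~30.5
  printf("blog    2.8 ms -> %4.1f TFLOP/s\n", flops / 2.8e-3 / 1e12);  // ~49.1
  // Against a ~61 TFLOP/s dual-issue peak, that's roughly 50% vs 80% of
  // the card's theoretical FP32 rate.
  return 0;
}
```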

The problem with ROCm isn't that it only half-utilises the hardware; the problem was that someone trying to write this blog post in 2020 would quite probably have had a heading somewhere around implementing Kernel 0 about how the software crashed or the kernel panicked when they tried to run the benchmarks. That's what happened to me when I tried a conceptually similar exercise. I was wandering around HN posting comments about how there were no articles like this one to be found for AMD hardware and musing about whether it was even technically possible.

This makes me wish I'd bought an RDNA3 card instead of a Nvidia one for my last purchase. Not that I really regret the choice; AMD are going to have to show that they're interested in supporting consumer cards for a little longer to get me to trust them again, although they're on the right path.

saagarjha
AMD isn’t losing sleep over the fact that J. Random Blogger is beating their GEMM by 60% on 4096x4096? What universe are you living in? This company is fighting for its life against CUDA, and you’re telling me that their software stack being so bad it can’t use a third of the hardware on the first and literally only thing people want it to do is somehow not a problem?
roenxi
The point of a platform is for software engineers to provide key functionality independently. Your issue here is you don't understand why CUDA has been so dominant over the last decade - a ~50% software performance gap isn't that material when hardware capacity doubles every generation. If we've reached the point where J. Random Blogger can solve their own problems then the CUDA moat has quite possibly been broken.

If AMD was only 1 hardware generation behind Nvidia they'd be pretty competitive. People are happy using CPUs with a gap of several generations from the cutting edge. And it isn't even that bad because anyone who particularly cares can optimise their software and avoid using rocBLAS.

MITSardine
Though people may use CPUs that are several years old, those CPUs generally weren't old at the moment they were bought, and the purchase decision came from comparing with the competition. The argument "my software will be faster when computers are faster" does not hold, given that the competition also benefits from Moore's law. Nothing changes in relative terms, which is what matters, until you actually improve your slow software.

And while the possibility of improving the software on the user's end may exist, don't people base their decisions on benchmarks of existing (not potential) software? They find comparisons using the provided kernels, find AMD to be slower, unaware that there might (maybe, at that) be a 30% speedup to be had. Even if they stumbled on this article, would they trust themselves to pull it off, or simply go with the GPU that has the best performance with existing libraries?

These are machines sold for crunching numbers, they might as well crunch numbers as best they can...

roenxi
At the risk of repeating myself, you're not anywhere close to grappling with how bad the situation has been on AMD cards. If they could consistently half-saturate the hardware, they'd have a place in the AI revolution instead of being left out in the cold. The traditional achievement of an AMD card, in practice, is 0% hardware saturation, because when people tried to multiply matrices there was a good chance the system would crash.

The type of commercial logic you're talking about isn't the important factor in the real world. 50% saturation with the option to fully saturate is amazing by AMD's standards, and they have much bigger problems than this affecting people's buying decisions. If they had been able to achieve this standard in 2020, I would still be buying AMD.

latchkey
Follow Anush on Twitter and give him feedback. He's actively listening.

https://x.com/AnushElangovan

imtringued
Considering that the biggest difference between the kernels is the lack of dual-issue instructions (an AMD-specific innovation), I'd bet on the latter.
nyanpasu64
Is it worth implementing sub-cubic matrix multiplication algorithms like Strassen etc. for 4096x4096?
spookie
Depends on your case, but yes, even for smaller matrices.
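The arithmetic both answers are weighing: one level of classical Strassen trades 8 half-size multiplications for 7, plus 18 extra half-size additions, so the multiply-flop savings per recursion level are modest. A quick sketch of the numbers:

```cpp
#include <cstdio>
#include <cmath>

int main() {
  // Each Strassen level keeps 7/8 of the classical multiply flops but adds
  // bandwidth-bound matrix additions and shrinks the tiles that matrix-core
  // style units prefer to work on.
  for (int levels = 1; levels <= 3; ++levels) {
    double ratio = std::pow(7.0 / 8.0, levels);
    printf("%d level(s): %2.0f%% of the classical multiply flops\n",
           levels, ratio * 100.0);  // 88%, 77%, 67%
  }
  return 0;
}
```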
saagarjha
I don't think anyone really does this, at least on the GPU.
1W6MIC49CYX9GAP
No
