i swear it's so funny when people talk about this stuff like it's all weird/surprising. y'all realize that there are hundreds (thousands?) of engineers across FAANG whose full-time job is optimizing CUDA/ROCm/whatever code for their team/org/company's specific workloads? like do y'all think that serious shops really just go with whatever the vendor gives them? i.e. none of this is in the least surprising - it's completely expected that whatever the vendor designs generically for the entire market segment will fail to achieve peak perf for your use case.
When Carmack left Meta, I believe he claimed they were only getting around 20% utilization on their even-then-enormous GPU fleet. So I could see Meta also leaving a lot of perf headroom on the table.
Working on a platform that hides so many low-level details is a challenge, and the fact that people have to go to such lengths to get access to them is noteworthy. 'maxas' was noteworthy, and a tool like it is unneeded on many (most?) other platforms.
Not saying Intel stuff or Arm stuff is 'easier', but at least you get access to the actual low-level asm and the tooling to work on it.
I'll repeat myself: no it's not. There's nothing noteworthy about it at all. In fact I literally cannot fathom why anyone ever expects or expected otherwise. Is it because of the oft-repeated notion of "abstraction"? I guess I must be the sole programmer who has always known/understood, even from the first intro class, that abstractions are just assumptions, and that when those assumptions don't hold I will need to remove the abstraction.