Jeff Bolz 1946e46f4c vulkan: For coopmat2 FA, use fp16 accumulators for the final result (#19376)
The CPU and CUDA backends use fp16 for the VKQ accumulator type; this change
does the same for Vulkan. This helps particularly with large head sizes, which
are very register-limited, because an fp16 accumulator needs half the register
space of an fp32 one.

I tried this for the coopmat1 path and it was slightly slower. I didn't try it
for the scalar path.

I also applied the softmax bias that the CUDA backend uses to avoid fp16
overflow, although I was not able to reproduce the original bug without it.
2026-02-06 09:15:13 +01:00
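The overflow concern can be illustrated with a minimal online-softmax sketch for a single query row. It is not the Vulkan shader or the CUDA kernel: it assumes the softmax bias is a constant `offset` subtracted from every exponent (the parameter name and the value ln 8 used below are hypothetical), keeps the running row sum `l` in full precision, and simulates an fp16 VKQ accumulator by rounding through Python's `struct` half-precision `'e'` format, which raises `OverflowError` when a value exceeds the fp16 range.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision (simulates an fp16 register).

    struct's 'e' format raises OverflowError if x is outside the fp16 range.
    """
    return struct.unpack("e", struct.pack("e", x))[0]


def attend_row(scores, values, offset):
    """Online-softmax attention for one query row with an fp16 VKQ accumulator.

    scores: one KQ score per key; values: one V vector per key.
    offset: hypothetical softmax bias subtracted from every exponent.
    """
    m = -math.inf                  # running max of the scores seen so far
    l = 0.0                        # running sum of softmax weights (full precision)
    acc = [0.0] * len(values[0])   # VKQ accumulator, rounded to fp16 each step
    for s, v in zip(scores, values):
        m_new = max(m, s)
        alpha = math.exp(m - m_new)          # rescales earlier terms; 0.0 on the first key
        p = math.exp(s - m_new - offset)     # biased weight, at most exp(-offset)
        l = l * alpha + p
        acc = [to_fp16(a * alpha + p * vi) for a, vi in zip(acc, v)]
        m = m_new
    # exp(-offset) scales both acc and l, so it cancels in this division.
    return [a / l for a in acc]
```

Because the same factor exp(-offset) scales both the accumulator and `l`, the bias does not change the normalized result; it only lowers the peak magnitude the fp16 accumulator has to hold. For example, `attend_row([0.0] * 2, [[32768.0]] * 2, 0.0)` overflows on the second key, while passing `offset=math.log(8.0)` keeps the partial sums at 4096 and 8192 and returns a value close to 32768.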