Aman Gupta
c0bfc57af4
CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32 ( #16277 )
* CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32
This commit adds mul_mat_id support for ncols_dst >= 16 by packing the
ncols_dst tiles into blockDim.y.
My tests on an RTX 3090 show that this is faster than the cuBLAS fallback
for f16 up to bs=64 and for f32 up to bs=32.
* Review: refactor if statement
2025-09-27 18:49:32 +02:00
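
The batch-size cutoffs quoted in the commit message can be read as a simple dispatch rule. The following is a minimal host-side sketch of that rule, not the code from the commit: the names `WeightType`, `max_mmf_batch_size`, and `prefer_mmf_for_mul_mat_id` are hypothetical, and the thresholds are just the RTX 3090 figures reported above, which may differ on other GPUs.

```cpp
#include <cstdint>

// Hypothetical enum for the weight element type; illustrative only,
// not the type system used by the actual backend.
enum class WeightType { F16, F32 };

// Largest batch size (bs) at which the commit message reports mmf being
// faster than the cuBLAS fallback on an RTX 3090: 64 for f16, 32 for f32.
constexpr int64_t max_mmf_batch_size(WeightType type) {
    return type == WeightType::F16 ? 64 : 32;
}

// Assumed dispatch rule: prefer the mmf path for mul_mat_id while the
// batch size stays within the reported cutoff, otherwise fall back.
constexpr bool prefer_mmf_for_mul_mat_id(WeightType type, int64_t batch_size) {
    return batch_size >= 1 && batch_size <= max_mmf_batch_size(type);
}
```

Under these assumptions, `prefer_mmf_for_mul_mat_id(WeightType::F16, 64)` is true, while the same batch size with `WeightType::F32` is false and would take the cuBLAS fallback.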