llama.cpp/ggml/src/ggml-cuda
Gaurav Garg 517b5ddbf0 CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)
- Determine the number of active blocks per SM using the cudaOccupancyMaxActiveBlocksPerMultiprocessor API, and use this value to choose the optimal parallel_blocks value.
- Prefer the vector flash attention kernels over the MMA kernel for BS=1

Fixes Issue: #12182
---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-19 20:52:06 +01:00