- Find out active blocks per SM using the cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value.
- Prefer vector flash attention kernels over the MMA kernel for BS=1.

Fixes Issue: #12182

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
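Below is a minimal sketch of how an occupancy-based heuristic like this can work, assuming a hypothetical helper `pick_parallel_blocks` and a placeholder `flash_attn_vec_kernel`; the actual llama.cpp CUDA flash attention code differs in structure and in how the workload is split.

```cuda
// Hypothetical sketch (not the llama.cpp implementation): query how many
// blocks of a given kernel fit on one SM, then grow parallel_blocks until
// the launch saturates the whole GPU.
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder standing in for a real vector flash attention kernel.
__global__ void flash_attn_vec_kernel() {}

// work_blocks: number of blocks the problem launches before splitting
// (e.g. heads * sequence tiles); assumed > 0 for this sketch.
static int pick_parallel_blocks(int block_size, size_t smem_bytes, int work_blocks) {
    int device = 0;
    cudaGetDevice(&device);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, flash_attn_vec_kernel, block_size, smem_bytes);

    // Total number of blocks the GPU can keep resident at once.
    const int max_resident = blocks_per_sm * prop.multiProcessorCount;

    // Double the split factor while the launch still fits within the
    // occupancy limit, so small-batch launches fill every SM.
    int parallel_blocks = 1;
    while (work_blocks * (parallel_blocks * 2) <= max_resident) {
        parallel_blocks *= 2;
    }
    return parallel_blocks;
}

int main() {
    // Example: 256-thread blocks, no dynamic shared memory, 8 work blocks.
    printf("parallel_blocks = %d\n", pick_parallel_blocks(256, 0, 8));
    return 0;
}
```

Sizing `parallel_blocks` from measured occupancy rather than a fixed constant matters most in the small-batch case: at BS=1 a flash attention launch typically produces far fewer blocks than the GPU has SMs, so splitting the work further is what keeps the device busy.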