llama.cpp/ggml-cuda.cu at b41b4cad6f956b5f501db0711dd7007c32b5eee5

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-11-11 10:36:54 +00:00

Kawrakow 3d01122610 CUDA : faster k-quant dot kernels (#1862)

* cuda : faster k-quant dot kernels

* Improve Q2_K dot kernel on older GPUs

We now have a K_QUANTS_PER_ITERATION macro, which should be
set to 1 on older and to 2 on newer GPUs.
With this, we preserve the performance of the original
PR on RTX-4080, and are faster compared to master on
GTX-1660.

* Improve Q6_K dot kernel on older GPUs

Using the same K_QUANTS_PER_ITERATION macro as last commit,
we preserve performance on RTX-4080 and speed up
Q6_K on a GTX-1660.

* Add LLAMA_CUDA_KQUANTS_ITER to CMakeLists.txt and Makefile

Allowed values are 1 or 2. 2 gives the best performance on
modern GPUs and is set as default. On older GPUs 1 may work
better.
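
A minimal host-side sketch of how a compile-time setting like this might be consumed. It assumes the build option is forwarded to the compiler as `-DK_QUANTS_PER_ITERATION=<1|2>`. The function `dot_per_iteration` is a hypothetical stand-in for illustration only; it is not one of the CUDA kernels in this file.

```cpp
#include <cstddef>
#include <vector>

// Assumption: the build system passes -DK_QUANTS_PER_ITERATION=<1|2>.
// If it does not, fall back to 2, the default named in the commit message.
#ifndef K_QUANTS_PER_ITERATION
#define K_QUANTS_PER_ITERATION 2
#endif

// Reject any value other than the two allowed ones at compile time.
static_assert(K_QUANTS_PER_ITERATION == 1 || K_QUANTS_PER_ITERATION == 2,
              "K_QUANTS_PER_ITERATION must be 1 or 2");

// Hypothetical illustration: a dot product that consumes
// K_QUANTS_PER_ITERATION elements per loop iteration.
// The result is the same for either setting; only the stride changes.
inline float dot_per_iteration(const std::vector<float>& x,
                               const std::vector<float>& y) {
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();
    float sum = 0.0f;
    std::size_t i = 0;
    // Main loop: one full group of K_QUANTS_PER_ITERATION elements per step.
    for (; i + K_QUANTS_PER_ITERATION <= n; i += K_QUANTS_PER_ITERATION) {
        for (std::size_t k = 0; k < K_QUANTS_PER_ITERATION; ++k) {
            sum += x[i + k] * y[i + k];
        }
    }
    // Tail: leftover elements when n is not a multiple of the stride.
    for (; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Compiling the sketch with `-DK_QUANTS_PER_ITERATION=1` or `-DK_QUANTS_PER_ITERATION=2` selects the stride. Any other value fails at the `static_assert` and never reaches runtime.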

* PR comments

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2023-06-16 20:08:44 +03:00

98 KiB
