llama.cpp/ggml-cuda.h at dc271c52ed65e7c8dfcbaaf84dabb1f788e4f3d0

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-10-31 08:51:55 +00:00

Johannes Gäßler 905d87b70a ggml : GPU-accelerated token generation (#1412)

* CUDA kernel for q4_0 dequant. + mat. vec. mult.

* Added q4_1 via template

* Added missing __syncthreads();

* --gpu_layers -> --gpu-layers

* Shorter dequantize_mul_mat_vec line

* q5_0 dequantize_mul_mat kernel

* More readable dequantize_mul_mat_vec logic

* dequantize_mul_mat_vec kernels for q5_1, q8_0, f16

* llama : offload "output" tensor to GPU too + coding style fixes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2023-05-13 16:38:36 +03:00

701 B
