llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-11-19 11:57:07 +00:00

Author	SHA1	Message	Date
Johannes Gäßler	b69f1647f9	CUDA: skip fully masked-out KV in FA vec kernel (#13584) * CUDA: skip fully masked-out KV in FA vec kernel	2025-05-20 14:45:07 +02:00
Johannes Gäßler	0208355f42	CUDA: fix race conditions FlashAttention kernels (#13438)	2025-05-10 22:22:48 +02:00
Johannes Gäßler	0cf6725e9f	CUDA: FA support for Deepseek (Ampere or newer) (#13306) * CUDA: FA support for Deepseek (Ampere or newer) * do loop unrolling via C++ template	2025-05-09 13:34:58 +02:00
R0CKSTAR	492d7f1ff7	musa: fix all warnings, re-enable `-DLLAMA_FATAL_WARNINGS=ON` in ci and update doc (#12611) * musa: fix all warnings Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: enable -DLLAMA_FATAL_WARNINGS=ON in run.sh Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: update ci doc (install ccache) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * fix Windows build issue Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-03-30 10:59:38 +02:00
Gaurav Garg	517b5ddbf0	CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183) - Find out active blocks per SM using cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value. - Prefer vector flash attention kernels over MMA kernel for BS=1 Fixes Issue: #12182 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-03-19 20:52:06 +01:00
Johannes Gäßler	a28e0d5eb1	CUDA: app option to compile without FlashAttention (#12025)	2025-02-22 20:44:34 +01:00
Johannes Gäßler	5fa07c2f93	CUDA: optimize FA for GQA + large batches (#12014)	2025-02-22 12:20:17 +01:00
Johannes Gäßler	864a0b67a6	CUDA: use mma PTX instructions for FlashAttention (#11583) * CUDA: use mma PTX instructions for FlashAttention * __shfl_sync workaround for movmatrix * add __shfl_sync to HIP Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-02-02 19:31:09 +01:00
mahorozte	e9e661bd59	CUDA: remove unnecessary warp reduce in FA (ggml/1032) * kqmax_new_j in every thread within warp is same after operate at line 199,this reduce can be omit * same problem in vec32 --------- Co-authored-by: ZhaoXiaoYu <zhao.xiaoyu@zte.com.cn>	2024-12-03 20:04:49 +02:00
Diego Devesa	ae8de6d50a	ggml : build backends as libraries (#10256) * ggml : build backends as libraries --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>	2024-11-14 18:04:35 +01:00
Johannes Gäßler	fabdc3bda3	ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980)	2024-10-03 21:17:26 +03:00
Johannes Gäßler	e11bd856d5	CPU/CUDA: Gemma 2 FlashAttention support (#8542) * CPU/CUDA: Gemma 2 FlashAttention support * apply logit_softcap to scale in kernel * disable logit softcapping tests on Metal * remove metal check	2024-08-24 21:34:59 +02:00
Georgi Gerganov	f3f65429c4	llama : reorganize source code + improve CMake (#8006) * scripts : update sync [no ci] * files : relocate [no ci] * ci : disable kompute build [no ci] * cmake : fixes [no ci] * server : fix mingw build ggml-ci * cmake : minor [no ci] * cmake : link math library [no ci] * cmake : build normal ggml library (not object library) [no ci] * cmake : fix kompute build ggml-ci * make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE ggml-ci * move public backend headers to the public include directory (#8122) * move public backend headers to the public include directory * nix test * spm : fix metal header --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * scripts : fix sync paths [no ci] * scripts : sync ggml-blas.h [no ci] --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-26 18:33:02 +03:00

13 Commits