llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-11-14 11:07:10 +00:00

Author	SHA1	Message	Date
Aman Gupta	f77c13b91f	CUDA: General GEMV fusion (#16715 )	2025-10-26 19:28:04 +08:00
R0CKSTAR	0f7c69689f	musa: fix build warnings (#15611 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-09-26 02:56:10 +02:00
Johannes Gäßler	5143fa895e	CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (#15802 ) * CUDA: fastdiv, launch bounds for mmvq + q8_1 quant	2025-09-05 16:07:02 +02:00
R0CKSTAR	8ad038c0fd	musa: add GGML_UNUSED_VARS (#15446 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-08-21 11:06:05 +08:00
Georgi Gerganov	fd1234cb46	llama : add gpt-oss (#15091 ) * oai moe * compat with new checkpoint * add attn sink impl * add rope scaling yarn * logits match with latest transformers code * wip chat template * rm trailing space * use ggml_scale_bias * rm redundant is_swa_all * convert interleaved gate_up * graph : fix activation function to match reference (#7) * vocab : handle o200k_harmony special tokens * ggml : add attention sinks support (#1) * llama : add attn sinks * ggml : add attn sinks * cuda : add attn sinks * vulkan : add support for sinks in softmax remove unnecessary return * ggml : add fused swiglu_oai op (#11) * ggml : add fused swiglu_oai op * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update CUDA impl * cont : metal impl * add vulkan impl * test-backend-ops : more test cases, clean up * llama : remove unfused impl * remove extra lines --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> * repack mxfp4 upon conversion * clean up a bit * enable thinking * add quick hack to render only some special tokens * fix bf16 conversion * remove vocab hack * webui ok * support chat parsing for gpt-oss * fix webui * direct mapping mxfp4, FINALLY * force using mxfp4 * properly use lazy tensor * ggml : add mxfp4 ggml : use e8m0 conversion instead of powf Co-authored-by: Diego Devesa <slarengh@gmail.com> change kvalues_mxfp4 table to match e2m1 (#6) metal : remove quantization for now (not used) cuda : fix disabled CUDA graphs due to ffn moe bias vulkan : add support for mxfp4 cont : add cm2 dequant * ggml : add ggml_add_id (#13) * ggml : add ggml_add_id * add cuda impl * llama : add weight support check for add_id * perf opt * add vulkan impl * rename cuda files * add metal impl * allow in-place ggml_add_id * llama : keep biases on CPU with --cpu-moe * llama : fix compile error ggml-ci * cuda : add fallback for __nv_cvt_e8m0_to_bf16raw ggml-ci * cleanup ggml-ci * sycl : fix supports_op for MXFP4 ggml-ci * fix Unknown reasoning format * ggml-cpu : fix AVX build ggml-ci * fix hip build ggml-ci * cuda : add mxfp4 dequantization support for cuBLAS ggml-ci * ggml-cpu : fix mxfp4 fallback definitions for some architectures ggml-ci * cuda : fix version required for __nv_cvt_e8m0_to_bf16raw --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: slaren <slarengh@gmail.com>	2025-08-05 22:10:36 +03:00
Johannes Gäßler	7474e00b34	CUDA: fix crash with partial offloading of MoE (#13439 )	2025-05-11 16:09:33 +02:00
Johannes Gäßler	2356fb1d53	CUDA: fix bad asserts for partial offload (#13337 )	2025-05-06 13:58:51 +02:00
Johannes Gäßler	9070365020	CUDA: fix logic for clearing padding with -ngl 0 (#13320 )	2025-05-05 22:32:13 +02:00
Johannes Gäßler	e1e8e0991f	CUDA: batched+noncont MMQ, refactor bs>1 MoE code (#13199 )	2025-04-30 23:12:59 +02:00
Johannes Gäßler	b10d8bfdb1	CUDA: use switch statements in constexpr functions (#13095 )	2025-04-24 15:57:10 +02:00
Johannes Gäßler	658987cfc9	CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (#13014 ) * CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID * fix logic for RoPE support, CUDA graphs	2025-04-22 21:27:40 +02:00
R0CKSTAR	492d7f1ff7	musa: fix all warnings, re-enable `-DLLAMA_FATAL_WARNINGS=ON` in ci and update doc (#12611 ) * musa: fix all warnings Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: enable -DLLAMA_FATAL_WARNINGS=ON in run.sh Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: update ci doc (install ccache) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * fix Windows build issue Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-03-30 10:59:38 +02:00
Slobodan Josic	bd40678df7	HIP: Add support for RDNA4 targets (#12372 )	2025-03-26 23:46:30 +01:00
uvos	10f2e81809	CUDA/HIP: refractor mmqv to unify the calculation of nwarps and rows per block between host and device code. (#12177 ) refactor mmqv to unify the calculation of nwarps and rows per block between host and device code. --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-03-11 20:16:03 +01:00
uvos	9fbadaef4f	rocBLAS: Avoid fp32->fp16->fp32 conversion on cdna (#11356 )	2025-01-24 17:50:49 +01:00
Andreas Kieslinger	750cb3e246	CUDA: rename macros to avoid conflicts with WinAPI (#10736 ) * Renames NVIDIA GPU-architecture flags to avoid name clashes with WinAPI. (e.g. CC_PASCAL, GPU architecture or WinAPI pascal compiler flag?) * Reverts erroneous rename in SYCL-code. * Renames GGML_CUDA_MIN_CC_DP4A to GGML_CUDA_CC_DP4A. * Renames the rest of the compute capability macros for consistency.	2024-12-10 18:23:24 +01:00
uvos	3ad5451f3b	Add some minimal optimizations for CDNA (#10498 ) * Add some minimal optimizations for CDNA * ggml_cuda: set launch bounds also for GCN as it helps there too	2024-11-27 17:10:08 +01:00
Diego Devesa	ae8de6d50a	ggml : build backends as libraries (#10256 ) * ggml : build backends as libraries --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>	2024-11-14 18:04:35 +01:00
slaren	2b1f616b20	ggml : reduce hash table reset cost (#8698 ) * ggml : reduce hash table reset cost * fix unreachable code warnings after GGML_ASSERT(false) * GGML_ASSERT(false) -> GGML_ABORT("fatal error") * GGML_ABORT use format string	2024-07-27 04:41:55 +02:00
Johannes Gäßler	cb5fad4c6c	CUDA: refactor and optimize IQ MMVQ (#8215 ) * CUDA: refactor and optimize IQ MMVQ * uint -> uint32_t * __dp4a -> ggml_cuda_dp4a * remove MIN_CC_DP4A checks * change default * try CI fix	2024-07-01 20:39:06 +02:00
Georgi Gerganov	f3f65429c4	llama : reorganize source code + improve CMake (#8006 ) * scripts : update sync [no ci] * files : relocate [no ci] * ci : disable kompute build [no ci] * cmake : fixes [no ci] * server : fix mingw build ggml-ci * cmake : minor [no ci] * cmake : link math library [no ci] * cmake : build normal ggml library (not object library) [no ci] * cmake : fix kompute build ggml-ci * make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE ggml-ci * move public backend headers to the public include directory (#8122) * move public backend headers to the public include directory * nix test * spm : fix metal header --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * scripts : fix sync paths [no ci] * scripts : sync ggml-blas.h [no ci] --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-26 18:33:02 +03:00

21 Commits