llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-11-01 09:01:57 +00:00

Author	SHA1	Message	Date
leejet	8f5e7b0ce6	remove unused variables	2025-08-31 12:02:23 +08:00
leejet	e66bf6e503	cpu: im2col_3d support non continuous src Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-08-31 11:58:32 +08:00
leejet	0d5eb51252	cuda: use simpler loop in get_rows	2025-08-31 00:21:24 +08:00
leejet	131ae2d585	adjust the code style	2025-08-31 00:04:27 +08:00
leejet	c9b9fabe08	fix cpu im2col_3d	2025-08-30 11:25:07 +08:00
leejet	f6278c832f	cuda: remove unnecessary MIN define	2025-08-30 04:14:19 +08:00
leejet	f6a874c04a	avoid build failure on MacOS	2025-08-30 03:53:03 +08:00
leejet	d11a729898	avoid build failure	2025-08-30 03:48:47 +08:00
leejet	9d035c4c4a	correct GGML_OP_COUNT assertion	2025-08-30 03:36:59 +08:00
leejet	df05913bc4	avoid ggml_conv_3d conflict	2025-08-30 03:28:07 +08:00
leejet	d30e07dbb3	fix cuda get_rows	2025-08-30 03:13:57 +08:00
leejet	d8377a0a37	gguf: support loading tensors which n_dims > GGML_MAX_DIMS	2025-08-30 03:11:09 +08:00
leejet	dd745ba31f	make im2col_3d faster	2025-08-30 03:11:09 +08:00
leejet	ae47caca70	fix cuda pad/scale/im2col3d	2025-08-30 03:11:08 +08:00
leejet	85c8e1e519	cuda: make im2col a little faster	2025-08-30 03:11:08 +08:00
leejet	f7a12f9e69	cuda/cpu: add im2col_3d support	2025-08-30 03:11:08 +08:00
leejet	93c7e775b8	add ggml_pad_ext for cpu & cuda backend	2025-08-30 02:56:56 +08:00
leejet	c92f9b4a68	add conv3d support	2025-08-30 02:56:56 +08:00
Aman Gupta	81017865ee	CUDA: fix bug in rms_norm fusion (#15660 ) * CUDA: fix bug in rms_norm fusion * Fix bug for OP_REPEAT * Fix index for add	2025-08-29 21:30:06 +08:00
Aman Gupta	009b709d6e	CUDA: fuse adds, fuse add with rms norm (#15631 ) * CUDA: fused add with rms_norm_mul * Non-broadcast fuse works * Add fused adds * format * Remove n_fuse from template params * Address review comments * Move template inside binbcast	2025-08-29 11:35:58 +08:00
mnehete32	c97dc09391	CUDA: add conv2d (#15635 ) * CUDA: add conv2d * CUDA: conv2d - correct formatting and added const	2025-08-28 20:33:03 +02:00
Aaron Teo	6c442f42ff	ggml-cpu: fix invalid hsum build in debug s390x (#15634 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-08-28 22:39:27 +08:00
compilade	73804145ab	ggml : fix SSM_SCAN for n_groups > 1 (#15625 )	2025-08-28 10:11:36 -04:00
Georgi Gerganov	8a4280ce43	kv-cache : remove LLAMA_SET_ROWS checks (#15505 ) ggml-ci	2025-08-28 12:27:02 +03:00
matiaslin	5a0e3ef6f0	cuda: Add cublasLt_static linking when GGML_STATIC is enabled (#15622 ) Prior to this change, we faced undefined cublasLt references when attempting to compile 'llama-cli' with GGML_STATIC=ON on Linux. We add linking with CUDA::cublasLt_static when CUDA version is greater than 10.1.	2025-08-28 02:32:36 +02:00
uvos	47373271f9	HIP: Enable support for ggml_backend_cuda_register_host_buffer (#15615 )	2025-08-27 13:58:54 +02:00
Chenguang Li	1e7489745a	CANN: refactor mask handling and improve performance in FA (#15561 ) * CANN(flash-attn): refactor mask handling and improve performance 1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode. 2. Optimized performance in non-alibi scenarios by reducing one repeat operation. 3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16. Signed-off-by: noemotiovon <757486878@qq.com> * [CANN]: fix review Signed-off-by: noemotiovon <757486878@qq.com> * [CANN]: Optimization FA BNSD to BSND Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-08-27 17:21:41 +08:00
xctan	1cf123a343	ggml-cpu : add basic RVV support for vector f32 ops (#15057 ) * ggml-cpu : add basic RVV support for vector f32 ops * ggml-cpu : add RVV support for f32 softmax	2025-08-27 16:44:22 +08:00
rmatif	86076f92de	OpenCL: add fused group_norm/norm, mul, add (#15314 ) * add fused group_norm/norm, mul, add * fix spacing * revert rms_norm logic * fix trailing whitespace	2025-08-26 23:36:05 -07:00
Akarshan Biswas	8b69686136	SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (#15592 ) The original implementation unconditionally returned true for this operation, leading to a failure when the tensor's first dimension (ne[0]) was not a multiple of WARP_SIZE. This caused a GGML_ASSERT(ncols % WARP_SIZE == 0) failure in ggml-sycl/norm.cpp. This change updates the ggml_backend_sycl_device_supports_op check to correctly return true for GGML_OP_RMS_NORM only when the first dimension of the tensor is a multiple of WARP_SIZE, ensuring the operation can be performed without error.	2025-08-27 00:27:49 +05:30
shalinib-ibm	a6a58d6478	llamafile: PowerPC Sgemm Optimization (#15558 ) This patch improves GEMM for the FP32 data type on PowerPC. Implements GEMM on large blocks with configurable block size mc, nc, kc (default: 256, 256, 256). Packing function optimized to access blocks as per memory layout. GEMM optimized to work on larger blocks. Isolated packing from GEMM operations for better MMA utilization. Verified functionality and correctness using llama-cli and a standalone test case (performs matmul and compares the final matrix C result with base). Minor code refactoring changes: replace macro with inline function; code indent made consistent with 4 spaces. Performance testing: observed 50% ~ 70% improvement in prompt processing speed measured using llama-bench with the Meta-Llama3-8B FP32 model. Similar gains observed with the Mistral-7b-Instruct-v0.3 model. llama-bench results (model: llama 8B all F32; size: 29.92 GiB; params: 8.03 B; backend: CPU; threads: 20), listed as test: patch / base: pp512: 98.58 / 60.3; pp1024: 95.88 / 57.36; pp2048: 85.46 / 53.26; pp4096: 68.66 / 45.78; pp6144: 57.35 / 40.44. 25 ~ 30% improvement in llama-batched-bench with Meta-Llama3-8B in prompt processing speed for large prompts (256, 512, 1024, 2048, 4096 tokens) with various batch sizes (1, 2, 4, 8, 16). Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>	2025-08-26 23:35:25 +08:00
Johannes Gäßler	8f5afa94c4	CUDA: return -1 for nonexistent compiled arch (#15587 )	2025-08-26 16:01:20 +02:00
Georgi Gerganov	b3964c1e89	metal : optimize FA vec for large sequences and BS <= 8 (#15566 ) * metal : optimize FA vec for large heads and sequences * metal : adjust small-batch mul mv kernels ggml-ci * batched-bench : fix total speed computation ggml-ci * cont : add comments ggml-ci	2025-08-26 14:22:14 +03:00
Georgi Gerganov	1d8d83deaa	metal : improve `MUL_MAT_ID` (#15541 ) * metal : mul_mm_id remove hdst * metal : remove mul_mm_id hsrc1 * metal : mul_mm_id simplify + add test * metal : opt mul_mm_id map0 * metal : optimize mul_mm_id id gathering * metal : mul/div opt * metal : optimize mul_mm_id_map0 ggml-ci	2025-08-26 12:46:15 +03:00
Sigbjørn Skjæret	0fd90db585	metal : remove contiguous assertion for src0 in IM2COL (#15577 ) * remove contiguous assertion for src0 in IM2COL * add contiguous check in supports_op	2025-08-26 09:51:43 +03:00
Yoshi_likes_e4	4c37636b3e	Add a warning for special devices (#15563 ) * Add warning * Print the devices names * Add newlines * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Fix vector names --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-08-26 08:15:33 +02:00
Jeff Bolz	34bdbbd7c2	vulkan: Remove splitting for mul_mat_id (#15568 ) row_ids only needs to hold the BN rows for the current tile.	2025-08-26 06:42:44 +02:00
Qeeweew	74f52f77f2	CUDA: Accelerate MXFP4 table lookup using `__byte_perm` (#15451 ) * CUDA: optimize get_int_from_table_16 * CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs * revise documentation --------- Co-authored-by: xix <xiapc@outlook.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-08-25 23:21:22 +02:00
lhez	f7207b0415	opencl: fix support ops condition for `rms_norm` (#15560 )	2025-08-25 14:18:09 -07:00
Ruben Ortlam	4d917cd4f6	vulkan: fix min subgroup 16 condition for mmid subgroup optimization (#15565 )	2025-08-25 17:56:59 +02:00
Ihar Hrachyshka	111f8d06f0	metal: fix regression when no metal devices are present (#15531 )	2025-08-25 18:27:34 +03:00
Johannes Gäßler	5eff6ec9b1	CUDA: MoE helper in device code, better tile sizes (#15525 ) * CUDA: MoE helper in device code, better tile sizes * reduce superfluous CUDA blocks	2025-08-25 17:23:40 +02:00
Georgi Gerganov	b0ba31f525	metal : add FA kernels for HS=40 (#15559 ) ggml-ci	2025-08-25 10:14:48 +03:00
Chenguang Li	c247d06f38	CANN: ROPE cache sin/cos repeat (#15501 ) Signed-off-by: noemotiovon <757486878@qq.com>	2025-08-25 10:32:21 +08:00
Ruben Ortlam	043fb27d38	vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices (#15524 ) * vulkan: use subgroup function for mul_mat_id shader even without coopmat * vulkan: fix compile warnings * vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id * vulkan: disable subgroup mul_mat_id on devices with subgroups < 16	2025-08-24 19:36:36 +02:00
Jeff Bolz	c9a24fb932	vulkan: Support FA with any multiple of 8 head sizes (#15537 ) The scalar FA shader already handled multiples of 8. The coopmat1 FA shader assumed 16x16x16 and the shared memory allocations need the HSK dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation requires multiples of 16 for N and K, and needs the matrix dimensions padded and loads clamped. Store the FA pipelines in a map, indexed by the pipeline state.	2025-08-24 11:24:25 +02:00
Ruben Ortlam	a9c6ffcbfa	vulkan: enable Conv2D for Apple after MoltenVK fixed the bug (#15526 )	2025-08-24 10:48:53 +02:00
Jeff Bolz	e78cf0d4b1	vulkan: workaround MoltenVK compile failure in multi_add (#15506 ) * vulkan: workaround MoltenVK compile failure in multi_add * Update ggml/src/ggml-vulkan/vulkan-shaders/multi_add.comp Co-authored-by: 0cc4m <picard12@live.de>	2025-08-24 10:48:21 +02:00
Johannes Gäßler	710dfc465a	CUDA: fix half2 -> half conversion for HIP (#15529 )	2025-08-23 21:37:06 +02:00
Jeff Bolz	611f419cff	vulkan: optimize rms_norm, and allow the work to spread across multiple SMs (#15281 ) * vulkan: optimize rms_norm, and allow the work to spread across multiple SMs There are really two parts to this change: (1) Some optimizations similar to what we have in soft_max, to unroll with different numbers of iterations. (2) A fusion optimization where we detect add followed by rms_norm, and make the add shader atomically accumulate the values^2 into memory. Then the rms_norm shader can just load that sum. This allows the rms_norm to be parallelized across multiple workgroups, it just becomes a simple per-element multiply. The fusion optimization is currently only applied when the rms_norm is on a single vector. This previously always ran on a single SM. It could apply more broadly, but when there are other dimensions the work can already spread across SMs, and there would be some complexity to tracking multiple atomic sums. * Change add+rms_norm optimization to write out an array of partial sums rather than using atomic add, to make it deterministic. The rms_norm shader fetches a subgroup's worth in parallel and uses subgroupAdd to add them up. * complete rebase against fused adds - multi_add shader can also compute partial sums * fix validation errors * disable add_rms_fusion for Intel due to possible driver bug * resolve against #15489, sync after clearing partial sums	2025-08-23 13:16:17 -05:00
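
The add + rms_norm fusion described in commit 611f419cff (#15281) can be illustrated with a small CPU-side sketch. This is not the Vulkan shader code from that commit; it only mirrors the idea stated in the message: the add step writes an array of partial sums of squares, and the rms_norm step reduces that array in a fixed order and then applies a per-element multiply. The function names, the `chunk` parameter (standing in for the number of elements one workgroup handles), and the use of `std::vector` are all assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Stage 1 (hypothetical name): out = a + b, plus one partial sum of squares
// per chunk of `chunk` elements. Assumes a.size() == b.size() and chunk > 0.
std::vector<float> add_with_partial_sums(const std::vector<float>& a,
                                         const std::vector<float>& b,
                                         std::vector<float>& out,
                                         std::size_t chunk) {
    const std::size_t n = a.size();
    out.resize(n);
    std::vector<float> partial((n + chunk - 1) / chunk, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
        partial[i / chunk] += out[i] * out[i];
    }
    return partial;
}

// Stage 2 (hypothetical name): reduce the partial sums in a fixed order,
// which keeps the result deterministic, then scale every element.
void rms_norm_from_partials(std::vector<float>& x,
                            const std::vector<float>& partial,
                            float eps) {
    if (x.empty()) {
        return;  // nothing to normalize
    }
    float sum = 0.0f;
    for (float p : partial) {
        sum += p;
    }
    const float mean  = sum / static_cast<float>(x.size());
    const float scale = 1.0f / std::sqrt(mean + eps);
    for (float& v : x) {
        v *= scale;  // the normalization is now a simple per-element multiply
    }
}
```

Calling `add_with_partial_sums(a, b, out, chunk)` followed by `rms_norm_from_partials(out, partial, eps)` normalizes `a + b` without a second full pass to compute the sum of squares. Writing partial sums instead of accumulating atomically fixes the order of the floating-point additions, which is the determinism argument the commit message gives.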

1253 Commits