ggml_vk_create_buffer_temp is not used anywhere, and it is the only
caller of ggml_vk_pool_malloc.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
* vulkan: implement SSM scan operation
Add State Space Model scan operation to the Vulkan backend.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
* vulkan: implement SSM conv operation
Add State Space Model conv operation to the Vulkan backend.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
---------
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
* vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE
Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers.
The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed
beyond that limit. This allows > 4GB buffers to be allocated on some
implementations (e.g. NVIDIA) and tensors this large can be used for im2col
and mul_mat.
For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange.
I'm not sure this check is ideal, but we always use these buffers as a single
full-size binding, and the limit may be smaller than maxMemoryAllocationSize
or maxBufferSize, so I think this is reasonable.
Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range.
The maxStorageBufferRange may be smaller than the maxBufferSize or
maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and
it's invalid usage if VK_WHOLE_SIZE computes a range larger than
maxStorageBufferRange.
With this change, it should be possible to generate videos using wan networks
in stable-diffusion.cpp.
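As a rough illustration of the descriptor-range clamping described above (not the actual ggml-vulkan code; the function and variable names are made up, only the Vulkan structures and limits are real):

```cpp
#include <vulkan/vulkan.h>
#include <algorithm>

// Illustrative sketch: compute an explicit descriptor range instead of
// passing VK_WHOLE_SIZE, so the bound range never exceeds
// maxStorageBufferRange even when the underlying buffer is larger.
// (maxBufferSize itself comes from VkPhysicalDeviceMaintenance4Properties
// and is checked separately at buffer-creation time.)
static VkDescriptorBufferInfo make_storage_binding(
        const VkPhysicalDeviceLimits & limits,
        VkBuffer buffer, VkDeviceSize offset, VkDeviceSize needed) {
    const VkDeviceSize max_range = limits.maxStorageBufferRange;
    VkDescriptorBufferInfo info{};
    info.buffer = buffer;
    info.offset = offset;
    info.range  = std::min(needed, max_range); // bind only what the shader needs
    return info;
}
```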
* vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull
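A minimal sketch of how such an override can be read (only the variable name and the use of std::stoull come from the commit above; the surrounding code is illustrative):

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>

// Illustrative sketch: let GGML_VK_FORCE_MAX_BUFFER_SIZE override the
// detected maximum buffer size; std::stoull handles 64-bit values.
static uint64_t get_max_buffer_size(uint64_t detected) {
    const char * env = std::getenv("GGML_VK_FORCE_MAX_BUFFER_SIZE");
    if (env != nullptr) {
        return std::stoull(env); // throws on malformed input
    }
    return detected;
}
```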
* vulkan: 64-bit im2col
Add variants of the im2col shaders that use buffer_device_address/buffer_reference,
and use 64-bit address calculations. This is needed for large convolutions used in
stable-diffusion.cpp.
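A sketch of the host side of this approach (illustrative, plain Vulkan C API; the shader side with GL_EXT_buffer_reference and 64-bit indexing is not shown):

```cpp
#include <vulkan/vulkan.h>

// Illustrative sketch: obtain a 64-bit device address for a buffer so a
// shader using GL_EXT_buffer_reference can address data beyond 32-bit
// range limits. The buffer must have been created with
// VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT and the bufferDeviceAddress
// feature must be enabled on the device.
static VkDeviceAddress get_buffer_address(VkDevice device, VkBuffer buffer) {
    VkBufferDeviceAddressInfo info{};
    info.sType  = VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO;
    info.buffer = buffer;
    // The address (plus any offset) can then be passed to the shader,
    // for example via a push constant, instead of a descriptor binding.
    return vkGetBufferDeviceAddress(device, &info);
}
```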
* fix validation error for large im2col
* vulkan: handle mat_mul with A matrix > 4GB
This change splits mat_mul operations with a huge A matrix into chunks along
the M dimension. This works well for stable-diffusion use cases where the
im2col matrix has a very large M.
Fix the order of setting the stride in mul_mm_cm2 - setting the dimension
clobbers the stride, so the stride must be set afterwards.
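A rough sketch of the M-dimension split described above (illustrative only; the names and the way the chunk size is derived are not from the actual code):

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative sketch: when the A matrix (M x K) is too large to bind in a
// single dispatch, split the matmul into chunks along M and dispatch each
// chunk separately. max_rows would be derived from the relevant device limit.
static void mul_mat_chunked(uint64_t M, uint64_t max_rows,
                            void (*dispatch)(uint64_t m0, uint64_t m_count)) {
    for (uint64_t m0 = 0; m0 < M; m0 += max_rows) {
        const uint64_t m_count = std::min(max_rows, M - m0);
        // Each dispatch reads rows [m0, m0 + m_count) of A and writes the
        // corresponding rows of the output; B is reused unchanged.
        dispatch(m0, m_count);
    }
}
```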
* build fixes
The dequantize functions are copy/pasted from mul_mm_funcs.comp with very few
changes - add a_offset and divide iqs by 2. It's probably possible to call
these functions from mul_mm_funcs and avoid the duplication, but I didn't go
that far in this change.
* implement set_rows with i32 index
* template fix
* test quantized path
warnings--
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* forgotten name change
* deduplicate cuda/sycl and test-fix
* indent++
* vulkan: support set_rows with i32 index type (#16162)
* disable i32 index for webgpu for now
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* Vulkan: add conv_transpose_2d operation
* Vulkan: fix typo in conv_transpose_2d shader (s0mp, s0L, s1mp, s1L)
* Vulkan: fix incorrect indentation in conv_transpose_2d shader
* Vulkan: add a check for the push constants size limit and reuse conv2d_mm.comp for the conv_transpose_2d operation
* Vulkan: revert the order of the index calculation and bound check in conv_2d shader
* Vulkan: explicitly check the push constants limit in supports_op() for the conv_transpose_2d operation (see the sketch below).
* Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.
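A hedged sketch of the push-constant check mentioned above (illustrative; only maxPushConstantsSize is the real Vulkan limit, the rest is made up):

```cpp
#include <vulkan/vulkan.h>

// Illustrative sketch: reject an op in supports_op() when the push-constant
// block its shader needs would exceed the device limit, rather than failing
// later at pipeline-layout creation. The spec only guarantees 128 bytes.
static bool push_constants_fit(const VkPhysicalDeviceLimits & limits,
                               uint32_t push_constant_size) {
    return push_constant_size <= limits.maxPushConstantsSize;
}
```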
* vulkan: optimize UMA buffer operations and fix driver hangs
The previous implementation was blocking the GPU for extended periods,
causing the i915 driver to reset the context due to the hangcheck
protection.
[32628.443070] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:85dffffb, in llama-server [194114]
[32628.443091] i915 0000:00:02.0: [drm] llama-server[194114] context reset due to GPU hang
* vulkan: implement deferred_memset on UMA
---------
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use this to query the register count for shader compiles on NVIDIA. Currently
this is only used for performance debugging, but it could eventually be used in
heuristics like split_k.
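Assuming the mechanism is VK_KHR_pipeline_executable_properties (it is not named in this excerpt), a query could look roughly like this; none of this is the backend's actual code:

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

// Illustrative sketch: dump per-executable statistics (register count on
// NVIDIA, among others) for a compiled pipeline. The device must have the
// pipeline-executable-properties feature enabled, and the vkGetPipeline*
// entry points typically need to be loaded via vkGetDeviceProcAddr.
static void print_pipeline_stats(VkDevice device, VkPipeline pipeline) {
    VkPipelineInfoKHR pinfo { VK_STRUCTURE_TYPE_PIPELINE_INFO_KHR, nullptr, pipeline };

    uint32_t exec_count = 0;
    vkGetPipelineExecutablePropertiesKHR(device, &pinfo, &exec_count, nullptr);

    for (uint32_t i = 0; i < exec_count; ++i) {
        VkPipelineExecutableInfoKHR einfo { VK_STRUCTURE_TYPE_PIPELINE_EXECUTABLE_INFO_KHR, nullptr, pipeline, i };

        uint32_t stat_count = 0;
        vkGetPipelineExecutableStatisticsKHR(device, &einfo, &stat_count, nullptr);

        std::vector<VkPipelineExecutableStatisticKHR> stats(stat_count);
        for (auto & s : stats) {
            s.sType = VK_STRUCTURE_TYPE_PIPELINE_EXECUTABLE_STATISTIC_KHR;
        }
        vkGetPipelineExecutableStatisticsKHR(device, &einfo, &stat_count, stats.data());

        for (const auto & s : stats) {
            // Assumes a u64-format statistic for brevity; real code should
            // switch on s.format.
            printf("%s: %llu\n", s.name, (unsigned long long) s.value.u64);
        }
    }
}
```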
* ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type
ggml-backend : add device id to device props
llama : only use iGPU devices if there are no GPU devices
llama : do not use multiple devices from different backends with the same device id
* vulkan: sort graph to allow more parallel execution
Add a backend proc to allow the backend to modify the graph. The
Vulkan implementation looks at which nodes depend on each other and
greedily reorders them to group together nodes that don't depend on
each other. It only reorders the nodes; it does not change the
contents of any of them.
With #15489, this reduces the number of synchronizations needed.
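A simplified sketch of the grouping idea (illustrative; the real pass operates on ggml graph nodes and tensor dependencies, and its greedy heuristic differs in detail):

```cpp
#include <vector>

// Illustrative sketch: emit the graph as successive "levels" of mutually
// independent nodes. Within a level no node depends on another, so the whole
// level can run without intermediate synchronization, and the concatenation
// of the levels is still a valid topological order of the original graph.
// depends(a, b) answers "does node a consume an output of node b?".
static std::vector<int> reorder_nodes(int n, bool (*depends)(int, int)) {
    std::vector<int>  order;
    std::vector<bool> emitted(n, false);
    while ((int) order.size() < n) {
        std::vector<int> level;
        for (int i = 0; i < n; ++i) {
            if (emitted[i]) {
                continue;
            }
            bool ready = true;
            for (int j = 0; j < n; ++j) {
                if (!emitted[j] && depends(i, j)) { // unmet dependency
                    ready = false;
                    break;
                }
            }
            if (ready) {
                level.push_back(i);
            }
        }
        for (int i : level) {
            emitted[i] = true;
            order.push_back(i);
        }
    }
    return order;
}
```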
* call optimize_graph per-split
* ggml: allow casting between f32 and i32
* fix cuda
* add vulkan
* fix CPU non-cont
* add non-cont test case
* add note
* extend test number range
* correct note
* add cont version for vulkan
* vulkan : update ggml_vk_instance_validation_ext_available
This commit updates ggml_vk_instance_validation_ext_available() to
check for VK_EXT_validation_features instead of
VK_KHR_portability_enumeration.
Based on how the returned boolean is used later in the code (to enable
both the validation layer and the VK_EXT_validation_features extension),
it appears the function may have been intended to check for the
validation features extension.
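A minimal sketch of the kind of check described above (illustrative, not a copy of the actual function):

```cpp
#include <vulkan/vulkan.h>
#include <cstring>
#include <vector>

// Illustrative sketch: report whether the instance-level
// VK_EXT_validation_features extension is available, so the caller can
// decide to enable the validation layer together with it.
static bool validation_features_available() {
    uint32_t count = 0;
    vkEnumerateInstanceExtensionProperties(nullptr, &count, nullptr);

    std::vector<VkExtensionProperties> exts(count);
    vkEnumerateInstanceExtensionProperties(nullptr, &count, exts.data());

    for (const auto & ext : exts) {
        if (std::strcmp(ext.extensionName, VK_EXT_VALIDATION_FEATURES_EXTENSION_NAME) == 0) {
            return true;
        }
    }
    return false;
}
```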
* remove try/catch
This was a leftover from a previous iteration where I was explicitly
querying for a specific validation layer first, which would throw.
* update warning message about validation layers
* vulkan: use memory budget extension to read memory usage
* fix: formatting and names
* formatting
* fix: detect and cache memory budget extension availability on init
* fix: read `budgetprops.heapBudget` instead of `heap.size` when the memory budget extension is available (see the sketch below)
* style: lints
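A sketch of how the budget can be read once VK_EXT_memory_budget is detected (illustrative; only the Vulkan structures and the heapBudget/heapUsage fields are real, the variable names are not from the code):

```cpp
#include <vulkan/vulkan.h>

// Illustrative sketch: chain VkPhysicalDeviceMemoryBudgetPropertiesEXT into
// the memory-properties query to get per-heap budget and current usage,
// instead of relying on the static memoryHeaps[i].size values.
static void query_heap_budgets(VkPhysicalDevice phys_dev) {
    VkPhysicalDeviceMemoryBudgetPropertiesEXT budget_props{};
    budget_props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_BUDGET_PROPERTIES_EXT;

    VkPhysicalDeviceMemoryProperties2 mem_props{};
    mem_props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_PROPERTIES_2;
    mem_props.pNext = &budget_props;

    vkGetPhysicalDeviceMemoryProperties2(phys_dev, &mem_props);

    for (uint32_t i = 0; i < mem_props.memoryProperties.memoryHeapCount; ++i) {
        // heapBudget is what the process can expect to be able to allocate
        // from this heap; heapUsage is what it has already allocated.
        VkDeviceSize budget = budget_props.heapBudget[i];
        VkDeviceSize usage  = budget_props.heapUsage[i];
        (void) budget; (void) usage;
    }
}
```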
* vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants
* vulkan: use subgroup operations for quantize_q8_1 shader
* vulkan: add q8_1_x4 type with 128-bit alignment, use in mul_mat_vecq shader
* vulkan: use q8_1_x4 blocks in mul_mmq shader
* vulkan: do 8 calculations per invocation instead of 32 in mul_mat_vecq, similar to mul_mat_vec
* vulkan: tune mul_mat_vecq performance for Intel
* vulkan: fix quantizing issue when tensor is not divisible by 128
* vulkan: adapt integer dot mmv to mmv small m optimization (#15355)
* vulkan: allow all subgroup modes for mmv and mmvq
* vulkan: use prealloc intermediate reuse for mmvq path
* vulkan: tune mmvq for Intel, AMD GCN and Nvidia RTX 3090
* vulkan: adapt mmv quantize_y path to conditional sync logic
* vulkan: disable q8_0 mmvq on Nvidia
* vulkan: enable q8_0 on Nvidia pre-turing
* fix prealloc sync condition
* fix llvmpipe subgroup 8 issue
* vulkan: mul_mat_id coopmat2 optimizations
Add a path for when the tile fits in BN/2, similar to what we have for mul_mat.
Only call fetch_scales/store_scales once per QUANT_K block, and once at the
beginning in case start_k is not aligned.
* Also add a path for BN/4 - worth a couple more percent
This commit removes the portability_enumeration_ext variable from the
ggml_vk_instance_portability_enumeration_ext_available function as it
is initialized to false but never modified, making it redundant.
* vulkan: use subgroup function for mul_mat_id shader even without coopmat
* vulkan: fix compile warnings
* vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id
* vulkan: disable subgroup mul_mat_id on devices with subgroups < 16
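A hedged sketch of the checks behind the last two items (illustrative; the actual gating in the backend is more involved):

```cpp
#include <vulkan/vulkan.h>

// Illustrative sketch: decide whether the subgroup-based mul_mat_id path can
// be used. It needs subgroup size control with full-subgroups support, and is
// skipped when the device's subgroup size is smaller than 16.
static bool subgroup_mul_mat_id_supported(VkPhysicalDevice phys_dev) {
    VkPhysicalDeviceSubgroupSizeControlFeatures sgsc_features{};
    sgsc_features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_SIZE_CONTROL_FEATURES;

    VkPhysicalDeviceFeatures2 features{};
    features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2;
    features.pNext = &sgsc_features;
    vkGetPhysicalDeviceFeatures2(phys_dev, &features);

    VkPhysicalDeviceSubgroupProperties subgroup_props{};
    subgroup_props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;

    VkPhysicalDeviceProperties2 props{};
    props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props.pNext = &subgroup_props;
    vkGetPhysicalDeviceProperties2(phys_dev, &props);

    // Full subgroups are requested at pipeline creation with
    // VK_PIPELINE_SHADER_STAGE_CREATE_REQUIRE_FULL_SUBGROUPS_BIT.
    return sgsc_features.subgroupSizeControl &&
           sgsc_features.computeFullSubgroups &&
           subgroup_props.subgroupSize >= 16;
}
```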