Commit Graph

216 Commits

Author SHA1 Message Date
0cc4m
cc71ccca82 vulkan: add mmq q2_k integer dot support 2025-10-25 13:31:41 +00:00
Giuseppe Scrivano
f90b4a8efe vulkan: delete dead code (#16732)
ggml_vk_create_buffer_temp is not used anywhere, and it is the only
caller for ggml_vk_pool_malloc.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-10-25 10:59:54 +02:00
Jeff Bolz
8423d01931 vulkan: Optimize SSM_SCAN (#16645) 2025-10-25 07:04:12 +02:00
Jeff Bolz
e56abd2098 vulkan: Implement topk_moe fused shader, ported from CUDA (#16641)
This is similar to the CUDA shader from #16130, but doesn't use shared memory
and handles different subgroup sizes.
2025-10-18 12:22:57 +02:00
Giuseppe Scrivano
3d4e86bbeb vulkan: Add State Space Model (SSM) Operations Support (#16463)
* vulkan: implement SSM scan operation

Add State Space Model scan operation to the Vulkan backend.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

* vulkan: implement SSM conv operation

Add State Space Model conv operation to the Vulkan backend.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-10-17 14:23:47 +02:00
Jeff Bolz
4258e0cfe7 vulkan: Support FA with K/V in F32 (#16543) 2025-10-14 15:53:37 +02:00
Jeff Bolz
2aaf0a2a20 vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (#16354)
* vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE

Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers.
The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed
beyond that limit. This allows > 4GB buffers to be allocated on some
implementations (e.g. NVIDIA) and tensors this large can be used for im2col
and mul_mat.

For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange.
I'm not sure this check is ideal, but we always use these buffers as a single
full size binding and the limit may be smaller than maxMemoryAllocationSize
or maxBufferSize, so I think this is reasonable.

Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range.
The maxStorageBufferRange may be smaller than the maxBufferSize or
maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and
it's invalid usage if VK_WHOLE_SIZE computes a range larger than
maxStorageBufferRange.

With this change, it should be possible to generate videos using wan networks
in stable-diffusion.cpp.

* vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull
2025-10-03 12:50:46 +02:00
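A minimal host-side sketch of the two limit checks described above, assuming Vulkan 1.3 (or VK_KHR_maintenance4) for maxBufferSize; the names are illustrative and this is not the actual ggml-vulkan code:

```cpp
#include <vulkan/vulkan.hpp>
#include <algorithm>

struct buffer_limits {
    vk::DeviceSize max_buffer_size;          // hard cap for vkCreateBuffer
    vk::DeviceSize max_storage_buffer_range; // cap for a single SSBO binding
};

// Query maxBufferSize (Maintenance4) and maxStorageBufferRange from the device.
static buffer_limits query_buffer_limits(vk::PhysicalDevice phys) {
    auto chain = phys.getProperties2<vk::PhysicalDeviceProperties2,
                                     vk::PhysicalDeviceMaintenance4Properties>();
    buffer_limits lim;
    lim.max_buffer_size          = chain.get<vk::PhysicalDeviceMaintenance4Properties>().maxBufferSize;
    lim.max_storage_buffer_range = chain.get<vk::PhysicalDeviceProperties2>().properties.limits.maxStorageBufferRange;
    return lim;
}

// Compute an explicit descriptor range instead of VK_WHOLE_SIZE, so the bound
// range can never exceed maxStorageBufferRange even for very large buffers.
static vk::DeviceSize descriptor_range(vk::DeviceSize buffer_size, vk::DeviceSize offset,
                                       const buffer_limits & lim) {
    return std::min(buffer_size - offset, lim.max_storage_buffer_range);
}
```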
Jeff Bolz
e308efda8e vulkan: in flash attention, bounds check against nem1 (don't rely on GGML_KQ_MASK_PAD) (#16316) 2025-10-03 10:33:08 +02:00
Eve
132d673554 vulkan: make ggml_vk_default_dispatcher support older vulkan headers (#16345)
* make ggml_vk_default_dispatcher support older vulkan headers

* simplify with using
2025-10-01 09:56:36 +02:00
Jeff Bolz
d8359f5fde vulkan: 64-bit im2col (#16135)
* vulkan: 64-bit im2col

Add variants of the im2col shaders that use buffer_device_address/buffer_reference,
and use 64-bit address calculations. This is needed for large convolutions used in
stable-diffusion.cpp.

* fix validation error for large im2col
2025-09-28 08:38:37 +02:00
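A hedged sketch of the host-side piece this kind of shader needs: obtaining a 64-bit device address for a buffer so the GLSL side (buffer_reference) can do 64-bit address arithmetic. Assumes the bufferDeviceAddress feature is enabled and the buffer was created with the SHADER_DEVICE_ADDRESS usage bit; not the actual backend code.

```cpp
#include <vulkan/vulkan.hpp>
#include <cstdint>

// Return the 64-bit GPU address of a buffer. The address can be passed to the
// shader (e.g. through push constants) and combined with 64-bit offsets,
// avoiding the 32-bit index overflow that large im2col inputs would hit.
static uint64_t get_buffer_device_address(vk::Device device, vk::Buffer buffer) {
    vk::BufferDeviceAddressInfo info{};
    info.buffer = buffer;
    return device.getBufferAddress(info);
}
```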
Jeff Bolz
1384abf8b8 vulkan: handle mat_mul with A matrix > 4GB (#16176)
* vulkan: handle mat_mul with A matrix > 4GB

This change splits mat_mul operations with huge A matrix into chunks in the M
dimension. This works well for stable-diffusion use cases where the im2col
matrix has very large M.

Fix the order of setting the stride in mul_mm_cm2 - setting the dimension
clobbers the stride, so stride should be set after.

* build fixes
2025-09-27 20:36:34 -05:00
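A conceptual sketch of the chunking idea described above (split the A matrix along M so each slice stays under a size limit); the callback and names are hypothetical, not the real dispatch code.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>

// Split a mul_mat with a huge A matrix (M x K) into chunks along M so that
// each chunk's slice of A stays under max_chunk_bytes. Assumes a row-major
// layout where rows of A are contiguous.
static void mul_mat_chunked(int64_t M, int64_t K, size_t elem_size, size_t max_chunk_bytes,
                            const std::function<void(int64_t m0, int64_t m1)> & dispatch) {
    int64_t rows_per_chunk = std::max<int64_t>(1, (int64_t)(max_chunk_bytes / ((size_t)K * elem_size)));
    for (int64_t m0 = 0; m0 < M; m0 += rows_per_chunk) {
        int64_t m1 = std::min(M, m0 + rows_per_chunk);
        dispatch(m0, m1); // one matmul over rows [m0, m1) of A, full K and N
    }
}
```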
Acly
8656f5de68 vulkan : make the vulkan.hpp dynamic dispatcher instance private (#16224)
* don't use VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE, which can cause conflicts if the application or other libraries do the same
2025-09-27 22:41:03 +02:00
Dmytro Minochkin
0499b29c6f vulkan: throw system error instead of SIGABRT during init on older devices (#16156)
* Throw system error on old Vulkan driver rather than SIGABRT

* Optionally handle any potential error in vulkan init
2025-09-27 18:26:46 +02:00
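A hedged sketch of the general approach (surface the failure as a catchable exception instead of aborting); the message text and surrounding setup are illustrative, not the backend's actual init path.

```cpp
#include <vulkan/vulkan.hpp>
#include <stdexcept>
#include <string>

// Create the instance and convert a Vulkan failure (e.g. on a driver that is
// too old) into a std::runtime_error the caller can handle, instead of SIGABRT.
static vk::UniqueInstance create_instance_or_throw() {
    vk::ApplicationInfo app_info("example-app", 1, nullptr, 0, VK_API_VERSION_1_2);
    vk::InstanceCreateInfo create_info({}, &app_info);
    try {
        return vk::createInstanceUnique(create_info);
    } catch (const vk::SystemError & e) {
        // vulkan.hpp throws vk::SystemError (a std::system_error) on failure.
        throw std::runtime_error(std::string("Vulkan instance creation failed: ") + e.what());
    }
}
```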
Jeff Bolz
3f81b4e91c vulkan: support GET_ROWS for k-quants (#16235)
The dequantize functions are copy/pasted from mul_mm_funcs.comp with very few
changes - add a_offset and divide iqs by 2. It's probably possible to call
these functions from mul_mm_funcs and avoid the duplication, but I didn't go
that far in this change.
2025-09-27 12:36:11 +02:00
Sigbjørn Skjæret
3ecb2f671a ggml : implement set_rows with i32 index (#16159)
* implement set_rows with i32 index

* template fix

* test quantized path

warnings--

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* forgotten name change

* deduplicate cuda/sycl and test-fix

* indent++

* vulkan: support set_rows with i32 index type (#16162)

* disable i32 index for webgpu for now

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-09-22 19:13:00 +02:00
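For reference, the semantics being added here, written as a plain C++ loop: destination row idx[r] is overwritten with source row r, with the index tensor now allowed to be 32-bit. This is a reference sketch, not the ggml kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// set_rows with an i32 index tensor: dst row idx[r] <- src row r.
static void set_rows_i32(std::vector<float> & dst, int64_t dst_rows,
                         const std::vector<float> & src, int64_t src_rows,
                         const std::vector<int32_t> & idx, int64_t row_size) {
    assert((int64_t) idx.size() == src_rows);
    for (int64_t r = 0; r < src_rows; ++r) {
        int64_t d = idx[r];                 // previously only i64 indices were supported
        assert(d >= 0 && d < dst_rows);
        std::copy(src.begin() + r * row_size, src.begin() + (r + 1) * row_size,
                  dst.begin() + d * row_size);
    }
}
```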
Shin-myoung-serp
96fdca043b Vulkan: add conv_transpose_2d operation (#16022)
* Vulkan: add conv_transpose_2d operation

* Vulkan: fix typo in conv_transpose_2d shader (s0mp, s0L, s1mp, s1L)

* Vulkan: fix incorrect indentation in conv_transpose_2d shader

* Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation

* Vulkan: revert the order of the index calculation and bound check in conv_2d shader

* Vulkan: explicitly check push constants limit in supports_op() for conv_transpose_2d operation.

* Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.
2025-09-22 10:04:01 +02:00
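A hedged sketch of the supports_op() push-constant check mentioned above; the push-constant struct fields are purely illustrative.

```cpp
#include <vulkan/vulkan.hpp>
#include <cstdint>

// Illustrative push-constant block for a conv_transpose_2d-style shader.
struct conv_transpose_2d_push_constants {
    uint32_t ne00, ne01, ne02, ne03;
    uint32_t ne10, ne11, ne12, ne13;
    uint32_t stride0, stride1;
};

// Reject the op when the required push-constant block would not fit on this device.
// The spec only guarantees 128 bytes; some devices advertise more.
static bool supports_conv_transpose_2d(vk::PhysicalDevice phys) {
    vk::PhysicalDeviceProperties props = phys.getProperties();
    return sizeof(conv_transpose_2d_push_constants) <= props.limits.maxPushConstantsSize;
}
```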
Jeff Bolz
a20d810d79 vulkan: add RTE variants of exp shader (#16165)
This fixes some failures on Turing where "round to zero" rounds to the max f16
value but the CPU reference value is infinite.
2025-09-22 07:37:17 +02:00
Giuseppe Scrivano
1eeb523c3e vulkan: optimize UMA buffer operations and fix driver hangs (#16059)
* vulkan: optimize UMA buffer operations and fix driver hangs

The previous implementation was blocking the GPU for extended periods,
causing the i915 driver to reset the context due to the hangcheck
protection.

[32628.443070] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:85dffffb, in llama-server [194114]
[32628.443091] i915 0000:00:02.0: [drm] llama-server[194114] context reset due to GPU hang

* vulkan: implement deferred_memset on UMA

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-09-21 08:31:55 +02:00
Jeff Bolz
5bb4a3edec vulkan: fix validation error about VK_PIPELINE_CREATE_CAPTURE_STATISTICS_BIT_KHR (#16086) 2025-09-21 08:23:37 +02:00
Jeff Bolz
c0b45097c3 rename optimize_graph to graph_optimize (#16082) 2025-09-18 13:46:17 -05:00
Eve
cb5bb6cc05 vulkan: automatically remove unsupported devices (#15976)
* remove unsupported vulkan devices

* make this happen during selection instead

* pass by reference
2025-09-17 09:35:37 +02:00
Ruben Ortlam
261e6a20ff Vulkan: Clean up mul_mm shader (#15987)
* vulkan: move mul_mm dequantization steps into a separate file and functions

* improve mul_mm vector load code

* fix debug mode issues and warnings
2025-09-14 16:56:28 +02:00
Jeff Bolz
b9c9c9f789 vulkan: initialize vulkan-hpp to allow using extension function pointers (#15705)
Use this to query register count for shader compiles on NVIDIA. Currently
this is only for performance debug, but it could eventually be used in some
heuristics like split_k.
2025-09-13 17:23:30 +02:00
Ruben Ortlam
304ac5693d Vulkan iGPU device selection overhaul and PCI ID API support (#15947)
* vulkan: implement ggml igpu device type, implement pci id support

* fix compiler warning

* prevent printf overflow warning
2025-09-12 13:24:21 +02:00
Mathieu Baudier
6c88ad8fa7 vulkan: Make device memory check more portable (#15939) 2025-09-12 09:06:20 +02:00
Diego Devesa
360d6533db ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type (#15797)
* ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type

ggml-backend : add device id to device props

llama : only use iGPU devices if there are no GPU devices

llama : do not use multiple devices from different backends with the same device id
2025-09-11 22:47:38 +02:00
Ruben Ortlam
ae355f6f71 vulkan: throw the oom error instead of no memory type found (#15905) 2025-09-09 22:26:03 +02:00
Jeff Bolz
4f63cd705c vulkan: Fix OOB accesses in soft_max_back (#15861) 2025-09-09 14:41:15 +02:00
lksj92hs
ed54e32558 Workaround for subgroup arithmetic failing on MoltenVK with AMD GPUs (issue 15846) (#15886) 2025-09-09 14:01:15 +02:00
Jeff Bolz
e68aa10d8f vulkan: sort graph to allow more parallel execution (#15850)
* vulkan: sort graph to allow more parallel execution

Add a backend proc to allow the backend to modify the graph. The
vulkan implementation looks at which nodes depend on each other
and greedily reorders them to group together nodes that don't
depend on each other. It only reorders the nodes, doesn't change
the contents of any of them.

With #15489, this reduces the number of synchronizations needed.

* call optimize_graph per-split
2025-09-09 02:10:07 +08:00
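An illustrative sketch of the greedy idea: collect a wave of nodes whose producers have all been emitted, schedule the wave, and repeat, so nodes that don't depend on each other end up adjacent and need fewer synchronizations. Simplified node type, assumes an acyclic graph; not the backend's implementation.

```cpp
#include <cstddef>
#include <vector>

struct node {
    std::vector<int> deps; // indices of the nodes this node reads from
};

// Greedy wave-by-wave reorder: nodes inside one wave are mutually independent.
// Only the order changes; the contents of the nodes are untouched.
static std::vector<int> reorder_greedy(const std::vector<node> & nodes) {
    std::vector<int>  order;
    std::vector<bool> emitted(nodes.size(), false);
    while (order.size() < nodes.size()) {
        std::vector<int> wave;
        for (size_t i = 0; i < nodes.size(); ++i) {
            if (emitted[i]) continue;
            bool ready = true;
            for (int d : nodes[i].deps) ready = ready && emitted[d];
            if (ready) wave.push_back((int) i); // all producers already scheduled
        }
        for (int i : wave) {
            emitted[i] = true;
            order.push_back(i);
        }
    }
    return order;
}
```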
Xuan-Son Nguyen
9fcb29f22f ggml: allow casting between f32 and i32 (#15783)
* ggml: allow casting between f32 and i32

* fix cuda

* add vulkan

* fix CPU non-cont

* add non-cont test case

* add note

* extend test number range

* correct note

* add cont version for vulkan
2025-09-08 12:33:01 +02:00
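A minimal sketch of the element-wise semantics, assuming the usual C-style truncating conversion for f32 to i32; plain C++ reference code, not one of the backend kernels.

```cpp
#include <cstdint>
#include <vector>

// f32 -> i32: values are converted element-wise (fractional part dropped).
static std::vector<int32_t> cast_f32_to_i32(const std::vector<float> & src) {
    std::vector<int32_t> dst(src.size());
    for (size_t i = 0; i < src.size(); ++i) dst[i] = (int32_t) src[i];
    return dst;
}

// i32 -> f32: exact for values small enough to be represented in a float.
static std::vector<float> cast_i32_to_f32(const std::vector<int32_t> & src) {
    std::vector<float> dst(src.size());
    for (size_t i = 0; i < src.size(); ++i) dst[i] = (float) src[i];
    return dst;
}
```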
Jeff Bolz
3976dfbe00 vulkan: support im2col_3d (#15795) 2025-09-07 13:50:26 -05:00
Jeff Bolz
c97b5e5854 vulkan: Support pad_ext (#15794) 2025-09-07 19:00:49 +02:00
leejet
0a1b3982cd ggml: add ops for WAN video model (cuda && cpu) (#15669)
* add conv3d support

* add ggml_pad_ext for cpu & cuda backend

* cuda/cpu: add im2col_3d support

* cuda: make im2col a little faster

* fix cuda pad/scale/im2col3d

* make im2col_3d faster

* gguf: support loading tensors which n_dims > GGML_MAX_DIMS

* fix cuda get_rows

* avoid ggml_conv_3d conflict

* correct GGML_OP_COUNT assertion

* avoid build failure

* avoid build failure on MacOS

* cuda: remove unnecessary MIN define

* fix cpu im2col_3d

* adjust the code style

* cuda: use simpler loop in get_rows

* add test_im2col_3d to test-backend-ops

* test-backend-ops.cpp: remove trailing whitespace

* cpu: im2col_3d support non continuous src

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>

* fix test_im2col_3d

* remove unused variables

* cuda: get_rows: dfloat2 -> float2

* add test_pad_ext to test-backend-ops.cpp

* add gguf_init_from_file_ext impl

* Revert "gguf: support loading tensors which n_dims > GGML_MAX_DIMS"

This reverts commit d8377a0a37.

* Revert "add gguf_init_from_file_ext impl"

This reverts commit d9f1d13208.

* update ggml_backend_vk_device_supports_op

* fix ggml_backend_vk_device_supports_op

* update other backend supports op for ggml_pad_ext

* metal/opencl/sycl/vulkan: fix GGML_OP_PAD check in supports_op

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-09-04 10:38:49 +02:00
Ruben Ortlam
dff7551bfd vulkan: fix mmv subgroup16 selection (#15775) 2025-09-03 21:55:10 +01:00
Jeff Bolz
0fce7a1248 vulkan: don't use std::string in load_shaders, to improve compile time (#15724)
* vulkan: don't use std::string in load_shaders, to improve compile time

* keep the string version for those calls that use it
2025-09-03 20:33:15 +02:00
Daniel Bevenius
8227695d7a vulkan : update ggml_vk_instance_validation_ext_available (#15666)
* vulkan : update ggml_vk_instance_validation_ext_available

This commit updates ggml_vk_instance_validation_ext_available() to
check for VK_EXT_validation_features instead of
VK_KHR_portability_enumeration.

Based on how the returned boolean is used later in the code (to enable
both the validation layer and the VK_EXT_validation_features extension),
it appears the function may have been intended to check for the
validation layer features extension.

* remove try/catch

This was left over from a previous iteration where I was explicitly
querying for a specific validation layer first, which would throw.

* update warning message about validation layers
2025-09-03 20:24:50 +02:00
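A hedged sketch of the availability check described in this commit: scan the instance extension list for VK_EXT_validation_features. Not the exact function from ggml-vulkan.

```cpp
#include <vulkan/vulkan.hpp>
#include <string>

// Return true when the VK_EXT_validation_features instance extension is present,
// which is the condition for enabling the validation layer plus that extension.
static bool validation_features_ext_available() {
    for (const auto & ext : vk::enumerateInstanceExtensionProperties()) {
        std::string name = ext.extensionName;
        if (name == VK_EXT_VALIDATION_FEATURES_EXTENSION_NAME) {
            return true;
        }
    }
    return false;
}
```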
Shin-myoung-serp
0014fb4add ggml vulkan: add hardsigmoid and hardswish operations (#15762) 2025-09-03 20:22:55 +02:00
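For reference, the standard definitions of these two activations (as used in MobileNetV3/PyTorch), which the new shaders presumably evaluate element-wise:

```cpp
#include <algorithm>

// hardsigmoid(x) = clamp(x / 6 + 0.5, 0, 1); hardswish(x) = x * hardsigmoid(x)
static inline float hardsigmoid(float x) { return std::min(1.0f, std::max(0.0f, x / 6.0f + 0.5f)); }
static inline float hardswish(float x)   { return x * hardsigmoid(x); }
```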
Jeff Bolz
25f1045f07 vulkan: Fix macro parameter order for f32 matmul shaders (#15716) 2025-09-02 14:37:01 +08:00
Gilad S.
d4d8dbe383 vulkan: use memory budget extension to read memory usage (#15545)
* vulkan: use memory budget extension to read memory usage

* fix: formatting and names

* formatting

* fix: detect and cache memory budget extension availability on init

* fix: read `budgetprops.heapBudget` instead of `heap.size` when memory budget extension is available

* style: lints
2025-09-01 21:17:42 +02:00
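A hedged sketch of reading per-heap usage and budget through VK_EXT_memory_budget, as this commit describes (preferring heapBudget over the static heap size); assumes the extension was detected at init and is not the backend's actual code.

```cpp
#include <vulkan/vulkan.hpp>
#include <cstdint>

struct heap_budget {
    vk::DeviceSize budget; // how much the driver suggests this process may allocate
    vk::DeviceSize usage;  // how much this process is currently using
};

// Query VK_EXT_memory_budget data for one heap. heap_index must be below the
// device's memoryHeapCount and the extension must be supported.
static heap_budget query_heap_budget(vk::PhysicalDevice phys, uint32_t heap_index) {
    auto chain = phys.getMemoryProperties2<vk::PhysicalDeviceMemoryProperties2,
                                           vk::PhysicalDeviceMemoryBudgetPropertiesEXT>();
    const auto & budget = chain.get<vk::PhysicalDeviceMemoryBudgetPropertiesEXT>();
    // Prefer heapBudget over memoryHeaps[i].size, which ignores other GPU users.
    return { budget.heapBudget[heap_index], budget.heapUsage[heap_index] };
}
```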
Ruben Ortlam
fec7911f8f vulkan: disable large mmv subgroups on older Nvidia GPUs (#15717) 2025-09-01 20:58:35 +02:00
Ruben Ortlam
02c1813517 Vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants (#14903)
* vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants

* vulkan: use subgroup operations for quantize_q8_1 shader

* vulkan: add q8_1_x4 type with 128-bit alignment, use in mul_mat_vecq shader

* vulkan: use q8_1_x4 blocks in mul_mmq shader

* vulkan: do 8 calculations per invocation instead of 32 in mul_mat_vecq, similar to mul_mat_vec

* vulkan: tune mul_mat_vecq performance for Intel

* vulkan: fix quantizing issue when tensor is not divisible by 128

* vulkan: adapt integer dot mmv to mmv small m optimization (#15355)

* vulkan: allow all subgroup modes for mmv and mmvq

* vulkan: use prealloc intermediate reuse for mmvq path

* vulkan: tune mmvq for Intel, AMD GCN and Nvidia RTX 3090

* vulkan: adapt mmv quantize_y path to conditional sync logic

* vulkan: disable q8_0 mmvq on Nvidia

* vulkan: enable q8_0 on Nvidia pre-turing

* fix prealloc sync condition

* fix llvmpipe subgroup 8 issue
2025-09-01 16:19:07 +02:00
Jeff Bolz
bbbf5ecccb vulkan: handle large sizes for get_rows (#15686) 2025-08-31 10:13:27 +02:00
Jeff Bolz
c37052ab4d vulkan: mul_mat_id coopmat2 optimizations (#15546)
* vulkan: mul_mat_id coopmat2 optimizations

Add a path for when the tile fits in BN/2, similar to what we have for mul_mat.

Only call fetch_scales/store_scales once per QUANT_K block, and once at the
beginning in case start_k is not aligned.

* Also add a path for BN/4 - worth a couple more percent
2025-08-31 09:06:43 +02:00
Daniel Bevenius
5c16b9c87d vulkan : remove unused portability_enumeration_ext variable (#15679)
This commit removes the portability_enumeration_ext variable from the
ggml_vk_instance_portability_enumeration_ext_available function as it
is initialized to false but never modified, making it redundant.
2025-08-31 08:46:42 +02:00
Jeff Bolz
b97c9edc59 vulkan: Allow fallback to sysmem memory when vidmem is full (#15649)
* vulkan: Allow fallback to sysmem memory when vidmem is full

* vulkan: Add env var GGML_VK_ALLOW_SYSMEM_FALLBACK
2025-08-31 08:30:54 +02:00
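A hedged sketch of the fallback idea: try DEVICE_LOCAL memory first and, if the allocation fails because video memory is exhausted, retry with host-visible system memory. Memory-type selection is simplified and this is not the backend's allocator.

```cpp
#include <vulkan/vulkan.hpp>
#include <optional>
#include <stdexcept>

// Find a memory type index matching the requested property flags.
static std::optional<uint32_t> find_memory_type(vk::PhysicalDevice phys, uint32_t type_bits,
                                                vk::MemoryPropertyFlags wanted) {
    vk::PhysicalDeviceMemoryProperties mem = phys.getMemoryProperties();
    for (uint32_t i = 0; i < mem.memoryTypeCount; ++i) {
        if ((type_bits & (1u << i)) && (mem.memoryTypes[i].propertyFlags & wanted) == wanted) {
            return i;
        }
    }
    return std::nullopt;
}

// Try vidmem first, then fall back to plain host-visible (system) memory.
static vk::UniqueDeviceMemory alloc_with_fallback(vk::PhysicalDevice phys, vk::Device dev,
                                                  const vk::MemoryRequirements & req) {
    const vk::MemoryPropertyFlags candidates[] = {
        vk::MemoryPropertyFlagBits::eDeviceLocal,
        vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent,
    };
    for (vk::MemoryPropertyFlags props : candidates) {
        auto type = find_memory_type(phys, req.memoryTypeBits, props);
        if (!type) continue;
        try {
            return dev.allocateMemoryUnique({ req.size, *type });
        } catch (const vk::OutOfDeviceMemoryError &) {
            // vidmem exhausted: fall through and retry with the next candidate
        }
    }
    throw std::runtime_error("no suitable memory type or out of device memory");
}
```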
Jeff Bolz
696fccf354 vulkan: Skip syncing for prealloc_y when it is reused (#15544) 2025-08-30 11:11:22 +02:00
Jeff Bolz
34bdbbd7c2 vulkan: Remove splitting for mul_mat_id (#15568)
row_ids only needs to hold the BN rows for the current tile.
2025-08-26 06:42:44 +02:00
Ruben Ortlam
4d917cd4f6 vulkan: fix min subgroup 16 condition for mmid subgroup optimization (#15565) 2025-08-25 17:56:59 +02:00
Ruben Ortlam
043fb27d38 vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices (#15524)
* vulkan: use subgroup function for mul_mat_id shader even without coopmat

* vulkan: fix compile warnings

* vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id

* vulkan: disable subgroup mul_mat_id on devices with subgroups < 16
2025-08-24 19:36:36 +02:00