Commit Graph

160 Commits

Author SHA1 Message Date
Jeff Bolz
00d5282c7f vulkan: lock accesses of pinned_memory vector (#14333) 2025-06-28 17:17:09 +02:00
Jeff Bolz
ceb1bf5a34 vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (#14427)
This setting needs to be passed through to vulkan-shaders-gen
2025-06-27 22:35:30 -05:00
bandoti
a01047b041 cmake: regen vulkan shaders when shaders-gen sources change (#14398)
* Add shaders-gen sources as target deps
2025-06-26 13:46:53 -03:00
Markus Tavenrath
bb16041cae Add support for VK_EXT_debug_utils to add labels to Vulkan objects. (#13792)
* Add support for VK_EXT_debug_utils to add labels to Vulkan objects. In step 1 compute pipelines are getting labeled.

* remove #ifdef for debug utils and add queue marker.
2025-06-21 08:17:12 +02:00
0cc4m
10bb545c5b Vulkan: Set device max size for host memory to avoid OOM warning and fallback to CPU buffer (#14249) 2025-06-19 09:15:42 +02:00
bandoti
c46503014d cmake: remove shader-gen step-targets from ggml-vulkan (#14226)
* Remove step-targets from vulkan-shaders-gen

* Unset DESTDIR when building vulkan-shaders-gen
2025-06-17 22:33:25 +02:00
bandoti
0dbcabde8c cmake: clean up external project logic for vulkan-shaders-gen (#14179)
* Remove install step for vulkan-shaders-gen

* Add install step to normalize msvc with make

* Regenerate modified shaders at build-time
2025-06-16 10:32:13 -03:00
Jeff Bolz
c89c2d1ab9 vulkan: mutex around vkQueueSubmit (#14127)
This fixes the remaining crash in test-thread-safety on my system.
2025-06-16 08:21:08 +02:00
Jeff Bolz
bd248d4dc7 vulkan: Better thread-safety for command pools/buffers (#14116)
This change moves the command pool/buffer tracking into a vk_command_pool
structure. There are two instances per context (for compute+transfer) and
two instances per device for operations that don't go through a context.
This should prevent separate contexts from stomping on each other.
2025-06-11 09:48:52 -05:00
Jeff Bolz
1f7d50b293 vulkan: Track descriptor pools/sets per-context (#14109)
Use the same descriptor set layout for all pipelines (MAX_PARAMETER_COUNT == 8)
and move it to the vk_device. Move all the descriptor pool and set tracking to
the context - none of it is specific to pipelines anymore. It has a single vector
of pools and vector of sets, and a single counter to track requests and a single
counter to track use.
2025-06-11 07:19:25 +02:00
0cc4m
97340b4c99 Vulkan: Don't default to CPU device (like llvmpipe), even if no other device is available, to allow fallback to CPU backend (#14099) 2025-06-10 13:01:33 +01:00
Masato Nakasaka
669c13e0f6 vulkan: Enable VK_KHR_cooperative_matrix extension for Intel Xe2 GPUs (#14001)
* allowing B580 and U9-288V

* experimenting code to detect Xe2

* allowing coopmat only for Xe2 GPUs

* fixed comment wording

* fixed comment wording

* removed unnecessary driver check
2025-06-05 16:00:29 +02:00
Jeff Bolz
5a8ae3053c vulkan: automatically deduce size of push constants (#13936) 2025-06-05 07:17:58 +02:00
Ervin Áron Tasnádi
0d3984424f ggml-vulkan: adds support for op CONV_TRANSPOSE_1D (#13813)
* * ggml-vulkan: adds op CONV_TRANSPOSE_1D

* test-backend-ops: adds more spohisticated tests for CONV_TRANSPOSE_1D

* Missing barrier added to shader.
Number of additional tests reduced to 108.

* * Fixes typo in variable name.

* Removes extra whitespaces.

* Adds int64->int32 casts to prevent possible warnings.

* Problem size reduced in tests to pass tests with llvmpipe.

* supports_op condition moved from unintended position
2025-06-04 22:02:00 +02:00
Jeff Bolz
7e00e60ef8 vulkan: fix warnings in perf logger querypool code (#13937) 2025-06-03 20:30:22 +02:00
Kai Pastor
108009f5c7 vulkan : Remove unexpected ; (ggml/1253) 2025-06-01 13:43:57 +03:00
Jeff Bolz
bef8176387 vulkan: use timestamp queries for GGML_VULKAN_PERF (#13817)
Also change it to be controlled by an env var rather than cmake flag
2025-05-27 18:39:07 +02:00
Jeff Bolz
fef693dc6b vulkan: mark IM2COL as supporting non-contig (#13783) 2025-05-26 06:02:07 +02:00
Jeff Bolz
1dcd01960c vulkan: support CPY from any type to itself (#13695)
Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.
2025-05-23 06:45:02 +02:00
Jeff Bolz
c10ed6cbcc vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (#13696) 2025-05-23 06:33:45 +02:00
Judd
a127ff1780 use LOG_WARN to replace std::cerr (#13657) 2025-05-23 06:33:08 +02:00
Eve
fb1cab201c vulkan: fix warnings (#13626)
* small fixes

* remove ifdef
2025-05-20 21:35:16 +00:00
0cc4m
8960efd0a6 Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (#13607) 2025-05-19 17:54:08 +02:00
Gilad S.
e3a7cf6c5b cmake: use the current build config for vulkan-shaders-gen (#13595)
* fix: use the current build config for `vulkan-shaders-gen`

* fix: only pass a valid build type to `--config`
2025-05-17 15:26:43 -03:00
Jeff Bolz
2f5a4e1e09 vulkan: move common FA code to flash_attn_base.comp (#13556)
* vulkan: move common FA code to flash_attn_base.comp

* vulkan: move common FA index/stride setup code to flash_attn_base.comp

* build fix
2025-05-17 09:14:55 +02:00
Jeff Bolz
4f41ee11d6 vulkan: use scalar FA rather than coopmat2 when N==1 (#13554) 2025-05-17 08:35:47 +02:00
bandoti
09d13d94fb cmake: simplify vulkan shader test logic (#13263) 2025-05-14 07:53:57 -03:00
Jeff Bolz
24e86cae72 vulkan: KHR_coopmat flash attention (#13506)
This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.
2025-05-14 11:55:26 +02:00
Jeff Bolz
ab3971f2a0 vulkan: workaround FA compile failures on macos (#13517) 2025-05-14 06:15:50 +02:00
Jeff Bolz
dc1d2adfc0 vulkan: scalar flash attention implementation (#13324)
* vulkan: scalar flash attention implementation

* vulkan: always use fp32 for scalar flash attention

* vulkan: use vector loads in scalar flash attention shader

* vulkan: remove PV matrix, helps with register usage

* vulkan: reduce register usage in scalar FA, but perf may be slightly worse

* vulkan: load each Q value once. optimize O reduction. more tuning

* vulkan: support q4_0/q8_0 KV in scalar FA

* CI: increase timeout to accommodate newly-supported tests

* vulkan: for scalar FA, select between 1 and 8 rows

* vulkan: avoid using Float16 capability in scalar FA
2025-05-10 08:07:07 +02:00
Jeff Bolz
02115dcd9a vulkan: Allow up to 4096 elements for mul_mat_id row_ids (#13326)
This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:

GGML_ASSERT(nei0 * nei1 <= 3072);

The tensor is 8 x 512. Increase this array size to accommodate.
2025-05-09 09:23:41 +02:00
Jeff Bolz
8ae5ebcf85 vulkan: Additional type support for unary, binary, and copy (#13266)
Support f16->f32 copy.
Support f16->f16 and f32->f32 unary ops.
Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.
2025-05-04 07:17:16 +02:00
Georgi Gerganov
b34443923c sync : ggml (#13268)
* vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204)

* vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW)

* review: remove src_x/y < 0 checks; add performance tests

* sync : ggml

ggml-ci

* vulkan : fix lint (#0)

---------

Co-authored-by: Acly <aclysia@gmail.com>
2025-05-02 20:54:30 +03:00
Jeff Bolz
79f26e9e12 vulkan: Add bfloat16 support (#12554)
* vulkan: Add bfloat16 support

This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16.
The extension is required for coopmat multiply support, but matrix-vector
multiply trivially promotes bf16 to fp32 and doesn't require the extension.
The copy/get_rows shaders also don't require the extension.

It's probably possible to fall back to non-coopmat and promote to fp32 when
the extension isn't supported, but this change doesn't do that.

The coopmat support also requires a glslc that supports the extension, which
currently requires a custom build.

* vulkan: Support bf16 tensors without the bf16 extension or coopmat support

Compile a variant of the scalar mul_mm shader that will promote the bf16
values to float, and use that when either the bf16 extension or the coopmat
extensions aren't available.

* vulkan: bfloat16 fixes (really works without bfloat16 support now)

* vulkan: fix spirv-val failure and reenable -O
2025-05-01 20:49:39 +02:00
Jeff Bolz
fc727bcdd5 vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader (#13191)
* vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader
2025-05-01 20:19:31 +02:00
Jeff Bolz
e5007a5edf vulkan: use uint array index to avoid glslang bug (#13193) 2025-04-30 14:38:37 +02:00
Eve
b3b6d862cf vulkan: matmul gcn tuning (#13016)
* tune matmul for gcn

* this one is more power efficient

* Update ggml/src/ggml-vulkan/ggml-vulkan.cpp

Co-authored-by: 0cc4m <picard12@live.de>

* disable this tune for the proprietary driver

---------

Co-authored-by: 0cc4m <picard12@live.de>
2025-04-24 09:18:33 +02:00
Jeff Bolz
66168204be vulkan: support noncontiguous rms_norm (#13031) 2025-04-20 10:50:02 +02:00
Georgi Gerganov
2f74c354c0 graph : make FA compatible with MLA + add initial Metal kernels (#12953)
* graph : make mla compatible with FA

* metal : add exp FA kernels for DeepSeek models

ggml-ci

* llama : minor naming updates

ggml-ci

* ggml : disable FA for DS head sizes

* tests : add FA tests for MLA shapes

ggml-ci
2025-04-17 18:16:36 +03:00
Jeff Bolz
015022bb53 vulkan: enable coopmat2 FA gqa and split_k optimizations more often (#12931)
The grouped query attention optmization doesn't require a power of two ratio,
the only thing relying on it was the modulo operation written as bitwise &.

split_k need not depend on gqa_ratio - enable it any time there's only one
workgroup in the X dimension. The shader gets the split index from the x coord,
and multiple workgroups in the X dimension (pre-split) indicates a larger
FA operation that wouldn't need splitting.
2025-04-16 20:37:25 +02:00
Jeff Bolz
a4837577aa vulkan: use aligned loads for flash attention mask (#12853)
Rewrite the stride logic for the mask tensor in the FA shader to force the
stride to be aligned, to allow using more efficient loads.
2025-04-12 10:44:48 +02:00
Diego Devesa
fe92821ea9 ggml : add bilinear upscale support (ggml/1185) 2025-04-11 00:17:47 +03:00
Jeff Bolz
0090950f67 vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (#12833)
q4_k and q5_k had a lot of redundant global loads where the same 16B of
scale information is repeatedly loaded and decoded during each loop iteration.
This change restructures the loops to more explicitly iterate over whole
blocks in the outer loop (with unrolled inner loop) and to copy/decode the
scale data into shared memory once at the start of each outer loop. The copy
is pipelined so the scale load from global memory is relatively cheap.

This improves q4_k/q5_k model prompt processing performance by around 5-7%.
I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k
and hurt for q4_0.

The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped
variants isn't used as often as it originally was (e.g. due to the padded_N
change), so I trimmed it down to offset some of the new complexity of the
semi-manual loop unrolling.
2025-04-09 07:25:08 +02:00
Jeff Bolz
7ecd780b1a vulkan: Use fp16 for the flash attention P*V multiplication (#12783)
This is consistent with the ggml-cuda behavior and the mul_mat fallback.
2025-04-09 07:12:57 +02:00
Jeff Bolz
0c74b04376 vulkan: fix NaN issue in flash attention shader (#12776)
Use -FLT_MAX/2 rather than -inf as the initial value for computing the maximum.
2025-04-06 11:03:47 +02:00
Jeff Bolz
80b717d493 vulkan: Use unclamped loads for flash attention mask (#12720)
nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple
of the number of rows in the matrix. The KV dim is a multiple of the number of
columns for the aligned shader.
2025-04-06 10:47:13 +02:00
0cc4m
6bf28f0111 Vulkan: Tune Vulkan mmq int dot shader for performance (#12767) 2025-04-05 18:04:03 +02:00
Ronny Brendel
9ac4d611d0 cmake: fix ggml-shaders-gen compiler paths containing spaces (#12747)
fixes error for compiler paths with spaces
2025-04-04 10:12:40 -03:00
Jeff Bolz
74d4f5b041 vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (#12630)
There seems to be a bubble waking up from waitForFences, which costs a few
percent performance and also increased variance in performance. This change
inserts an "almost_ready" fence when the graph is about 80% complete and we
waitForFences for the almost_ready fence and then spin (with _mm_pauses) waiting
for the final fence to be signaled.
2025-04-04 07:54:35 +02:00
Jeff Bolz
35e592eb30 vulkan: set cmake minimum and project name in vulkan-shaders (#12744) 2025-04-04 07:53:20 +02:00