Commit Graph

143 Commits

Author SHA1 Message Date
Jeff Bolz
fef693dc6b vulkan: mark IM2COL as supporting non-contig (#13783) 2025-05-26 06:02:07 +02:00
Jeff Bolz
1dcd01960c vulkan: support CPY from any type to itself (#13695)
Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.
2025-05-23 06:45:02 +02:00
Jeff Bolz
c10ed6cbcc vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (#13696) 2025-05-23 06:33:45 +02:00
Judd
a127ff1780 use LOG_WARN to replace std::cerr (#13657) 2025-05-23 06:33:08 +02:00
Eve
fb1cab201c vulkan: fix warnings (#13626)
* small fixes

* remove ifdef
2025-05-20 21:35:16 +00:00
0cc4m
8960efd0a6 Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (#13607) 2025-05-19 17:54:08 +02:00
Gilad S.
e3a7cf6c5b cmake: use the current build config for vulkan-shaders-gen (#13595)
* fix: use the current build config for `vulkan-shaders-gen`

* fix: only pass a valid build type to `--config`
2025-05-17 15:26:43 -03:00
Jeff Bolz
2f5a4e1e09 vulkan: move common FA code to flash_attn_base.comp (#13556)
* vulkan: move common FA code to flash_attn_base.comp

* vulkan: move common FA index/stride setup code to flash_attn_base.comp

* build fix
2025-05-17 09:14:55 +02:00
Jeff Bolz
4f41ee11d6 vulkan: use scalar FA rather than coopmat2 when N==1 (#13554) 2025-05-17 08:35:47 +02:00
bandoti
09d13d94fb cmake: simplify vulkan shader test logic (#13263) 2025-05-14 07:53:57 -03:00
Jeff Bolz
24e86cae72 vulkan: KHR_coopmat flash attention (#13506)
This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.
2025-05-14 11:55:26 +02:00
Jeff Bolz
ab3971f2a0 vulkan: workaround FA compile failures on macos (#13517) 2025-05-14 06:15:50 +02:00
Jeff Bolz
dc1d2adfc0 vulkan: scalar flash attention implementation (#13324)
* vulkan: scalar flash attention implementation

* vulkan: always use fp32 for scalar flash attention

* vulkan: use vector loads in scalar flash attention shader

* vulkan: remove PV matrix, helps with register usage

* vulkan: reduce register usage in scalar FA, but perf may be slightly worse

* vulkan: load each Q value once. optimize O reduction. more tuning

* vulkan: support q4_0/q8_0 KV in scalar FA

* CI: increase timeout to accommodate newly-supported tests

* vulkan: for scalar FA, select between 1 and 8 rows

* vulkan: avoid using Float16 capability in scalar FA
2025-05-10 08:07:07 +02:00
Jeff Bolz
02115dcd9a vulkan: Allow up to 4096 elements for mul_mat_id row_ids (#13326)
This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:

GGML_ASSERT(nei0 * nei1 <= 3072);

The tensor is 8 x 512. Increase this array size to accommodate.
2025-05-09 09:23:41 +02:00
Jeff Bolz
8ae5ebcf85 vulkan: Additional type support for unary, binary, and copy (#13266)
Support f16->f32 copy.
Support f16->f16 and f32->f32 unary ops.
Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.
2025-05-04 07:17:16 +02:00
Georgi Gerganov
b34443923c sync : ggml (#13268)
* vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204)

* vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW)

* review: remove src_x/y < 0 checks; add performance tests

* sync : ggml

ggml-ci

* vulkan : fix lint (#0)

---------

Co-authored-by: Acly <aclysia@gmail.com>
2025-05-02 20:54:30 +03:00
Jeff Bolz
79f26e9e12 vulkan: Add bfloat16 support (#12554)
* vulkan: Add bfloat16 support

This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16.
The extension is required for coopmat multiply support, but matrix-vector
multiply trivially promotes bf16 to fp32 and doesn't require the extension.
The copy/get_rows shaders also don't require the extension.

It's probably possible to fall back to non-coopmat and promote to fp32 when
the extension isn't supported, but this change doesn't do that.

The coopmat support also requires a glslc that supports the extension, which
currently requires a custom build.

* vulkan: Support bf16 tensors without the bf16 extension or coopmat support

Compile a variant of the scalar mul_mm shader that will promote the bf16
values to float, and use that when either the bf16 extension or the coopmat
extensions aren't available.

* vulkan: bfloat16 fixes (really works without bfloat16 support now)

* vulkan: fix spirv-val failure and reenable -O
2025-05-01 20:49:39 +02:00
Jeff Bolz
fc727bcdd5 vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader (#13191)
* vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader
2025-05-01 20:19:31 +02:00
Jeff Bolz
e5007a5edf vulkan: use uint array index to avoid glslang bug (#13193) 2025-04-30 14:38:37 +02:00
Eve
b3b6d862cf vulkan: matmul gcn tuning (#13016)
* tune matmul for gcn

* this one is more power efficient

* Update ggml/src/ggml-vulkan/ggml-vulkan.cpp

Co-authored-by: 0cc4m <picard12@live.de>

* disable this tune for the proprietary driver

---------

Co-authored-by: 0cc4m <picard12@live.de>
2025-04-24 09:18:33 +02:00
Jeff Bolz
66168204be vulkan: support noncontiguous rms_norm (#13031) 2025-04-20 10:50:02 +02:00
Georgi Gerganov
2f74c354c0 graph : make FA compatible with MLA + add initial Metal kernels (#12953)
* graph : make mla compatible with FA

* metal : add exp FA kernels for DeepSeek models

ggml-ci

* llama : minor naming updates

ggml-ci

* ggml : disable FA for DS head sizes

* tests : add FA tests for MLA shapes

ggml-ci
2025-04-17 18:16:36 +03:00
Jeff Bolz
015022bb53 vulkan: enable coopmat2 FA gqa and split_k optimizations more often (#12931)
The grouped query attention optmization doesn't require a power of two ratio,
the only thing relying on it was the modulo operation written as bitwise &.

split_k need not depend on gqa_ratio - enable it any time there's only one
workgroup in the X dimension. The shader gets the split index from the x coord,
and multiple workgroups in the X dimension (pre-split) indicates a larger
FA operation that wouldn't need splitting.
2025-04-16 20:37:25 +02:00
Jeff Bolz
a4837577aa vulkan: use aligned loads for flash attention mask (#12853)
Rewrite the stride logic for the mask tensor in the FA shader to force the
stride to be aligned, to allow using more efficient loads.
2025-04-12 10:44:48 +02:00
Diego Devesa
fe92821ea9 ggml : add bilinear upscale support (ggml/1185) 2025-04-11 00:17:47 +03:00
Jeff Bolz
0090950f67 vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (#12833)
q4_k and q5_k had a lot of redundant global loads where the same 16B of
scale information is repeatedly loaded and decoded during each loop iteration.
This change restructures the loops to more explicitly iterate over whole
blocks in the outer loop (with unrolled inner loop) and to copy/decode the
scale data into shared memory once at the start of each outer loop. The copy
is pipelined so the scale load from global memory is relatively cheap.

This improves q4_k/q5_k model prompt processing performance by around 5-7%.
I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k
and hurt for q4_0.

The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped
variants isn't used as often as it originally was (e.g. due to the padded_N
change), so I trimmed it down to offset some of the new complexity of the
semi-manual loop unrolling.
2025-04-09 07:25:08 +02:00
Jeff Bolz
7ecd780b1a vulkan: Use fp16 for the flash attention P*V multiplication (#12783)
This is consistent with the ggml-cuda behavior and the mul_mat fallback.
2025-04-09 07:12:57 +02:00
Jeff Bolz
0c74b04376 vulkan: fix NaN issue in flash attention shader (#12776)
Use -FLT_MAX/2 rather than -inf as the initial value for computing the maximum.
2025-04-06 11:03:47 +02:00
Jeff Bolz
80b717d493 vulkan: Use unclamped loads for flash attention mask (#12720)
nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple
of the number of rows in the matrix. The KV dim is a multiple of the number of
columns for the aligned shader.
2025-04-06 10:47:13 +02:00
0cc4m
6bf28f0111 Vulkan: Tune Vulkan mmq int dot shader for performance (#12767) 2025-04-05 18:04:03 +02:00
Ronny Brendel
9ac4d611d0 cmake: fix ggml-shaders-gen compiler paths containing spaces (#12747)
fixes error for compiler paths with spaces
2025-04-04 10:12:40 -03:00
Jeff Bolz
74d4f5b041 vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (#12630)
There seems to be a bubble waking up from waitForFences, which costs a few
percent performance and also increased variance in performance. This change
inserts an "almost_ready" fence when the graph is about 80% complete and we
waitForFences for the almost_ready fence and then spin (with _mm_pauses) waiting
for the final fence to be signaled.
2025-04-04 07:54:35 +02:00
Jeff Bolz
35e592eb30 vulkan: set cmake minimum and project name in vulkan-shaders (#12744) 2025-04-04 07:53:20 +02:00
Jeff Bolz
1c059995e0 vulkan: Fix missing cmake logic for dot product extension (#12721) 2025-04-03 10:08:26 -05:00
Jeff Bolz
f01bd02376 vulkan: Implement split_k for coopmat2 flash attention. (#12627)
When using group query attention, we have one workgroup per KV batch and this
can be very few workgroups (e.g. just 8 in some models). Enable split_k to
spread the work across SMs. This helps a lot when the KV cache is large.
2025-04-02 14:25:08 -05:00
bandoti
6f3bd38640 cmake: remove caching from vulkan coopmat checks (#12719) 2025-04-02 14:56:26 -03:00
Jeff Bolz
be0a0f8cae vulkan: Implement grouped query attention in the coopmat2 FA shader (#12559)
When adjacent batches of Q share the same batches of K/V, batch them into
the same workgroup. For example, when:

dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1))

previously we would run 32 workgroups computing 1 result each, now we will
run 8 workgroups computing 4 results each.

This doesn't directly translate to better performance (at least when you have
>=32 SMs), but in a subsequent change I'll enable split_k which will scale much
better with 4x fewer workgroups.
2025-04-02 19:40:32 +02:00
0cc4m
92e3006bb6 Vulkan: Fix mmq int dot float cache size (#12722) 2025-04-02 19:12:30 +02:00
Wagner Bruna
2bb3597e42 vulkan: fix build when glslc doesn't support coopmat (#12683) 2025-04-01 11:38:07 +02:00
0cc4m
a8a1f33567 Vulkan: Add DP4A MMQ and Q8_1 quantization shader (#12135)
* Vulkan: Add DP4A MMQ and Q8_1 quantization shader

* Add q4_0 x q8_1 matrix matrix multiplication support

* Vulkan: Add int8 coopmat MMQ support

* Vulkan: Add q4_1, q5_0 and q5_1 quants, improve integer dot code

* Add GL_EXT_integer_dot_product check

* Remove ggml changes, fix mmq pipeline picker

* Remove ggml changes, restore Intel coopmat behaviour

* Fix glsl compile attempt when integer vec dot is not supported

* Remove redundant code, use non-saturating integer dot, enable all matmul sizes for mmq

* Remove redundant comment

* Fix integer dot check

* Fix compile issue with unsupported int dot glslc

* Update Windows build Vulkan SDK version
2025-03-31 14:37:01 +02:00
Georgi Gerganov
1790e73157 cmake : fix whitespace (#0) 2025-03-31 15:07:32 +03:00
Sandro Hanea
a7724480fd cmake: improve Vulkan cooperative matrix support checks (whisper/2966)
Co-authored-by: Sandro Hanea <me@sandro.rocks>
2025-03-31 15:07:32 +03:00
Georgi Gerganov
b4ae50810e metal : improve FA + improve MoE (#12612)
* ggml : FA with different K, V head sizes (CPU)

ggml-ci

* metal : add FA with HS=192

* metal : extend FA to support different K and V head sizes

ggml-ci

* metal : add FA vector kernels for heads K 192 and V 128

ggml-ci

* ggml : restrict op on other backends to equal head sizes

ggml-ci

* metal : optimize FA-vec kernel

ggml-ci

* metal : FA remove mq registers

* metal : improve MoE mul_mat_id condition

ggml-ci

* metal : fix comments + remove unnecessary addition

ggml-ci

* metal : avoid too much shared memory usage with mul_mat_id

ggml-ci
2025-03-28 20:21:59 +02:00
Icenowy Zheng
b86f600723 vulkan: fix coopmat shader generation when cross-compiling (#12272)
* vulkan: fix coopmat shader generation when cross-compiling

Previously the status of coopmat{,2} support isn't passed to the
vulkan-shaders-gen project building on the host, which leads to build
failure because of the cross-compiling code expecting coopmat{,2}
shaders that didn't get generated.

Fix this by passing the coopmat{,2} support status to vulkan-shaders
subproject.

Signed-off-by: Icenowy Zheng <uwu@icenowy.me>

* Only call coop-mat shaders once

* Fix whitespace

---------

Signed-off-by: Icenowy Zheng <uwu@icenowy.me>
Co-authored-by: bandoti <141645996+bandoti@users.noreply.github.com>
2025-03-28 14:51:06 -03:00
Jeff Bolz
9b169a4d4e vulkan: fix mul_mat_vec failure in backend tests (#12529)
The OOB calculation could be wrong if the last iteration was during one of
the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple
new backend tests that hit this failure on NVIDIA GPUs.
2025-03-24 07:56:17 +01:00
Jeff Bolz
eddfb43850 vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505)
* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders

* vulkan: Optimize mul_mat_vec p021 and nc shaders.

These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).

Using subgroupAdd in the p021 shader also helps, use that conditionally.
2025-03-22 09:40:11 +01:00
stduhpf
4375415b4a Vulkan: RTE rounding for cpy to quant (#12480)
* Vulkan: RTE rounding for cpy to quant

Co-Authored-By: Jeff Bolz <jbolz@nvidia.com>

* remove trailing whitespace

* avoid duplicating pipeline_cpy_f32_quant

* fix copypasting issue

* remove duplicated code

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-03-21 20:34:50 +01:00
Eve
30c42ef5cb vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (#12472) 2025-03-21 20:27:47 +01:00
Jeff Bolz
a9b59288e2 vulkan: optimize iq1 coopmat2 dequant functions (#12427) 2025-03-19 19:56:23 +01:00
Jeff Bolz
c446b2edd2 vulkan: Submit once enough matmul work has been recorded (#12406)
I've been seeing significantly worse performance for tg with flash attention
enabled vs disabled, and it seems to be related to the submit heuristic.
Change the heuristic to check how many bytes worth of weight matrix are
used and flush every 100MB, and ramp up after the first few submits.
This seems to resolve the issue, and also increases perf for non-FA a bit.
2025-03-19 08:26:26 +01:00