* vulkan : implement upscale with bicubic interpolation
* cuda : implement upscale with bicubic interpolation
* tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests
* adapt OpenCL backend to not support the OP in that case so tests don't fail
* print scale mode & flags in test-backend-ops
* vulkan: use all device-local heaps for memory availability reporting
Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>
* use all available heaps for iGPU memory reporting
* Allow multiple memory types per buffer request for devices with split heaps
---------
Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>
This change combines the rms_norm+mul and rope+view+set_rows fusions to
allow fusing the whole sequence together. This comes up in Qwen3, Bailing,
and some other models.
The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the
same time, which needs to hold the lock. To be safe, hold the lock for all of
ggml_vk_load_shaders.
* vulkan: remove the need for the dryrun
Allocate pipelines and descriptor sets when requested.
Reallocate the prealloc buffers when needed, and flush any pending work
before reallocating.
For rms_partials and total_mul_mat_bytes, use the sizes computed the last time
the graph was executed.
* remove dryrun parameters
* vulkan: fuse mul_mat+add and mul_mat_id+add_id
The fusion is only applied for the mat-vec mul paths.
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix 32b build
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Experimenting crash fix
* added assert for aborting and fixed comment
* changed to check if a pipeline is empty or not
* Moved function in class definition
* replaced with is_empty
* Modified is_empty to check only unaligned pipelines
This pattern appears in a lot of models, the rope operation is applied right
before storing into the KV cache (usually on the K tensor).
Add a path to some of the rope shaders that computes the destination address
based on the set_rows tensor. Compile variants of the shader with D_TYPE of
f16 (the usual KV cache type).
Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs
the fourth for the row indices.
Add fused_ops_write_mask to indicate which intermediate tensors need to write
their results to memory. Skipping writing the roped K value helps to allow more
nodes to run concurrently.
Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It
rarely starts out that way in the graph.
Add new backend tests.
* vulkan: Update topk_moe fusion to handle gpt's late softmax
Based on #16649.
* Add ggml_check_edges
* Add sync logging to show fusion effects
* handle clamp added in #16655
* Update ggml/src/ggml-impl.h
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* ggml : fix interpolate with align-corners and ne=1
* avoid division by zero if one of the spatial dimensions is 1
* cpu, cuda, opencl returned correct result anyway due to clamp
* vulkan didn't clamp for align-corners so results were broken
* fix clang warning
* fix: deduplicate and deprioritize Microsoft Direct3D12 vulkan devices from the `vulkan-dozen` driver
* style: indent
* fix: decrease priority
* fix: switch to `||`
ggml_vk_create_buffer_temp is not used anywhere, and it is the only
caller for ggml_vk_pool_malloc.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
* vulkan: implement SSM scan operation
Add State Space Model scan operation to the Vulkan backend.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
* vulkan: implement SSM conv operation
Add State Space Model conv operation to the Vulkan backend.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
---------
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
* vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times
* support dep-files so shaders are recompiled if their included files change
* rename shader files which are used as "headers" to use .glsl extension
* move glslc extension detection shaders to separate folders
* the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled
* vulkan : only write embedded shader .hpp/.cpp when they change
* avoid recompiling ggml-vulkan.cpp when editing shaders
* pass single --source argument instead of --input-dir & --filter to shader gen
* check for source file match earlier
* fix hang in vulkan-shaders-gen when there are compilation errors
* early out did not decrement compile_count
* clean up
* fix glslc integer dot product test
* unconditionally write the embedded shader cpp output
* replace output filepath in generated dep-files to match output in CMakeLists
---------
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE
Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers.
The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed
beyond that limit. This allows > 4GB buffers to be allocated on some
implementations (e.g. NVIDIA) and tensors this large can be used for im2col
and mul_mat.
For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange.
I'm not sure this check is ideal, but we always use these buffers as a single
full size binding and the limit may be smaller than maxMemoryAllocationSize
or maxBufferSize, so I think this is reasonable.
Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range.
The maxStorageBufferRange may be smaller than the maxBufferSize or
maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and
it's invalid usage if VK_WHOLE_SIZE computes a range larger than
maxStorageBufferRange.
With this change, it should be possible to generate videos using wan networks
in stable-diffusion.cpp.
* vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull
When computing sinks, the cm1 shader was looping r from 0 to Br rather than
to rows_per_thread. I must have copied this from the scalar path (where it is
correct), and somehow it wasn't causing failures on current drivers.
* vulkan: 64-bit im2col
Add variants of the im2col shaders that use buffer_device_address/buffer_reference,
and use 64-bit address calculations. This is needed for large convolutions used in
stable-diffusion.cpp.
* fix validation error for large im2col
* vulkan: handle mat_mul with A matrix > 4GB
This change splits mat_mul operations with huge A matrix into chunks in the M
dimension. This works well for stable-diffusion use cases where the im2col
matrix has very large M.
Fix the order of setting the stride in mul_mm_cm2 - setting the dimension
clobbers the stride, so stride should be set after.
* build fixes
The "Clamp" spec constant is already based on whether KV is a multiple of Bc,
so use that to control whether bounds checking is performed. Add bounds checking
to the scalar and coopmat1 paths. Coopmat2 didn't need any changes (the K/V
tensors are already optionally clamped, nothing else needed to be changed).
The dequantize functions are copy/pasted from mul_mm_funcs.comp with very few
changes - add a_offset and divide iqs by 2. It's probably possible to call
these functions from mul_mm_funcs and avoid the duplication, but I didn't go
that far in this change.