* vulkan: handle mat_mul with A matrix > 4GB
This change splits mat_mul operations with a huge A matrix into chunks along
the M dimension. This works well for stable-diffusion use cases where the
im2col matrix has a very large M.
Fix the order of setting the stride in mul_mm_cm2 - setting the dimension
clobbers the stride, so the stride should be set afterwards.
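A minimal sketch of chunking along M, for illustration only. The names
RowChunk and split_rows are hypothetical and do not come from the Vulkan
backend; the sketch assumes each row of A occupies row_bytes contiguous bytes
and that a single dispatch may address at most max_bytes of A.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one slice of A along the M dimension.
struct RowChunk {
    uint64_t first_row;  // index of the first row in this chunk
    uint64_t num_rows;   // number of rows covered by this chunk
};

// Split m rows of row_bytes bytes each so that no chunk exceeds max_bytes.
std::vector<RowChunk> split_rows(uint64_t m, uint64_t row_bytes, uint64_t max_bytes) {
    assert(row_bytes > 0 && row_bytes <= max_bytes);
    const uint64_t rows_per_chunk = max_bytes / row_bytes;
    std::vector<RowChunk> chunks;
    for (uint64_t row = 0; row < m; row += rows_per_chunk) {
        // The last chunk may be shorter than rows_per_chunk.
        chunks.push_back({row, std::min(rows_per_chunk, m - row)});
    }
    return chunks;
}
```

Each chunk would then be issued as its own mat_mul over the corresponding rows
of A and of the output, leaving B unchanged.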
* build fixes
The dequantize functions are copy/pasted from mul_mm_funcs.comp with very few
changes - add a_offset and divide iqs by 2. It's probably possible to call
these functions from mul_mm_funcs and avoid the duplication, but I didn't go
that far in this change.
* implement set_rows with i32 index
* template fix
* test quantized path
warnings--
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* forgotten name change
* deduplicate cuda/sycl and test-fix
* indent++
* vulkan: support set_rows with i32 index type (#16162)
* disable i32 index for webgpu for now
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* Vulkan: add conv_transpose_2d operation
* Vulkan: fix typo in conv_transpose_2d shader (s0mp, s0L, s1mp, s1L)
* Vulkan: fix incorrect indentation in conv_transpose_2d shader
* Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation
* Vulkan: revert the order of the index calculation and bound check in conv_2d shader
* Vulkan: explicitly check push constants limit in supports_op() for conv_transpose_2d operation.
* Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.
* vulkan: optimize UMA buffer operations and fix driver hangs
The previous implementation was blocking the GPU for extended periods,
causing the i915 driver to reset the context due to the hangcheck
protection.
[32628.443070] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:85dffffb, in llama-server [194114]
[32628.443091] i915 0000:00:02.0: [drm] llama-server[194114] context reset due to GPU hang
* vulkan: implement deferred_memset on UMA
---------
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use this to query register count for shader compiles on NVIDIA. Currently
this is only for performance debug, but it could eventually be used in some
heuristics like split_k.
* ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type
ggml-backend : add device id to device props
llama : only use iGPU devices if there are no GPU devices
llama : do not use multiple devices from different backends with the same device id
* vulkan: sort graph to allow more parallel execution
Add a backend proc to allow the backend to modify the graph. The
vulkan implementation looks at which nodes depend on each other
and greedily reorders them to group together nodes that don't
depend on each other. It only reorders the nodes; it doesn't change
the contents of any of them.
With #15489, this reduces the number of synchronizations needed.
* call optimize_graph per-split
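A simplified sketch of the reordering idea, not the actual Vulkan
implementation. It assumes the graph is given as a list where deps[i] holds
the indices of earlier nodes that node i reads from; group_independent is a
hypothetical name. Each pass collects every node whose dependencies were all
emitted in earlier passes, so nodes emitted in the same pass never depend on
each other.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns a new execution order. Node contents are untouched; only the
// order changes. Input must already be topologically sorted (deps[i] < i).
std::vector<std::size_t> group_independent(const std::vector<std::vector<std::size_t>>& deps) {
    const std::size_t n = deps.size();
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t d : deps[i]) {
            assert(d < i);
        }
    }
    std::vector<bool> emitted(n, false);
    std::vector<std::size_t> order;
    while (order.size() < n) {
        std::vector<std::size_t> group;
        for (std::size_t i = 0; i < n; ++i) {
            if (emitted[i]) continue;
            bool ready = true;
            for (std::size_t d : deps[i]) {
                if (!emitted[d]) { ready = false; break; }
            }
            if (ready) group.push_back(i);
        }
        // Mark the group as emitted only after it is complete, so that
        // members of one group cannot depend on each other.
        for (std::size_t i : group) {
            emitted[i] = true;
            order.push_back(i);
        }
    }
    return order;
}
```

With two independent chains 0 -> 1 and 2 -> 3, the sketch yields 0, 2, 1, 3,
so only one synchronization point is needed between the two groups.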
* ggml: allow casting between f32 and i32
* fix cuda
* add vulkan
* fix CPU non-cont
* add non-cont test case
* add note
* extend test number range
* correct note
* add cont version for vulkan
* vulkan : update ggml_vk_instance_validation_ext_available
This commit updates ggml_vk_instance_validation_ext_available() to
check for VK_EXT_validation_features instead of
VK_KHR_portability_enumeration.
Based on how the returned boolean is used later in the code (to enable
both the validation layer and the VK_EXT_validation_features extension),
it appears the function may have been intended to check for the
validation layer features extension.
* remove try/catch
This was a leftover from a previous iteration where I was explicitly
querying for a specific validation layer first, which would throw.
* update warning message about validation layers
* vulkan: use memory budget extension to read memory usage
* fix: formatting and names
* formatting
* fix: detect and cache memory budget extension availability on init
* fix: read `budgetprops.heapBudget` instead of `heap.size` when memory budget extension is available
* style: lints
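A rough sketch of the budget-versus-size choice, using plain structs instead
of the Vulkan headers so it stays self-contained. HeapInfo, MemInfo and
query_memory are hypothetical names; the budget and usage fields stand in for
the heapBudget and heapUsage arrays of VK_EXT_memory_budget, and treating
free memory as budget minus usage is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical per-heap numbers gathered at init time.
struct HeapInfo {
    uint64_t size;    // static heap size
    uint64_t budget;  // budget reported by the extension (if available)
    uint64_t usage;   // current usage reported by the extension (if available)
};

struct MemInfo {
    uint64_t total;
    uint64_t free;
};

// Prefer the budget when the extension is available; otherwise fall back
// to the static heap size and report the whole heap as free.
MemInfo query_memory(const HeapInfo& heap, bool has_memory_budget) {
    assert(heap.size > 0);
    const uint64_t total = has_memory_budget ? heap.budget : heap.size;
    const uint64_t used  = has_memory_budget ? std::min(heap.usage, total) : 0;
    return {total, total - used};
}
```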
* vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants
* vulkan: use subgroup operations for quantize_q8_1 shader
* vulkan: add q8_1_x4 type with 128-bit alignment, use in mul_mat_vecq shader
* vulkan: use q8_1_x4 blocks in mul_mmq shader
* vulkan: do 8 calculations per invocation instead of 32 in mul_mat_vecq, similar to mul_mat_vec
* vulkan: tune mul_mat_vecq performance for Intel
* vulkan: fix quantizing issue when tensor is not divisible by 128
* vulkan: adapt integer dot mmv to mmv small m optimization (#15355)
* vulkan: allow all subgroup modes for mmv and mmvq
* vulkan: use prealloc intermediate reuse for mmvq path
* vulkan: tune mmvq for Intel, AMD GCN and Nvidia RTX 3090
* vulkan: adapt mmv quantize_y path to conditional sync logic
* vulkan: disable q8_0 mmvq on Nvidia
* vulkan: enable q8_0 on Nvidia pre-Turing
* fix prealloc sync condition
* fix llvmpipe subgroup 8 issue
* vulkan: mul_mat_id coopmat2 optimizations
Add a path for when the tile fits in BN/2, similar to what we have for mul_mat.
Only call fetch_scales/store_scales once per QUANT_K block, and once at the
beginning in case start_k is not aligned.
* Also add a path for BN/4 - worth a couple more percent
This commit removes the portability_enumeration_ext variable from the
ggml_vk_instance_portability_enumeration_ext_available function as it
is initialized to false but never modified, making it redundant.
* vulkan: use subgroup function for mul_mat_id shader even without coopmat
* vulkan: fix compile warnings
* vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id
* vulkan: disable subgroup mul_mat_id on devices with subgroups < 16
The scalar FA shader already handled multiples of 8. The coopmat1 FA
shader assumed 16x16x16 and the shared memory allocations need the HSK
dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation
requires multiples of 16 for N and K, and needs the matrix dimensions
padded and loads clamped.
Store the FA pipelines in a map, indexed by the pipeline state.
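For illustration, padding a dimension to a multiple of 16 and clamping
out-of-range loads can be sketched on the CPU as below. round_up and
load_clamped are hypothetical helpers, and returning zero for an out-of-range
element is an assumption of this sketch rather than a statement about the
shaders.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Round x up to the next multiple of `multiple`.
inline uint32_t round_up(uint32_t x, uint32_t multiple) {
    assert(multiple > 0);
    return ((x + multiple - 1) / multiple) * multiple;
}

// Read element i of a row, treating the padded region as zero.
inline float load_clamped(const std::vector<float>& row, std::size_t i) {
    return i < row.size() ? row[i] : 0.0f;
}
```

For example, a head size of 72 (a multiple of 8 but not of 16) would be padded
to 80, and indices 72 through 79 would read as zero.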
* vulkan: optimize rms_norm, and allow the work to spread across multiple SMs
There are really two parts to this change:
(1) Some optimizations similar to what we have in soft_max, to unroll with
different numbers of iterations.
(2) A fusion optimization where we detect add followed by rms_norm, and make
the add shader atomically accumulate the values^2 into memory. Then the
rms_norm shader can just load that sum. This allows the rms_norm to be
parallelized across multiple workgroups, it just becomes a simple per-element
multiply.
The fusion optimization is currently only applied when the rms_norm is on a
single vector. This previously always ran on a single SM. It could apply more
broadly, but when there are other dimensions the work can already spread across
SMs, and there would be some complexity to tracking multiple atomic sums.
* Change add+rms_norm optimization to write out an array of partial sums
rather than using atomic add, to make it deterministic. The rms_norm
shader fetches a subgroup's worth in parallel and uses subgroupAdd to
add them up.
* complete rebase against fused adds - multi_add shader can also compute partial sums
* fix validation errors
* disable add_rms_fusion for Intel due to possible driver bug
* resolve against #15489, sync after clearing partial sums
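A CPU-side sketch of the partial-sum variant of the add + rms_norm fusion
described above, under stated assumptions: chunk stands in for the number of
elements handled by one workgroup, and add_with_partials and
rms_norm_from_partials are hypothetical names. Because each partial sum has a
single writer and the partials are added in a fixed order, the result does not
depend on scheduling, unlike an atomic accumulation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Compute a + b and, as a side product, one sum of squares per chunk.
std::vector<float> add_with_partials(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     std::size_t chunk,
                                     std::vector<float>& partials) {
    assert(a.size() == b.size() && chunk > 0);
    std::vector<float> sum(a.size());
    partials.assign((a.size() + chunk - 1) / chunk, 0.0f);
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum[i] = a[i] + b[i];
        partials[i / chunk] += sum[i] * sum[i];
    }
    return sum;
}

// rms_norm only has to add up the partials, then scale each element.
void rms_norm_from_partials(std::vector<float>& x,
                            const std::vector<float>& partials,
                            float eps) {
    assert(!x.empty());
    float total = 0.0f;
    for (float p : partials) total += p;  // fixed order, hence deterministic
    const float scale = 1.0f / std::sqrt(total / static_cast<float>(x.size()) + eps);
    for (float& v : x) v *= scale;        // simple per-element multiply
}
```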
Track a list of nodes that need synchronization, and only sync if the new node
depends on them (or overwrites them). This allows some overlap which can
improve performance, and centralizes a big chunk of the synchronization logic.
The remaining synchronization logic involves writes to memory other than the
nodes, e.g. for dequantization or split_k. Each of these allocations has a bool
indicating whether it was in use and needs to be synced. This should be
checked before the allocation is written to, and set to true after it is done
being consumed.
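The node-tracking part can be sketched roughly as follows. This is a toy
model, not the backend code: tensors are represented by integer buffer ids,
Node and SyncTracker are hypothetical names, and only the two cases named
above (the new node reads from, or overwrites, the output of an unsynced
node) trigger a barrier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy node: writes buffer `dst`, reads buffers `srcs`.
struct Node {
    int dst;
    std::vector<int> srcs;
};

class SyncTracker {
public:
    // Returns true if a barrier had to be inserted before `node`.
    bool submit(const Node& node) {
        assert(node.dst >= 0);
        bool need_sync = false;
        for (const Node& p : unsynced_) {
            if (p.dst == node.dst) need_sync = true;      // overwrites p's output
            for (int s : node.srcs) {
                if (s == p.dst) need_sync = true;         // depends on p's output
            }
        }
        if (need_sync) {
            unsynced_.clear();  // a barrier covers everything submitted so far
            ++barriers_;
        }
        unsynced_.push_back(node);
        return need_sync;
    }

    int barriers() const { return barriers_; }

private:
    std::vector<Node> unsynced_;
    int barriers_ = 0;
};
```

Two nodes that read the same input but write different buffers are submitted
without a barrier in between, which is where the overlap comes from.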
* vulkan : support ggml_mean
* vulkan : support sum, sum_rows and mean with non-contiguous tensors
* vulkan : fix subbuffer size not accounting for misalign offset
* tests : add backend-op tests for non-contiguous sum_rows
* cuda : require contiguous src for SUM_ROWS, MEAN support
* sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support
* require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader
- Spread the work across the whole workgroup. The benefit of using more
threads seems to far outweigh the synchronization overhead.
- Specialize the code for when the division is by a power of two.
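The power-of-two specialization can be illustrated with a small sketch. The
FastDiv struct and its helpers are hypothetical; in a shader the choice would
more likely be made at pipeline creation time than by a runtime branch, which
is an assumption here.

```cpp
#include <cassert>
#include <cstdint>

// Precomputed divisor information.
struct FastDiv {
    uint32_t divisor;
    uint32_t shift;  // log2(divisor), valid only when is_pow2 is true
    bool is_pow2;
};

inline FastDiv make_fast_div(uint32_t divisor) {
    assert(divisor > 0);
    FastDiv f{divisor, 0, (divisor & (divisor - 1)) == 0};
    if (f.is_pow2) {
        while ((1u << f.shift) < divisor) ++f.shift;
    }
    return f;
}

// Division becomes a shift when the divisor is a power of two.
inline uint32_t fast_div(const FastDiv& f, uint32_t x) {
    return f.is_pow2 ? (x >> f.shift) : (x / f.divisor);
}

// Remainder becomes a mask when the divisor is a power of two.
inline uint32_t fast_mod(const FastDiv& f, uint32_t x) {
    return f.is_pow2 ? (x & (f.divisor - 1)) : (x % f.divisor);
}
```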
* vulkan: Reuse conversion results in prealloc_y
Cache the pipeline and tensor that were most recently used to fill prealloc_y,
and skip the conversion if the current pipeline/tensor match.
* don't use shared pointer for prealloc_y_last_pipeline_used
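A minimal sketch of the prealloc_y reuse check, assuming the pipeline and
tensor can be identified by raw, non-owning pointers. PreallocYCache and
needs_conversion are hypothetical names, and the sketch ignores anything else
that might invalidate the buffer between uses.

```cpp
#include <cassert>

// Remembers which pipeline and tensor last filled prealloc_y.
struct PreallocYCache {
    const void* last_pipeline = nullptr;  // non-owning
    const void* last_tensor   = nullptr;  // non-owning

    // Returns true if the conversion must run, and records the new pair.
    bool needs_conversion(const void* pipeline, const void* tensor) {
        assert(pipeline != nullptr && tensor != nullptr);
        if (pipeline == last_pipeline && tensor == last_tensor) {
            return false;  // prealloc_y already holds this result
        }
        last_pipeline = pipeline;
        last_tensor   = tensor;
        return true;
    }
};
```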