- Launch an appropriate number of invocations (next larger power of two).
32 invocations is common and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs not.
- Make the code less branchy and [[unroll]] the loops. In the final code,
I see no branches inside the main loop (only predicated stores) when
needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs descending option when
doing the final stores to memory.
- Copy the values into shared memory, makes them slightly cheaper to access.
* vulkan: fuse adds
Fuse adds that have the same shape, which are common in MoE models.
It will currently fuse up to 6 adds, because we assume no more than
8 descriptors per dispatch. But this could be changed.
* check runtimeDescriptorArray feature
* disable multi_add for Intel due to likely driver bug
* vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id
* vulkan: Support mul_mat_id with f32 accumulators, but they are not hooked up
- There's no explicit way to request f32 precision for mul_mat_id, but there
probably should be, and this gets the code in place for that.
- A couple fixes to check_results.
- Remove casts to fp16 in coopmat1 FA shader (found by inspection).
* examples/finetune -opt SGD (stochastic gradient descent) memory opt
add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating
m, v tensors.
support finetune.cpp arg -opt SGD (or sgd). (default adamw as before)
llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch)
when using SGD instead of 19gb (55 sec/epoch) using adamw.
(wikipedia 100 lines finetune)
(
using the same GPU memory, adamw can only do before OOM 512
batch/context, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00
SGD is superior, though it converges slower, with max before OOM 1728
batch/context (esp see the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)
note: when finetuning long enough (or w/ enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting')
-lr-half (halflife) option useful for SGD to avoid oscillation or
super slow underdamped learning (makes setting -lr more forgiving).
terminal -lr for now is set by lr-halvings i.e. if you want at most
1/8 the inital -lr you set -lr-halvings 3.
note: objective loss not directly comparable between adamw, sgd? -
check perplexity or accuracy or consider relative improvements
for convergence
new finetune args -wd 1e-9 to enable weight decay in sgd or adamw,
and max -epochs N (default 2 as before)
cache (1 - wd*alpha) in 'adamw' opt struct -
no noticeable perf benefit, disabled (still done
for new SGD though)
since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params
would probably be able to change between SGD and AdamW with each epoch
but would need to use adamw for the first (unconfirmed - no cmdline arg
to set such a policy yet)
test-opt checks adamw as before and now sgd (except for a few disabled
tests for sgd only; probably just needs logging values and adding
alternate reference values); tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)
* Vulkan: Implement GGML_OP_OPT_STEP_SGD
* tests: Fix OPT_STEP_SGD test-backend-ops
* SGD op param store weight-decay and not 1-alpha*wd
* minor + cosmetic changes
* fix vulkan sgd
* try CI fix
---------
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* vulkan: optimizations for direct convolution
- Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill
the GPU. The new size should be amenable to using coopmat, too.
- Fix shmem bank conflicts. 16B padding should work with coopmat.
- Some explicit loop unrolling.
- Skip math/stores work for parts of the tile that are OOB.
- Apply fastdiv opt.
- Disable shuffles for NV.
* Three tiles sizes for CONV_2D, and a heuristic to choose
* reallow collectives for pre-Turing
* make SHMEM_PAD a spec constant
* fixes for intel perf - no shmem padding, placeholder shader core count
* shader variants with/without unrolling
* 0cc4m's fixes for AMD perf
Co-authored-by: 0cc4m <picard12@live.de>
---------
Co-authored-by: 0cc4m <picard12@live.de>
* ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan
* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly
with gemm (no need for im2col),
* test-backend-ops: adds test_case_ref to check the validity/performance of ops
against reference implementations having different graphs, adds tests
* * Performance fixes: minimized branch divergence, uses collectives to
eliminate redundant calculation, macros removed.
* Kernel shared memory size check
* Updates test-backend-ops to support graphs for performance
measurement.
* * Apple/Win32 compile errors fixed
* Subgroup size used to determine tile size -> fixes llvmpipe errors.
* Collectives disabled by default.
* Intel support is disabled as the performance is poor.
* Conv2d enabled for Intel with disabled collectives, disabled for Apple
* test-backend-ops modifications are reverted
* Trailing spaces and missing override fixed.
* Triggering pipeline relaunch.
* Code formatted with .clang-format.
* vulkan: support SET_ROWS
Add variants of the copy_to_quant shader that do the SET_ROWS operation.
Change these shaders to spread the work across the workgroup.
The memory access pattern is probably not great (one thread per quant block),
but should be fine for now.
* vulkan: optimize set_rows
Larger workgroups for non-quant types.
Set "norepeat" (there is manual repeat logic).
Use fastmod.
* vulkan: allow unclamped loads in coopmat2 mul_mat_id shader
* vulkan: increase coopmat2 mul_mat_id tile size
* vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path
* vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)
* ggml : add ggml_scale_bias
* ggml_vec_mad1_f32
* add more simd
* add CUDA
* sycl
* vulkan
* cann (placeholder)
* opencl
* will this fix cpu?
* fix cuda
* suggestions from coderabbit
* fix cann compile error
* vDSP_vsmsa
* rm __ARM_FEATURE_SVE
* use memcpy for op params
* make code looks more consistent
* use scalar for __ARM_FEATURE_SVE
* add x param to ggml_vec_mad1_f32
* vulkan: allow FA split_k with smaller KV values
* vulkan: spread split_k_reduce work across more threads
k_num can get rather large. Use the whole workgroup to reduce the M/L values.
Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).
* vulkan: Handle updated FA dim2/3 definition
Pack mask boolean and n_head_log2 into a single dword to keep the push
constant block under the 128B limit.
* handle null mask for gqa
* allow gqa with dim3>1
* implement unary REGLU/GEGLU/SWIGLU cpu ops
* relax constraints
* duplicate shape of source
* fix ggml_vec_geglu_f16
* special case gated ops
* implement unary REGLU/GEGLU/SWIGLU cuda ops
* tighten constraints again
* refactor into GGML_GLU_OP
* metal : add glu kernels
ggml-ci
* add CUDA_GLU_BLOCK_SIZE [no ci]
* more constraints and use 64bit ints
ggml-ci
* 64bit multiplication [no ci]
* implement swapped variants (cpu/cuda)
* update comment [no ci]
ggml-ci
* Vulkan: Add GLU ops and shaders
* SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
* ggml : implement GLU for split up/gate (#14181)
* implement GLU for split up/gate
* add tests for ggml_glu_split
* Vulkan: Implement glu_split logic and shader support
* add split to logging [no ci]
* SYCL: refactor element_size ops and add split up and gate support to gated kernels
* SYCL: switch GEGLU to use tanh approximation
---------
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>
* GGML: increase OP count in assertion
* Refactor: Optimize SYCL element-wise operations with unary function inlining
This commit refactors the SYCL element-wise operations to improve performance by:
- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
- Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using `__dpct_inline__` to encourage compiler inlining.
- Minor code cleanup and consistency improvements.
The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.
* vulkan: Increase workgroup size for GLU, for performance (#14345)
* vulkan: Increase workgroup size for GLU, for performance
* vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup
* merge fix
* metal : add support for split and swap
ggml-ci
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* vulkan: Add fusion support for RMS_NORM+MUL
- Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
- Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
- Add detection logic and basic fusion logic in ggml-vulkan.
- Add some testing support for fusion. Rather than computing one node at a time, allow
for computing the whole graph and just testing one node's results. Add rms_norm_mul tests
and enable a llama test.
* extract some common fusion logic
* fix -Winconsistent-missing-override
* move ggml_can_fuse to a common function
* build fix
* C and C++ versions of can_fuse
* move use count to the graph to avoid data races and double increments when used in multiple threads
* use hash table lookup to find node index
* change use_counts to be indexed by hash table slot
* minimize hash lookups
style fixes
* last node doesn't need single use.
fix type.
handle mul operands being swapped.
* remove redundant parameter
---------
Co-authored-by: slaren <slarengh@gmail.com>
* * ggml-vulkan: adds op CONV_TRANSPOSE_1D
* test-backend-ops: adds more spohisticated tests for CONV_TRANSPOSE_1D
* Missing barrier added to shader.
Number of additional tests reduced to 108.
* * Fixes typo in variable name.
* Removes extra whitespaces.
* Adds int64->int32 casts to prevent possible warnings.
* Problem size reduced in tests to pass tests with llvmpipe.
* supports_op condition moved from unintended position
This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.
* vulkan: scalar flash attention implementation
* vulkan: always use fp32 for scalar flash attention
* vulkan: use vector loads in scalar flash attention shader
* vulkan: remove PV matrix, helps with register usage
* vulkan: reduce register usage in scalar FA, but perf may be slightly worse
* vulkan: load each Q value once. optimize O reduction. more tuning
* vulkan: support q4_0/q8_0 KV in scalar FA
* CI: increase timeout to accommodate newly-supported tests
* vulkan: for scalar FA, select between 1 and 8 rows
* vulkan: avoid using Float16 capability in scalar FA
This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:
GGML_ASSERT(nei0 * nei1 <= 3072);
The tensor is 8 x 512. Increase this array size to accommodate.
* vulkan: Add bfloat16 support
This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16.
The extension is required for coopmat multiply support, but matrix-vector
multiply trivially promotes bf16 to fp32 and doesn't require the extension.
The copy/get_rows shaders also don't require the extension.
It's probably possible to fall back to non-coopmat and promote to fp32 when
the extension isn't supported, but this change doesn't do that.
The coopmat support also requires a glslc that supports the extension, which
currently requires a custom build.
* vulkan: Support bf16 tensors without the bf16 extension or coopmat support
Compile a variant of the scalar mul_mm shader that will promote the bf16
values to float, and use that when either the bf16 extension or the coopmat
extensions aren't available.
* vulkan: bfloat16 fixes (really works without bfloat16 support now)
* vulkan: fix spirv-val failure and reenable -O