lhez
52e5d421f1
opencl: fix rms_norm_mul ( #17250 )
* opencl: use subgroup reduce for reduction in rms_norm_mul
* opencl: add comment about workgroup size
2025-11-15 17:40:14 -08:00
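The subgroup reduce mentioned above replaces a shared-memory tree with register-level shuffles. A minimal sketch of the shuffle-down pattern such a reduction follows (plain Python standing in for OpenCL lanes; the lane count and the add operation are illustrative, not the kernel's actual code):

```python
def subgroup_reduce_add(lane_vals):
    """Emulate a shuffle-down tree reduction across one subgroup.

    Each round, lane i adds in the value held by lane i + offset,
    halving the active width until lane 0 holds the full sum.
    Assumes a power-of-two subgroup width (e.g. 32 or 64).
    """
    vals = list(lane_vals)
    offset = len(vals) // 2
    while offset:
        for lane in range(offset):
            vals[lane] += vals[lane + offset]
        offset //= 2
    return vals[0]
```

In the rms_norm_mul kernel the reduced quantity would be the per-row sum of squares used to form 1/rms.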
shaofeiqi
4db5641210
opencl: add kernel to handle mat mul in attention to improve encoding speed ( #17181 )
* Add mul_mm_f16_f32_kq_kqv kernel
* Add ggml_cl_mul_mat_kq_kqv_adreno func
* fix whitespace
* remove unused variable
* remove redundant
* refactor and clean up
* remove trailing whitespace
2025-11-15 17:33:10 -08:00
lhez
ece0f5c177
opencl: add fastdiv and use it in set_rows, ported from cuda ( #17090 )
* opencl: add fastdiv for mm q8_0
* opencl: use uint4 for fastdiv vals
* opencl: use fastdiv for set_rows
* opencl: do not use fastdiv for q8_0 mm
2025-11-10 15:00:13 -08:00
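Fastdiv replaces runtime integer division by a loop-invariant divisor with a multiply-high and a shift, with the constants precomputed once on the host. A hedged sketch of the usual round-up scheme the CUDA helper uses (the ported kernel's exact packing into uint4 is not shown):

```python
def fastdiv_init(d):
    """Precompute (multiplier, shift) for 32-bit unsigned division by constant d >= 1."""
    l = (d - 1).bit_length()                      # ceil(log2(d))
    mp = ((1 << 32) * ((1 << l) - d)) // d + 1    # fits in 32 bits
    return mp, l

def fastdiv(n, mp, l):
    """floor(n / d) via mul-high + add + shift; no hardware divide."""
    hi = (n * mp) >> 32                           # __umulhi(n, mp) on the GPU
    return (hi + n) >> l                          # Python ints sidestep the 32-bit carry
```

The same pair also yields a cheap modulo: `n - fastdiv(n, mp, l) * d`.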
Acly
1032256ec9
cuda/vulkan : bicubic interpolation ( #17022 )
* vulkan : implement upscale with bicubic interpolation
* cuda : implement upscale with bicubic interpolation
* tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests
* make the OpenCL backend report the op as unsupported in that case so tests don't fail
* print scale mode & flags in test-backend-ops
2025-11-10 10:19:39 +01:00
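Bicubic upscaling samples a 4x4 neighborhood with cubic convolution weights, applied separably once per axis. A sketch of the 1D Catmull-Rom form (a = -0.5, the common choice; the backends' exact coefficients are assumptions here):

```python
def cubic_interp(p0, p1, p2, p3, t):
    """Catmull-Rom cubic through 4 equally spaced samples; 0 <= t <= 1
    gives the value between p1 and p2."""
    return 0.5 * (2.0 * p1
                  + (-p0 + p2) * t
                  + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t * t
                  + (-p0 + 3.0 * p1 - 3.0 * p2 + p3) * t * t * t)
```

It interpolates (t=0 returns p1, t=1 returns p2) and reproduces linear ramps exactly, which makes it easy to spot-check against bilinear output.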
lhez
c5023daf60
opencl: support imrope ( #16914 )
* opencl: support imrope
* opencl: fix whitespace
2025-11-03 11:47:57 -08:00
Acly
10640e31aa
ggml : fix interpolate with align-corners and ne=1 ( #16700 )
* ggml : fix interpolate with align-corners and ne=1
* avoid division by zero if one of the spatial dimensions is 1
* cpu, cuda, opencl returned correct result anyway due to clamp
* vulkan didn't clamp for align-corners so results were broken
* fix clang warning
2025-10-27 21:50:22 +01:00
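The underlying coordinate mapping, with the guard the fix above adds when a spatial dimension is 1 (a sketch of the standard align-corners formula, not the backends' exact code):

```python
def src_coord(i_dst, ne_in, ne_out, align_corners):
    """Map an output index to a (fractional) input coordinate."""
    if align_corners:
        # (ne_in - 1) / (ne_out - 1) divides by zero when ne_out == 1;
        # the fix maps that degenerate axis to coordinate 0
        if ne_out == 1:
            return 0.0
        return i_dst * (ne_in - 1) / (ne_out - 1)
    # half-pixel centers, clamped into the valid input range
    s = ne_in / ne_out
    return min(max((i_dst + 0.5) * s - 0.5, 0.0), ne_in - 1.0)
```

The clamp in the non-aligned branch is why cpu, cuda, and opencl happened to return correct results even before the fix.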
lhez
6ea37f5739
opencl: fix warnings and clean up profiling ( #16688 )
* opencl: remove unused headers, fix warnings
* opencl: clean up profiling, only keep kernel time
2025-10-20 22:26:17 -07:00
Shawn Gu
81387858f1
opencl: transposed gemm/gemv moe kernel with mxfp4,f32 ( #16602 )
* opencl: transposed gemm/gemv moe kernel with mxfp4,f32
* add restore kernel for moe transpose
* fix trailing whitespaces
* resolve compilation warnings
2025-10-17 17:55:32 -07:00
lhez
0cb7a0683b
opencl: add q8_0 mm support ( #16469 )
* opencl: add mm_q8_0_f32
* opencl: fix data loading for incomplete tile
* opencl: use q8_0 mm for larger matrix
* opencl: add some tests to cover the path
2025-10-15 10:51:04 -07:00
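For reference, a q8_0 block is 34 bytes: an f16 scale followed by 32 signed 8-bit quants, dequantized as x[i] = d * qs[i]. A minimal Python decoder for one block:

```python
import struct

def dequant_q8_0(block: bytes):
    """Decode one q8_0 block: 2-byte little-endian f16 scale + 32 int8 quants."""
    assert len(block) == 34
    (d,) = struct.unpack('<e', block[:2])   # 'e' = IEEE half precision
    qs = struct.unpack('<32b', block[2:])
    return [d * q for q in qs]
```

A q8_0 x f32 matmul kernel walks these blocks, multiplying each quant by the shared scale as it accumulates.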
Aman Gupta
120bf7046d
CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion ( #16577 )
2025-10-14 07:48:08 -07:00
lhez
5016b72862
opencl: fix build targeting CL 2 ( #16554 )
2025-10-13 11:50:37 -07:00
lhez
7c156df414
opencl: support pad_ext ( #15888 )
2025-09-30 10:45:45 -07:00
lhez
d1c84a662d
opencl: support ne3 in get_rows ( #15866 )
2025-09-30 09:55:13 -07:00
Sigbjørn Skjæret
3ecb2f671a
ggml : implement set_rows with i32 index ( #16159 )
* implement set_rows with i32 index
* template fix
* test quantized path
warnings--
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* forgotten name change
* deduplicate cuda/sycl and test-fix
* indent++
* vulkan: support set_rows with i32 index type (#16162 )
* disable i32 index for webgpu for now
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
Co-authored-by: Jeff Bolz <jbolz@nvidia.com >
2025-09-22 19:13:00 +02:00
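ggml_set_rows scatters source rows into a destination at positions given by an index tensor; the change above lets that index be i32 as well as i64. The semantics, sketched on plain lists:

```python
def set_rows(dst, src, idx):
    """dst[idx[i], :] = src[i, :] — the index dtype (i32 or i64) does not
    change the semantics, only the storage of idx."""
    for i, row in enumerate(idx):
        dst[row] = list(src[i])
    return dst
```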
lhez
51f5a45fbe
opencl: fix concat crash on win arm64 with Adreno ( #15944 )
2025-09-21 16:42:10 -07:00
lhez
c4510dc937
opencl: initial q8_0 mv support ( #15732 )
2025-09-21 14:48:44 -07:00
Shawn Gu
3edd87cd05
opencl: optimize mxfp4 kernels ( #16037 )
- flatten mxfp4 and use a packed fp4->fp16 bit-wise conversion function (replacing the LUT)
- MoE kernel optimizations
---------
Co-authored-by: Li He <lih@qti.qualcomm.com >
2025-09-18 12:03:34 -07:00
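An mxfp4 block packs 32 e2m1 (fp4) values behind one shared e8m0 scale. The optimization above converts packed fp4 pairs to fp16 bit-wise instead of going through a lookup table; the table it replaces is small enough to show (values per the OCP MX spec; the block layout below is illustrative):

```python
# e2m1 magnitudes; bit 3 of the nibble is the sign
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_fp4(nib):
    return (-1.0 if nib & 0x8 else 1.0) * E2M1[nib & 0x7]

def decode_mxfp4(e8m0, nibbles):
    """Shared e8m0 scale (a biased exponent, bias 127) times each fp4 value."""
    scale = 2.0 ** (e8m0 - 127)
    return [scale * decode_fp4(n) for n in nibbles]
```

The bit-wise kernel computes the same mapping from the nibble's sign/exponent/mantissa fields directly, avoiding a gather per element.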
Jeff Bolz
c0b45097c3
rename optimize_graph to graph_optimize ( #16082 )
2025-09-18 13:46:17 -05:00
Jeff Bolz
e68aa10d8f
vulkan: sort graph to allow more parallel execution ( #15850 )
* vulkan: sort graph to allow more parallel execution
Add a backend proc to allow the backend to modify the graph. The
vulkan implementation looks at which nodes depend on each other
and greedily reorders them to group together nodes that don't
depend on each other. It only reorders the nodes, doesn't change
the contents of any of them.
With #15489 , this reduces the number of synchronizations needed.
* call optimize_graph per-split
2025-09-09 02:10:07 +08:00
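The greedy reorder described above can be read as wave scheduling: repeatedly emit every remaining node whose inputs are already scheduled, so mutually independent nodes end up adjacent and can run without a barrier between them. A toy sketch (node names and the dependency encoding are illustrative, not the Vulkan backend's data structures):

```python
def greedy_reorder(nodes, deps):
    """Group independent graph nodes together; deps maps node -> set of
    nodes it reads from. Only the order changes, never a node's contents."""
    scheduled, order = set(), []
    pending = list(nodes)
    while pending:
        wave = [n for n in pending if deps.get(n, set()) <= scheduled]
        assert wave, "dependency cycle"
        order.extend(wave)
        scheduled.update(wave)
        pending = [n for n in pending if n not in scheduled]
    return order
```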
leejet
0a1b3982cd
ggml: add ops for WAN video model (cuda && cpu) ( #15669 )
* add conv3d support
* add ggml_pad_ext for cpu & cuda backend
* cuda/cpu: add im2col_3d support
* cuda: make im2col a little faster
* fix cuda pad/scale/im2col3d
* make im2col_3d faster
* gguf: support loading tensors which n_dims > GGML_MAX_DIMS
* fix cuda get_rows
* avoid ggml_conv_3d conflict
* correct GGML_OP_COUNT assertion
* avoid build failure
* avoid build failure on macOS
* cuda: remove unnecessary MIN define
* fix cpu im2col_3d
* adjust the code style
* cuda: use simpler loop in get_rows
* add test_im2col_3d to test-backend-ops
* test-backend-ops.cpp: remove trailing whitespace
* cpu: im2col_3d support non-contiguous src
Co-authored-by: Jeff Bolz <jbolz@nvidia.com >
* fix test_im2col_3d
* remove unused variables
* cuda: get_rows: dfloat2 -> float2
* add test_pad_ext to test-backend-ops.cpp
* add gguf_init_from_file_ext impl
* Revert "gguf: support loading tensors which n_dims > GGML_MAX_DIMS"
This reverts commit d8377a0a37.
* Revert "add gguf_init_from_file_ext impl"
This reverts commit d9f1d13208.
* update ggml_backend_vk_device_supports_op
* fix ggml_backend_vk_device_supports_op
* update other backend supports op for ggml_pad_ext
* metal/opencl/sycl/vulkan: fix GGML_OP_PAD check in supports_op
---------
Co-authored-by: Jeff Bolz <jbolz@nvidia.com >
2025-09-04 10:38:49 +02:00
rmatif
820bc98531
opencl: add hs=40 to FA ( #15758 )
2025-09-03 23:30:28 -07:00
rmatif
97669e4073
opencl: add attn sinks support for FA kernels ( #15706 )
2025-09-01 23:26:53 -07:00
rmatif
86076f92de
OpenCL: add fused group_norm/norm, mul, add ( #15314 )
* add fused group_norm/norm, mul, add
* fix spacing
* revert rms_norm logic
* fix trailing whitespace
2025-08-26 23:36:05 -07:00
lhez
f7207b0415
opencl: fix support ops condition for rms_norm ( #15560 )
2025-08-25 14:18:09 -07:00
lhez
fb22dd07a6
opencl: mark argsort unsupported if cols exceed workgroup limit ( #15375 )
2025-08-19 11:25:51 -07:00
rmatif
912ff8c119
OpenCL: add initial FA support ( #14987 )
* add F16/F16 fa support
* fix kernel init
* use mad instead of fma
* use inline function
* mark FA with sinks as unsupported for now
* add pragma unroll to loops
2025-08-16 01:05:55 -07:00
lhez
e2c1bfff53
opencl: add initial mxfp4 support via mv ( #15270 )
* opencl: add reference `mul_mv_mxfp4_f32`
* opencl: add reference `mul_mv_id` for mxfp4
* Q4_0 transpose fix for Adreno
---------
Co-authored-by: shawngu-quic <shawngu@qti.qualcomm.com >
2025-08-15 09:52:14 -07:00
rmatif
60a7658810
opencl: allow mixed f16/f32 add ( #15140 )
2025-08-12 02:42:41 -07:00
AN Long
cd6983d56d
ggml : fix field name when new ggml_backend ( #14944 )
2025-08-08 14:37:22 +02:00
lhez
aaa3d07ae7
opencl: support sink in soft_max (attn sinks) ( #15152 )
2025-08-07 21:47:03 -07:00
rmatif
756cfea826
fix profiling crash ( #15072 )
2025-08-06 14:17:51 -07:00
lhez
e725a1a982
opencl: add swiglu_oai and add_id ( #15121 )
* opencl: add `swiglu-oai`
* opencl: add `add_id`
* opencl: add missing `add_id.cl`
2025-08-06 12:12:17 -07:00
Georgi Gerganov
fd1234cb46
llama : add gpt-oss ( #15091 )
* oai moe
* compat with new checkpoint
* add attn sink impl
* add rope scaling yarn
* logits match with latest transformers code
* wip chat template
* rm trailing space
* use ggml_scale_bias
* rm redundant is_swa_all
* convert interleaved gate_up
* graph : fix activation function to match reference (#7 )
* vocab : handle o200k_harmony special tokens
* ggml : add attention sinks support (#1 )
* llama : add attn sinks
* ggml : add attn sinks
* cuda : add attn sinks
* vulkan : add support for sinks in softmax
remove unnecessary return
* ggml : add fused swiglu_oai op (#11 )
* ggml : add fused swiglu_oai op
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* update CUDA impl
* cont : metal impl
* add vulkan impl
* test-backend-ops : more test cases, clean up
* llama : remove unfused impl
* remove extra lines
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
---------
Co-authored-by: slaren <slarengh@gmail.com >
* repack mxfp4 upon conversion
* clean up a bit
* enable thinking
* add quick hack to render only some special tokens
* fix bf16 conversion
* remove vocab hack
* webui ok
* support chat parsing for gpt-oss
* fix webui
* direct mapping mxfp4, FINALLY
* force using mxfp4
* properly use lazy tensor
* ggml : add mxfp4
ggml : use e8m0 conversion instead of powf
Co-authored-by: Diego Devesa <slarengh@gmail.com >
change kvalues_mxfp4 table to match e2m1 (#6 )
metal : remove quantization for now (not used)
cuda : fix disabled CUDA graphs due to ffn moe bias
vulkan : add support for mxfp4
cont : add cm2 dequant
* ggml : add ggml_add_id (#13 )
* ggml : add ggml_add_id
* add cuda impl
* llama : add weight support check for add_id
* perf opt
* add vulkan impl
* rename cuda files
* add metal impl
* allow in-place ggml_add_id
* llama : keep biases on CPU with --cpu-moe
* llama : fix compile error
ggml-ci
* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw
ggml-ci
* cleanup
ggml-ci
* sycl : fix supports_op for MXFP4
ggml-ci
* fix Unknown reasoning format
* ggml-cpu : fix AVX build
ggml-ci
* fix hip build
ggml-ci
* cuda : add mxfp4 dequantization support for cuBLAS
ggml-ci
* ggml-cpu : fix mxfp4 fallback definitions for some architectures
ggml-ci
* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co >
Co-authored-by: slaren <slarengh@gmail.com >
2025-08-05 22:10:36 +03:00
lhez
5c0eb5ef54
opencl: fix adreno compiler detection logic ( #15029 )
2025-08-02 19:51:18 +02:00
lhez
1c872f71fb
opencl: add f16 for add, sub, mul, div ( #14984 )
2025-08-01 13:15:44 +02:00
lhez
6e6725459a
opencl: add mul_mat_f32_f32_l4_lm and mul_mat_f16_f32_l4_lm ( #14809 )
2025-07-30 14:56:55 -07:00
lhez
ce111d39d6
opencl: add fused rms_norm_mul ( #14841 )
* opencl: add fused `rms_norm` + `mul`
* opencl: improve workgroup size for `rms_norm_mul`
2025-07-25 17:12:13 +02:00
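The fusion folds the elementwise mul into the normalization pass, saving a kernel launch and a round trip through memory. Numerically it computes the following (a sketch; the eps default is illustrative, the real value comes from the op's parameters):

```python
import math

def rms_norm_mul(x, w, eps=1e-6):
    """Fused rms_norm(x) * w: one sum-of-squares reduction, one scaled write."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * wi for v, wi in zip(x, w)]
```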
lhez
8e6f8bc875
opencl: remove unreachable return ( #14806 )
2025-07-22 08:53:30 +02:00
rmatif
6c9ee3b17e
opencl: add conv2d kernel ( #14403 )
* add conv2d kernel
* fix trailing whitespace
* whitespace fix
* handle f16 input and f16 kernel, more opt
* resolve conflicts
* use enqueue_ndrange_kernel
2025-07-21 10:03:19 -07:00
Georgi Gerganov
05fec5bd29
ggml : add build-time message to remind about ggml_set_rows ( #14661 )
ggml-ci
2025-07-13 10:36:33 +03:00
rmatif
6bdda13981
opencl: add tiled mul_mat_f16_f32 ( #14535 )
* add tiled mul_mat_f16_f32
* fix trailing whitespace
* add insightful comments
2025-07-10 14:58:12 -07:00
lhez
0b8855775c
opencl: add set_rows for f16 and f32 ( #14547 )
* opencl: add `set_rows` for `f16` and `f32`
* opencl: better choose workgroup size for `set_rows`
2025-07-10 11:48:52 -07:00
Xuan-Son Nguyen
98bab638fb
ggml : add ggml_scale_bias ( #14417 )
* ggml : add ggml_scale_bias
* ggml_vec_mad1_f32
* add more simd
* add CUDA
* sycl
* vulkan
* cann (placeholder)
* opencl
* will this fix cpu?
* fix cuda
* suggestions from coderabbit
* fix cann compile error
* vDSP_vsmsa
* rm __ARM_FEATURE_SVE
* use memcpy for op params
* make code looks more consistent
* use scalar for __ARM_FEATURE_SVE
* add x param to ggml_vec_mad1_f32
2025-07-09 18:16:12 +02:00
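ggml_scale_bias extends ggml_scale with an additive term, i.e. y = x*s + b elementwise (the fused multiply-add that ggml_vec_mad1_f32 provides on CPU). A trivial reference sketch:

```python
def scale_bias(x, s, b):
    # y[i] = x[i] * s + b, computed in a single pass over the tensor
    return [v * s + b for v in x]
```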
Sigbjørn Skjæret
6681688146
opencl: add GELU_ERF ( #14476 )
2025-07-04 23:24:56 -07:00
Sigbjørn Skjæret
28657a8229
ggml : implement GEGLU_ERF and GEGLU_QUICK ops ( #14445 )
2025-07-03 23:07:22 +02:00
lhez
bee28421be
opencl : broadcast for soft_max ( #14510 )
2025-07-03 20:22:24 +02:00
Georgi Gerganov
a70c8a0c4b
kv-cache : use ggml_set_rows ( #14285 )
* kv-cache : use ggml_set_rows
ggml-ci
* graph : separate k and v indices
ggml-ci
* cont : remove redundant ifs
ggml-ci
* kv-cache : improve find_slot impl
* kv-cache : bounds-check when accessing slot_info indices
* kv-cache : add comments
ggml-ci
* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends
ggml-ci
2025-07-03 10:53:35 +03:00
zhouwg
307e79d33d
opencl : fix possible buffer overflow in dump_tensor ( #14490 )
2025-07-02 14:38:10 +02:00
Eric Zhang
c8a4e470f6
opencl : skip empty nodes on cgraph compute ( #14491 )
2025-07-02 13:00:04 +02:00
lhez
603e43dc91
opencl : update upscale to support align corners ( #14488 )
2025-07-02 09:07:42 +02:00