Jeff Bolz
e68aa10d8f
vulkan: sort graph to allow more parallel execution ( #15850 )
...
* vulkan: sort graph to allow more parallel execution
Add a backend proc that lets the backend modify the graph. The
Vulkan implementation looks at which nodes depend on each other
and greedily reorders them to group together nodes that don't
depend on each other. It only reorders the nodes; it does not
change the contents of any of them.
With #15489, this reduces the number of synchronizations needed.
* call optimize_graph per-split
2025-09-09 02:10:07 +08:00
leejet
0a1b3982cd
ggml: add ops for WAN video model (cuda && cpu) ( #15669 )
...
* add conv3d support
* add ggml_pad_ext for cpu & cuda backend
* cuda/cpu: add im2col_3d support
* cuda: make im2col a little faster
* fix cuda pad/scale/im2col3d
* make im2col_3d faster
* gguf: support loading tensors which n_dims > GGML_MAX_DIMS
* fix cuda get_rows
* avoid ggml_conv_3d conflict
* correct GGML_OP_COUNT assertion
* avoid build failure
* avoid build failure on MacOS
* cuda: remove unnecessary MIN define
* fix cpu im2col_3d
* adjust the code style
* cuda: use simpler loop in get_rows
* add test_im2col_3d to test-backend-ops
* test-backend-ops.cpp: remove trailing whitespace
* cpu: im2col_3d support non-contiguous src
Co-authored-by: Jeff Bolz <jbolz@nvidia.com >
* fix test_im2col_3d
* remove unused variables
* cuda: get_rows: dfloat2 -> float2
* add test_pad_ext to test-backend-ops.cpp
* add gguf_init_from_file_ext impl
* Revert "gguf: support loading tensors which n_dims > GGML_MAX_DIMS"
This reverts commit d8377a0a37 .
* Revert "add gguf_init_from_file_ext impl"
This reverts commit d9f1d13208 .
* update ggml_backend_vk_device_supports_op
* fix ggml_backend_vk_device_supports_op
* update other backend supports op for ggml_pad_ext
* metal/opencl/sycl/vulkan: fix GGML_OP_PAD check in supports_op
---------
Co-authored-by: Jeff Bolz <jbolz@nvidia.com >
2025-09-04 10:38:49 +02:00
rmatif
820bc98531
opencl: add hs=40 to FA ( #15758 )
2025-09-03 23:30:28 -07:00
rmatif
97669e4073
opencl: add attn sinks support for FA kernels ( #15706 )
2025-09-01 23:26:53 -07:00
rmatif
86076f92de
OpenCL: add fused group_norm/norm, mul, add ( #15314 )
...
* add fused group_norm/norm, mul, add
* fix spacing
* revert rms_norm logic
* fix trailing whitespace
2025-08-26 23:36:05 -07:00
lhez
f7207b0415
opencl: fix support ops condition for rms_norm ( #15560 )
2025-08-25 14:18:09 -07:00
lhez
fb22dd07a6
opencl: mark argsort unsupported if cols exceed workgroup limit ( #15375 )
2025-08-19 11:25:51 -07:00
rmatif
912ff8c119
OpenCL: add initial FA support ( #14987 )
...
* add F16/F16 fa support
* fix kernel init
* use mad instead of fma
* use inline function
* mark FA with sinks as unsupported for now
* add pragma unroll to loops
2025-08-16 01:05:55 -07:00
lhez
e2c1bfff53
opencl: add initial mxfp4 support via mv ( #15270 )
...
* opencl: add reference `mul_mv_mxfp4_f32`
* opencl: add reference `mul_mv_id` for mxfp4
* Q4_0 transpose fix for Adreno
---------
Co-authored-by: shawngu-quic <shawngu@qti.qualcomm.com >
2025-08-15 09:52:14 -07:00
rmatif
60a7658810
opencl: allow mixed f16/f32 add ( #15140 )
2025-08-12 02:42:41 -07:00
AN Long
cd6983d56d
ggml : fix field name when creating a new ggml_backend ( #14944 )
2025-08-08 14:37:22 +02:00
lhez
aaa3d07ae7
opencl: support sink in soft_max (attn sinks) ( #15152 )
2025-08-07 21:47:03 -07:00
rmatif
756cfea826
fix profiling crash ( #15072 )
2025-08-06 14:17:51 -07:00
lhez
e725a1a982
opencl: add swiglu_oai and add_id ( #15121 )
...
* opencl: add `swiglu-oai`
* opencl: add `add_id`
* opencl: add missing `add_id.cl`
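A scalar sketch of what a swiglu_oai-style fused activation computes, assuming the clamped SwiGLU variant used by gpt-oss (gate clamped from above at `limit`, linear branch clamped to `[-limit, limit]`, sigmoid slope `alpha`); treat the exact formula as an assumption, the authoritative version being the ggml kernels:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Scalar sketch of a fused swiglu_oai-style activation (assumed formula):
//   out = silu_alpha(clamp_hi(gate)) * (clamp(up) + 1)
// Default alpha/limit values are illustrative.
float swiglu_oai(float gate, float up,
                 float alpha = 1.702f, float limit = 7.0f) {
    gate = std::min(gate, limit);                   // clamp gate from above
    up   = std::max(std::min(up, limit), -limit);   // clamp linear branch
    const float silu = gate / (1.0f + std::exp(-alpha * gate));
    return silu * (up + 1.0f);
}
```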
2025-08-06 12:12:17 -07:00
Georgi Gerganov
fd1234cb46
llama : add gpt-oss ( #15091 )
...
* oai moe
* compat with new checkpoint
* add attn sink impl
* add rope scaling yarn
* logits match with latest transformers code
* wip chat template
* rm trailing space
* use ggml_scale_bias
* rm redundant is_swa_all
* convert interleaved gate_up
* graph : fix activation function to match reference (#7 )
* vocab : handle o200k_harmony special tokens
* ggml : add attention sinks support (#1 )
* llama : add attn sinks
* ggml : add attn sinks
* cuda : add attn sinks
* vulkan : add support for sinks in softmax
remove unnecessary return
* ggml : add fused swiglu_oai op (#11 )
* ggml : add fused swiglu_oai op
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* update CUDA impl
* cont : metal impl
* add vulkan impl
* test-backend-ops : more test cases, clean up
* llama : remove unfused impl
* remove extra lines
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
---------
Co-authored-by: slaren <slarengh@gmail.com >
* repack mxfp4 upon conversion
* clean up a bit
* enable thinking
* add quick hack to render only some special tokens
* fix bf16 conversion
* remove vocab hack
* webui ok
* support chat parsing for gpt-oss
* fix webui
* direct mapping mxfp4, FINALLY
* force using mxfp4
* properly use lazy tensor
* ggml : add mxfp4
ggml : use e8m0 conversion instead of powf
Co-authored-by: Diego Devesa <slarengh@gmail.com >
change kvalues_mxfp4 table to match e2m1 (#6 )
metal : remove quantization for now (not used)
cuda : fix disabled CUDA graphs due to ffn moe bias
vulkan : add support for mxfp4
cont : add cm2 dequant
* ggml : add ggml_add_id (#13 )
* ggml : add ggml_add_id
* add cuda impl
* llama : add weight support check for add_id
* perf opt
* add vulkan impl
* rename cuda files
* add metal impl
* allow in-place ggml_add_id
* llama : keep biases on CPU with --cpu-moe
* llama : fix compile error
ggml-ci
* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw
ggml-ci
* cleanup
ggml-ci
* sycl : fix supports_op for MXFP4
ggml-ci
* fix Unknown reasoning format
* ggml-cpu : fix AVX build
ggml-ci
* fix hip build
ggml-ci
* cuda : add mxfp4 dequantization support for cuBLAS
ggml-ci
* ggml-cpu : fix mxfp4 fallback definitions for some architectures
ggml-ci
* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co >
Co-authored-by: slaren <slarengh@gmail.com >
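The attention sinks added throughout this change amount to an extra per-head logit that joins the softmax normalization without contributing any output; a standalone sketch of that softmax variant (illustrative, not the ggml implementation):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Softmax with an attention sink: the sink logit participates in the max and
// the denominator but emits no probability mass toward any value vector, so
// the real attention scores can sum to less than 1.
std::vector<float> softmax_with_sink(std::vector<float> logits, float sink) {
    float m = sink;
    for (float l : logits) m = std::max(m, l);
    float denom = std::exp(sink - m); // sink joins the denominator only
    for (float l : logits) denom += std::exp(l - m);
    for (float& l : logits) l = std::exp(l - m) / denom;
    return logits;
}
```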
2025-08-05 22:10:36 +03:00
lhez
5c0eb5ef54
opencl: fix adreno compiler detection logic ( #15029 )
2025-08-02 19:51:18 +02:00
lhez
1c872f71fb
opencl: add f16 for add, sub, mul, div ( #14984 )
2025-08-01 13:15:44 +02:00
lhez
6e6725459a
opencl: add mul_mat_f32_f32_l4_lm and mul_mat_f16_f32_l4_lm ( #14809 )
2025-07-30 14:56:55 -07:00
lhez
ce111d39d6
opencl: add fused rms_norm_mul ( #14841 )
...
* opencl: add fused `rms_norm` + `mul`
* opencl: improve workgroup size for `rms_norm_mul`
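The fusion here saves a kernel launch and a round trip through memory by folding the weight multiply into the normalization pass; a scalar sketch of the fused computation (not the OpenCL kernel itself):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of fusing rms_norm with the elementwise mul that typically follows
// it: compute the mean of squares once, then apply the normalization and the
// weight in a single pass instead of two separate ops.
std::vector<float> rms_norm_mul(const std::vector<float>& x,
                                const std::vector<float>& w,
                                float eps = 1e-6f) {
    float ss = 0.0f;
    for (float v : x) ss += v * v;
    const float scale = 1.0f / std::sqrt(ss / x.size() + eps);
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); i++) y[i] = x[i] * scale * w[i];
    return y;
}
```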
2025-07-25 17:12:13 +02:00
lhez
8e6f8bc875
opencl: remove unreachable return ( #14806 )
2025-07-22 08:53:30 +02:00
Sigbjørn Skjæret
38d3af1b73
opencl: fix im2col when KW!=KH ( #14803 )
2025-07-21 13:55:10 -07:00
rmatif
6c9ee3b17e
opencl: add conv2d kernel ( #14403 )
...
* add conv2d kernel
* fix trailing whitespace
* whitespace fixe
* handle f16 input and f16 kernel, more opt
* resolve conflicts
* use enqueue_ndrange_kernel
2025-07-21 10:03:19 -07:00
Georgi Gerganov
05fec5bd29
ggml : add build-time message to remind about ggml_set_rows ( #14661 )
...
ggml-ci
2025-07-13 10:36:33 +03:00
rmatif
6bdda13981
opencl: add tiled mul_mat_f16_f32 ( #14535 )
...
* add tiled mul_mat_f16_f32
* fix trailing whitespace
* add insightful comments
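The tiling idea behind a tiled mul_mat kernel can be sketched on the CPU: process the K dimension in fixed-size tiles so each tile of A and B can be reused from fast (local/register) memory before moving on. The tile size below is illustrative, not what the kernel uses:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Tiled matrix multiply sketch: C = A * B with the K dimension processed in
// tiles. On a GPU each tile would be staged in local memory; here the tiling
// only changes the loop structure, not the result.
std::vector<float> mul_mat_tiled(const std::vector<float>& A, // M x K, row-major
                                 const std::vector<float>& B, // K x N, row-major
                                 int M, int K, int N, int TILE = 4) {
    std::vector<float> C((size_t)M * N, 0.0f);
    for (int k0 = 0; k0 < K; k0 += TILE) {
        const int k1 = std::min(k0 + TILE, K);
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++) {
                float acc = 0.0f;
                for (int k = k0; k < k1; k++)
                    acc += A[(size_t)i * K + k] * B[(size_t)k * N + j];
                C[(size_t)i * N + j] += acc;
            }
    }
    return C;
}
```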
2025-07-10 14:58:12 -07:00
lhez
0b8855775c
opencl: add set_rows for f16 and f32 ( #14547 )
...
* opencl: add `set_rows` for `f16` and `f32`
* opencl: better choose workgroup size for `set_rows`
2025-07-10 11:48:52 -07:00
Xuan-Son Nguyen
98bab638fb
ggml : add ggml_scale_bias ( #14417 )
...
* ggml : add ggml_scale_bias
* ggml_vec_mad1_f32
* add more simd
* add CUDA
* sycl
* vulkan
* cann (placeholder)
* opencl
* will this fix cpu?
* fix cuda
* suggestions from coderabbit
* fix cann compile error
* vDSP_vsmsa
* rm __ARM_FEATURE_SVE
* use memcpy for op params
* make code looks more consistent
* use scalar for __ARM_FEATURE_SVE
* add x param to ggml_vec_mad1_f32
2025-07-09 18:16:12 +02:00
Sigbjørn Skjæret
6681688146
opencl: add GELU_ERF ( #14476 )
2025-07-04 23:24:56 -07:00
Sigbjørn Skjæret
28657a8229
ggml : implement GEGLU_ERF and GEGLU_QUICK ops ( #14445 )
2025-07-03 23:07:22 +02:00
lhez
bee28421be
opencl : broadcast for soft_max ( #14510 )
2025-07-03 20:22:24 +02:00
Georgi Gerganov
a70c8a0c4b
kv-cache : use ggml_set_rows ( #14285 )
...
* kv-cache : use ggml_set_rows
ggml-ci
* graph : separate k and v indices
ggml-ci
* cont : remove redundant ifs
ggml-ci
* kv-cache : improve find_slot impl
* kv-cache : bounds-check when accessing slot_info indices
* kv-cache : add comments
ggml-ci
* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends
ggml-ci
2025-07-03 10:53:35 +03:00
zhouwg
307e79d33d
opencl : fix possible buffer overflow in dump_tensor ( #14490 )
2025-07-02 14:38:10 +02:00
Eric Zhang
c8a4e470f6
opencl : skip empty nodes on cgraph compute ( #14491 )
2025-07-02 13:00:04 +02:00
lhez
603e43dc91
opencl : update upscale to support align corners ( #14488 )
2025-07-02 09:07:42 +02:00
lhez
79b33b2317
opencl : add GEGLU, REGLU, SWIGLU ( #14456 )
2025-07-01 09:19:16 +02:00
lhez
73e53dc834
opencl: ref count ggml_backend_opencl_context and refactor profiling ( #14254 )
...
* Move profiling info into `ggml_backend_opencl_context`
* Add `enqueue_ndrange_kernel` to launch kernel
2025-06-24 11:46:25 -07:00
lhez
4c763c8d1b
opencl: add mul_mv_id_q4_0_f32_8x_flat ( #14003 )
2025-06-10 16:55:58 -07:00
lhez
71e74a3ac9
opencl: add backend_synchronize ( #13939 )
...
* This is not needed for normal use, where the result is read
using `tensor_get`, but it allows the perf mode of `test-backend-ops`
to properly measure performance.
2025-06-02 16:54:58 -07:00
rmatif
bfb1e012a0
OpenCL: Add concat, tsembd, upscale, tanh, pad and repeat ( #13840 )
...
* add concat, pad, repeat, tsembd, tanh, upscale
* small fixes
2025-06-02 16:53:36 -07:00
lhez
a3c30846e4
opencl: add new ops - argsort, div, sub, addrows, sigmoid, group_norm ( #13787 )
...
* opencl: add `argsort`
* opencl: add `div`
* opencl: add `add_rows`
* opencl: add `sub`
* opencl: add `sigmoid`, both `f16` and `f32`
* opencl: add `group_norm`
2025-05-27 12:56:08 -07:00
lhez
1701d4c54f
opencl: mark mul_mat f32f32 as supporting non-contiguous tensors ( #13790 )
2025-05-27 12:53:14 -07:00
Henry Linjamäki
a4e8912dfd
opencl: Add support for multiple devices ( #12622 )
...
* opencl: Add support for multiple devices
... but limited to one platform. A platform with a GPU will be preferred.
Additionally:
* Filter out devices that lack capabilities needed by the backend
implementation (half support, OpenCL 2.0+, etc).
* Make ggml_backend_opencl_reg() thread-safe.
* fixup: fix an error in sync_with_other_backends
... when there is only one OpenCL device available.
2025-05-21 16:21:45 -07:00
Henry Linjamäki
edbf42edfd
opencl: fix a couple of crashes ( #12795 )
...
* opencl: fix a couple of crashes
* fix kernel launches that failed on devices which do not support
non-uniform work-groups. When non-uniform work-groups are not
supported, set `local_work_size` to NULL (= let the driver choose the
work-group sizes). This patch does not cover everything - just the
cases tested by test-backend-ops.
* fix sub-buffer creation that failed because `cl_buffer_region::origin`
was not aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.
* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
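The alignment part of the fix comes down to keeping the sub-buffer origin on a `CL_DEVICE_MEM_BASE_ADDR_ALIGN` boundary (the device reports this value in bits); one way to sketch the arithmetic, with an illustrative alignment rather than a queried one:

```cpp
#include <cassert>
#include <cstddef>

// Round an offset down to the previous multiple of `align_bytes`, so a
// cl_buffer_region::origin derived from it satisfies the device alignment.
// How the actual patch adjusts origin/size is in the commit; this only
// illustrates the rounding.
size_t align_down(size_t origin, size_t align_bytes) {
    return origin - (origin % align_bytes);
}
```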
2025-05-21 13:21:17 -07:00
lhez
f0d46ef157
opencl: remove unnecessary assert for add ( #13257 )
2025-05-12 13:13:49 -07:00
kimminsu
12b17501e6
opencl: fix incorrect local_size index in profiling log ( #12868 )
2025-04-16 14:25:57 -07:00
lhez
80f19b4186
opencl: split ggml-opencl.cl into multiple files and cleanup ( #12886 )
...
* opencl: refactor - split the kernel files
---------
Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com >
* opencl: split more kernels into separate files
* opencl: specify subgroup size instead of querying it
* opencl: refine Adreno cl compiler version parsing
* opencl: skip some kernels not used by Adreno on old compilers
* opencl: refine logic for selecting Adreno kernels
* opencl: refine Adreno cl compiler version
* opencl: cleanup preprocessor for kernels
* opencl: consider Adreno CL compiler on Windows
* opencl: add final newline for `mul_mv_f16_f16.cl`
---------
Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com >
2025-04-15 12:26:00 -07:00
lhez
82974011f3
opencl: better identify Adreno GPU ( #12760 )
2025-04-07 13:22:54 -07:00
lhez
97a20c012b
opencl: use max_alloc_size in backend ctx instead of querying again ( #12705 )
2025-04-02 17:01:42 -07:00
Junil Kim
f423981ac8
opencl : fix memory allocation size ( #12649 )
...
issue:
https://github.com/CodeLinaro/llama.cpp/pull/17#issuecomment-2760611283
This patch ensures the memory allocation size
does not exceed the maximum allocation size of the OpenCL device.
2025-04-01 09:54:34 -07:00
lhez
5dec47dcd4
opencl: add multi and vision rope, gelu_quick and im2col ( #12600 )
...
* opencl: add `im2col`
* opencl: add `gelu_quick`
* opencl: add mrope
* opencl: add vision rope
2025-03-27 08:08:08 -07:00
lhez
2b65ae3029
opencl: simplify kernel embedding logic in cmakefile ( #12503 )
...
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com >
2025-03-24 09:20:47 -07:00