* CUDA: Optimize PAD_REFLECT_1D
feat: add more test cases for PAD_REFLECT_1D
* use fast_div to improve performance
* Apply suggestion from JohannesGaessler
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Apply suggestion from JohannesGaessler
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* optimize
* use a concise expression to further speedup the cuda kernel
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Add paramater buffer pool, batching of submissions, refactor command building/submission
* Add header for linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
* some f32 tests passing
* Disable set_rows until it's implemented
* f32 add all tests passing
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
* Add templated addition, clean up code
* Get addition and multiplication working
* Implement rms_norm
* Add get_rows implementation
* Add new get_rows files
* Refactor use of wg size entry
* Fix compilation
* Try manually unrolled q4_0 quant
* Revert "Try manually unrolled q4_0 quant"
This reverts commit 77f8b96515.
* Move to constant max wg size
* Check for tensor size in supports_op
* Vectorize f32 and change default workgroup size
* Move f32 get_rows from < 4 to % 4 != 0
* fix linter errors
* Add in-place tests
---------
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
* metal : improve naming
* metal : refactor device
ggml-ci
* cont : props
ggml-ci
* metal : apply ggml_mem_ranges_t
ggml-ci
* metal : remove GGML_METAL_USE_BF16
ggml-ci
* metal : refactor device buffer
ggml-ci
* cont : fix naming
* metal : sync before destroying the backend
ggml-ci
* metal : refactor context
ggml-ci
* metal : migrate ggml-metal.m to ggml-metal.cpp
ggml-ci
* metal : adjust ops API
ggml-ci
* metal : use C++ to store piplienes
ggml-ci
* metal : migrate ops to separate functions
ggml-ci
* metal : add ggml_metal_library_t
ggml-ci
* metal : improve naming
ggml-ci
* metal : cleanp
ggml-ci
* metal : add support for GGML_OP_LOG
ggml-ci
* metal : fix error handling
ggml-ci
* CANN: Fix ggml_cann_set_device to avoid redundant device switches
- Added a check to skip aclrtSetDevice if the current device is already set.
- Prevents unnecessary context switches while keeping thread/device consistency.
* CANN: add device default id
* ggml : remove adding extra dim timestep embedding
This commit updates the ggml_timestep_embedding function to no longer
add an extra dimension when the specified dimension is odd.
The motivation for this change is that this introduces an unnecessary
dimension when the dimension is odd, which caused an issue in the
kernels which were not expecting this extra dimension and it resulted in
uninitialized memory for the second to last dimension.
* ggml-cuda : fix padding in timestep embedding kernel
This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
* ggml-metal : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel
* ggml-opencl : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-sycl : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-vulkan : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-cpu : fix padding in timestep embedding function
This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
* fix im2col_3d to respect non-contiguous inputs (views)
The CUDA 3D im2col kernel computed source addresses assuming compact layout (products of dims), ignoring nb[] strides.
This patch switches im2col_3d source indexing to use true strides derived from src1->nb[] (in elements), mirroring the approach used in the 2D CUDA im2col path. Destination indexing is unchanged.
* use ggml_element_size() for src strides
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* SYCL: Add COUNT_EQUAL operator support (rebased on master)
* SYCL: remove duplicate op_count_equal definition
* tests: remove test_count_equal_typed and use test_count_equal for all cases
* tests: keep only I32 case for COUNT_EQUAL as suggested
* tests: keep only I32 case for COUNT_EQUAL as requested
* metal : remove mem pool usage
ggml-ci
* metal : remove mem pool implementation
ggml-ci
* metal : take into account the actual allocated memory of the tensor
ggml-ci
* cont : use ggml_backend_buft_get_alloc_size
ggml-ci
* cont : improve, comments
ggml-ci
* cont : add functions for the extra tensor sizes
* metal : add comments
ggml-ci
* metal : implement .get_alloc_size for the rest of the buffer types
ggml-ci
* metal : remove ggml_metal_heap
ggml-ci
Use this to query register count for shader compiles on NVIDIA. Currently
this is only for performance debug, but it could eventually be used in some
heuristics like split_k.
* metal : refactor bin kernels loading
ggml-ci
* metal : refactor rms kernel loading
ggml-ci
* ci : try to add memory leaks check
ggml-ci
* ci : try to enable memory leak detection for Mac
* cont : seems to be working
* ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type
ggml-backend : add device id to device props
llama : only use iGPU devices if there are no GPU devices
llama : do not use multiple devices from different backends with the same device id
This commit adds a check for GGML_MACHINE_SUPPORTS_i8mm when enabling
MATMUL_INT8 features, ensuring that i8mm intrinsics are only used when
the target hardware actually supports them.
The motivation for this is to fix ggml CI build failures where the
feature detection correctly identifies that i8mm is not supported,
adding the +noi8mm flag, but MATMUL_INT8 preprocessor definitions are
still enabled, causing the compiler to attempt to use vmmlaq_s32
intrinsics without i8mm support.
Refs: https://github.com/ggml-org/ggml/actions/runs/17525174120/job/49909199499
Since the prefill length is not fixed, graphs constructed for the
prefill stage cannot be reused. For this reason, ACL graph
execution is disabled by default during prefill.
* Add fastdiv and fastmodulo to k_bin_bcast kernel
* Address review comments
* `prod_` instead of `prod` suffix
* Add test case for `k_bin_bcast_unravel` in CUDA backend
This commit fixes the zero padding for odd dimensions in
ggml_compute_forward_timestep_embedding_f32.
The motivation for this is that currently if an odd dimension is used,
the padding check incorrectly uses the dimension value for indexing.
For example, with dim=15:
Elements 0-6 are set to cosine values
Elements 7-13 are set to sine values
Element 14 is left uninitialized (contains garbage)
Element 15 is correctly set to zero
This fix changes embed_data[dim] to embed_data[2 * half] so that
element 14 (the first unused element) is properly set to zero as well
as the last element.
Resolves: https://github.com/ggml-org/ggml/issues/1324
* metal : make the backend async
ggml-ci
* cont : add comments, extend op offload, clean up
ggml-ci
* metal : fix batch size for MUL_MAT_ID
* metal : remove deprecated ggml_backend_metal_buffer_from_ptr
* metal : create only metal buffers, no wrapping of host memory
ggml-ci
* metal : restore .alloc_buffer for buffer_from_ptr_type
ggml-ci
* metal : remove broken implementation of GGML_OP_SET
ggml-ci
* metal : clean-up loose ends, ready for tests
ggml-ci
* metal : support both private and shared buffers
ggml-ci
* metal : enable private buffers + add global device queue
* metal : disable host buffer to prevent races
ggml-ci
* metal : avoid extra copy during set_tensor
ggml-ci
* metal : use separate buffer types for shread and private Metal buffers
ggml-ci
* metal : simplify synchronization logic
ggml-ci
* metal : fix build
ggml-ci
* metal : do not implement cpy_tensor
ggml-ci
* metal : separate implementations for shared and private buffers
ggml-ci
* CANN: Add ROPE sin/cos cache for reuse
Introduce sin/cos caching mechanism in ROPE to avoid redundant
computation across layers. The cache is built on the first layer
per device and reused by subsequent layers if parameters match.
- Added sin_cache / cos_cache pointers and position_length tracking
- Introduced cache validity flags and properties:
(ext_factor, theta_scale, freq_scale, attn_factor, is_neox)
- Accelerates ROPE by eliminating repeated sin/cos generation
This change reduces overhead in multi-layer scenarios while
preserving correctness by verifying parameter consistency.
Co-authored-by: hipudding <huafengchun@gmail.com>
* fix typo
Signed-off-by: noemotiovon <757486878@qq.com>
---------
Signed-off-by: noemotiovon <757486878@qq.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
* CANN: implement LRU cache for ACL graphs in CANN backend
- Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects.
- Graphs are loaded on demand and evicted using LRU policy when capacity is exceeded.
- Updated push, move_to_front, and clear methods to manage cached graphs efficiently.
- Ensures reuse of graphs, reducing graph reconstruction overhead in CANN backend.
* fix typo
* The LRU cache capacity can be configured via an env variable
Signed-off-by: noemotiovon <757486878@qq.com>
* refactory acl graph
* refactory && fix review comments
Signed-off-by: noemotiovon <757486878@qq.com>
---------
Signed-off-by: noemotiovon <757486878@qq.com>
* CUDA: Add mul_mat_id support the mmf
Add support for mul_mat_id for bs < 16
* Review: use warp_size, fix should_use_mmf condition
* Launch one block per expert, stride along n_expert_used
* templatize mul_mat_id
* Pad shmem to 16 bytes, add helper function mul_mat_f_switch_ids
* Reduce compile times by dividing mmf into f16, bf16 and f32 variants
* Divide mmf by ncols_dst
* Add missing files
* Fix MUSA/HIP builds
* vulkan: sort graph to allow more parallel execution
Add a backend proc to allow the backend to modify the graph. The
vulkan implementation looks at which nodes depend on each other
and greedily reorders them to group together nodes that don't
depend on each other. It only reorders the nodes, doesn't change
the contents of any of them.
With #15489, this reduces the number of synchronizations needed.
* call optimize_graph per-split
* cuda : fix supports_op condition for get_rows when src1->ne2 > 1
ggml-ci
* ggml : add comment about ggml_get_rows
ggml-ci
* cuda : add FIXME [no ci]
* cuda : update support condition
ggml-ci