* vulkan: optimize UMA buffer operations and fix driver hangs
The previous implementation was blocking the GPU for extended periods,
causing the i915 driver to reset the context due to the hangcheck
protection.
[32628.443070] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:85dffffb, in llama-server [194114]
[32628.443091] i915 0000:00:02.0: [drm] llama-server[194114] context reset due to GPU hang
* vulkan: implement deferred_memset on UMA
---------
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
* ggml : introduce semantic versioning
This commit introduces semantic versioning for the GGML library.
The motivation for this is that the current versioning, using build
numbers, makes it difficult to track changes and releases for projects
that use ggml.
The release steps are the following:
1. Sync the changes from llama.cpp using sync-llama-am.sh and after the
PR has been approved and merged move to step 2.
2. Run scripts/release.sh and specify the type of release, major, minor,
or patch. This script will handle incrementing the version
(major|minor|patch), create a new commit with the version change,
create a tag for the version, and prepare for the next development
iteration.
3. Inspect the commits/tag and push to master. This will trigger the
github release workflow which is triggered for new tags which will
then publish a new release on github.
Example usage:
```console
$ ./scripts/release.sh major --dry-run
[dry-run] - No changes will be made
Step 1: Reading current version...
Current version: 0.9.0-dev
New release version: 1.0.0
Step 2: Updating version in ggml/CMakeLists.txt...
[dry-run] Would update GGML_VERSION_MAJOR to 1
[dry-run] Would update GGML_VERSION_MINOR to 0
[dry-run] Would update GGML_VERSION_PATCH to 0
[dry-run] Would remove -dev suffix
Step 3: Committing version bump...
[dry-run] Would commit: 'ggml : bump version to 1.0.0'
Step 4: Creating git tag...
[dry-run] Would create tag: v1.0.0 with message 'Release version 1.0.0'
Step 5: Preparing for next development cycle...
[dry-run] Would update GGML_VERSION_MINOR to 1
[dry-run] Would add -dev suffix back
Step 6: Committing development version...
[dry-run] Would commit: 'ggml : prepare for development of 1.1.0-dev'
[dry-run] Summary (no changes were made):
• Would have released version: 1.0.0
• Would have created tag: v1.0.0
• Would have set next development version: 1.1.0-dev
```
Refs: https://github.com/ggml-org/ggml/issues/1333
* ggml: create branch for release candidate and check master
* ggml : sign the git tag
* vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions
* use fma instead of dot to fix Nvidia and Apple performance issues
Generalize Linux check to `__linux__` to support non-glibc systems (like musl).
Also, return `false` on unknown/untested OS.
Without this commit, the code compiles (with warnings) but fails:
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) Platinum 8488C)
build: 6487 (51c4cac6) with x86_64-linux-musl-gcc (GCC) 15.1.0 for x86_64-linux-musl (debug)
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
....
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 4B
Illegal instruction (core dumped)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
When compiling with GGML_STATIC=ON, the build process would produce a
binary that was still dynamically linked to OpenMP. This defeats the
purpose of a static build:
$ cmake -B build \
-DBUILD_SHARED_LIBS=OFF \
-DLLAMA_CURL=OFF \
-DGGML_CCACHE=OFF \
-DGGML_NATIVE=OFF \
-DGGML_STATIC=ON
$ ldd llama-server
linux-vdso.so.1 (0x0000e1a434e3b000)
libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000e1a4345a0000)
libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000e1a434300000)
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000e1a434240000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000e1a434200000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000e1a434030000)
/lib/ld-linux-aarch64.so.1 (0x0000e1a434df0000)
This commit resolves the issue by modifying `CMAKE_FIND_LIBRARY_SUFFIXES`
to prioritize `.a` files, forcing CMake to link the static version of
the library.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
- flatten mxfp4 and packed fp4->fp16 bit-wise convert function (replace lut)
- MoE kernel optimizations
---------
Co-authored-by: Li He <lih@qti.qualcomm.com>
* CUDA: Optimize PAD_REFLECT_1D
feat: add more test cases for PAD_REFLECT_1D
* use fast_div to improve performance
* Apply suggestion from JohannesGaessler
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Apply suggestion from JohannesGaessler
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* optimize
* use a concise expression to further speedup the cuda kernel
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Add paramater buffer pool, batching of submissions, refactor command building/submission
* Add header for linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
* some f32 tests passing
* Disable set_rows until it's implemented
* f32 add all tests passing
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
* Add templated addition, clean up code
* Get addition and multiplication working
* Implement rms_norm
* Add get_rows implementation
* Add new get_rows files
* Refactor use of wg size entry
* Fix compilation
* Try manually unrolled q4_0 quant
* Revert "Try manually unrolled q4_0 quant"
This reverts commit 77f8b96515.
* Move to constant max wg size
* Check for tensor size in supports_op
* Vectorize f32 and change default workgroup size
* Move f32 get_rows from < 4 to % 4 != 0
* fix linter errors
* Add in-place tests
---------
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
* metal : improve naming
* metal : refactor device
ggml-ci
* cont : props
ggml-ci
* metal : apply ggml_mem_ranges_t
ggml-ci
* metal : remove GGML_METAL_USE_BF16
ggml-ci
* metal : refactor device buffer
ggml-ci
* cont : fix naming
* metal : sync before destroying the backend
ggml-ci
* metal : refactor context
ggml-ci
* metal : migrate ggml-metal.m to ggml-metal.cpp
ggml-ci
* metal : adjust ops API
ggml-ci
* metal : use C++ to store piplienes
ggml-ci
* metal : migrate ops to separate functions
ggml-ci
* metal : add ggml_metal_library_t
ggml-ci
* metal : improve naming
ggml-ci
* metal : cleanp
ggml-ci
* metal : add support for GGML_OP_LOG
ggml-ci
* metal : fix error handling
ggml-ci
* CANN: Fix ggml_cann_set_device to avoid redundant device switches
- Added a check to skip aclrtSetDevice if the current device is already set.
- Prevents unnecessary context switches while keeping thread/device consistency.
* CANN: add device default id
* ggml : remove adding extra dim timestep embedding
This commit updates the ggml_timestep_embedding function to no longer
add an extra dimension when the specified dimension is odd.
The motivation for this change is that this introduces an unnecessary
dimension when the dimension is odd, which caused an issue in the
kernels which were not expecting this extra dimension and it resulted in
uninitialized memory for the second to last dimension.
* ggml-cuda : fix padding in timestep embedding kernel
This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
* ggml-metal : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel
* ggml-opencl : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-sycl : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-vulkan : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-cpu : fix padding in timestep embedding function
This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
* fix im2col_3d to respect non-contiguous inputs (views)
The CUDA 3D im2col kernel computed source addresses assuming compact layout (products of dims), ignoring nb[] strides.
This patch switches im2col_3d source indexing to use true strides derived from src1->nb[] (in elements), mirroring the approach used in the 2D CUDA im2col path. Destination indexing is unchanged.
* use ggml_element_size() for src strides
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* SYCL: Add COUNT_EQUAL operator support (rebased on master)
* SYCL: remove duplicate op_count_equal definition
* tests: remove test_count_equal_typed and use test_count_equal for all cases
* tests: keep only I32 case for COUNT_EQUAL as suggested
* tests: keep only I32 case for COUNT_EQUAL as requested
* metal : remove mem pool usage
ggml-ci
* metal : remove mem pool implementation
ggml-ci
* metal : take into account the actual allocated memory of the tensor
ggml-ci
* cont : use ggml_backend_buft_get_alloc_size
ggml-ci
* cont : improve, comments
ggml-ci
* cont : add functions for the extra tensor sizes
* metal : add comments
ggml-ci
* metal : implement .get_alloc_size for the rest of the buffer types
ggml-ci
* metal : remove ggml_metal_heap
ggml-ci
Use this to query register count for shader compiles on NVIDIA. Currently
this is only for performance debug, but it could eventually be used in some
heuristics like split_k.
* metal : refactor bin kernels loading
ggml-ci
* metal : refactor rms kernel loading
ggml-ci
* ci : try to add memory leaks check
ggml-ci
* ci : try to enable memory leak detection for Mac
* cont : seems to be working
* ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type
ggml-backend : add device id to device props
llama : only use iGPU devices if there are no GPU devices
llama : do not use multiple devices from different backends with the same device id
This commit adds a check for GGML_MACHINE_SUPPORTS_i8mm when enabling
MATMUL_INT8 features, ensuring that i8mm intrinsics are only used when
the target hardware actually supports them.
The motivation for this is to fix ggml CI build failures where the
feature detection correctly identifies that i8mm is not supported,
adding the +noi8mm flag, but MATMUL_INT8 preprocessor definitions are
still enabled, causing the compiler to attempt to use vmmlaq_s32
intrinsics without i8mm support.
Refs: https://github.com/ggml-org/ggml/actions/runs/17525174120/job/49909199499
Since the prefill length is not fixed, graphs constructed for the
prefill stage cannot be reused. For this reason, ACL graph
execution is disabled by default during prefill.
* Add fastdiv and fastmodulo to k_bin_bcast kernel
* Address review comments
* `prod_` instead of `prod` suffix
* Add test case for `k_bin_bcast_unravel` in CUDA backend