Commit Graph

6294 Commits

Author SHA1 Message Date
Diego Devesa
bcbddcd54f tests : fix test-opt with GGML_BACKEND_DL (#15599) b6294 2025-08-26 22:14:38 +02:00
Akarshan Biswas
8b69686136 SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (#15592)
The original implementation unconditionally returned true for this operation, leading to a failure when the tensor's first dimension (ne[0]) was not a multiple of WARP_SIZE. This caused a GGML_ASSERT(ncols % WARP_SIZE == 0) failure in ggml-sycl/norm.cpp.

This change updates the ggml_backend_sycl_device_supports_op check to correctly return true for GGML_OP_RMS_NORM only when the first dimension of the tensor is a multiple of WARP_SIZE, ensuring the operation can be performed without error.
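
A minimal C++ sketch of the kind of check described above (illustrative names only, not the actual ggml_backend_sycl_device_supports_op code):

```cpp
#include <cstdint>

struct tensor_dims { int64_t ne[4]; };   // stand-in for ggml_tensor's ne[]

// only report support when the row length is a whole number of sub-groups;
// otherwise the kernel's GGML_ASSERT(ncols % WARP_SIZE == 0) would fire
static bool supports_rms_norm(const tensor_dims & t, int64_t warp_size) {
    return t.ne[0] % warp_size == 0;
}
```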
b6293
2025-08-27 00:27:49 +05:30
fidoriel
8ce3ff1d91 mtmd : fix mtmd ios build (#15579) b6292 2025-08-26 20:05:50 +02:00
Eve
44b1efa41a tests: add performance test for mul mat id (#15543) b6291 2025-08-26 15:42:49 +00:00
shalinib-ibm
a6a58d6478 llamafile: PowerPC Sgemm Optimization (#15558)
This patch improves GEMM for FP32 Data Type on PowerPC

Implements GEMM on large blocks with configurable block size mc, nc, kc
(default: 256, 256, 256).
Packing Function optimized to access blocks as per memory layout.
GEMM Optimized to work on larger blocks.
Isolated Packing from GEMM Operations for better MMA utilization.
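
A rough, self-contained C++ sketch of the mc/nc/kc blocking described above (illustrative only; the packing step and the MMA micro-kernel of the actual patch are omitted):

```cpp
#include <algorithm>

// C[m x n] += A[m x k] * B[k x n], row-major; the mc/nc/kc tiling keeps the
// working set of A and B cache-resident.
void sgemm_blocked(int m, int n, int k,
                   const float * A, const float * B, float * C,
                   int mc = 256, int nc = 256, int kc = 256) {
    for (int jc = 0; jc < n; jc += nc)           // block of B/C columns
    for (int pc = 0; pc < k; pc += kc)           // block of the shared dimension
    for (int ic = 0; ic < m; ic += mc) {         // block of A/C rows
        const int nb = std::min(nc, n - jc);
        const int kb = std::min(kc, k - pc);
        const int mb = std::min(mc, m - ic);
        // a real implementation packs A[ic.., pc..] and B[pc.., jc..] here,
        // then hands the packed blocks to an MMA-friendly micro-kernel
        for (int i = 0; i < mb; ++i)
        for (int j = 0; j < nb; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < kb; ++p)
                acc += A[(ic + i) * k + (pc + p)] * B[(pc + p) * n + (jc + j)];
            C[(ic + i) * n + (jc + j)] += acc;
        }
    }
}
```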

Verified functionality and correctness using llama-cli and a standalone
test case (performs matmul and compares the final matrix C result with the base).

Minor code refactoring changes:
Replace macro with inline function
Code Indent made consistent with 4 spaces

Performance Testing:

Observed 50% ~ 70% improvement in Prompt Processing Speed measured using
llama-bench with the Meta-Llama3-8B FP32 model. Similar gains observed with the
Mistral-7b-Instruct-v0.3 model.

model             Size       Params   Backend   Threads   Test     Patch (t/s)   Base (t/s)
llama 8B all F32  29.92 GiB  8.03 B   CPU       20        pp512    98.58         60.3
llama 8B all F32  29.92 GiB  8.03 B   CPU       20        pp1024   95.88         57.36
llama 8B all F32  29.92 GiB  8.03 B   CPU       20        pp2048   85.46         53.26
llama 8B all F32  29.92 GiB  8.03 B   CPU       20        pp4096   68.66         45.78
llama 8B all F32  29.92 GiB  8.03 B   CPU       20        pp6144   57.35         40.44

25 ~ 30% improvement in llama-batched-bench with Meta-Llama3-8B in
Prompt Processing Speed for large prompts (256, 512, 1024, 2048, 4096 tokens) with various batch
sizes (1, 2, 4, 8, 16).

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
b6290
2025-08-26 23:35:25 +08:00
Georgi Gerganov
0373486dbc graph : fix assert in memory-less build_attn (#15590)
ggml-ci
b6289
2025-08-26 17:45:17 +03:00
Daniel Bevenius
62cef26ac5 model-conversion : add qat-q4 quantization targets (#15588)
This commit adds two targets to the Makefile for quantizing
Quantization Aware Trained (QAT) models to the Q4_0 format.

The motivation for this is that this sets the token embedding and output
tensor data types to Q8_0 instead of the default Q6_K. This is something
that we wish to enforce for QAT Q4_0 models that are to be uploaded to
ggml-org on Hugging Face to guarantee the best quality.
2025-08-26 16:12:29 +02:00
Johannes Gäßler
8f5afa94c4 CUDA: return -1 for nonexistent compiled arch (#15587) b6287 2025-08-26 16:01:20 +02:00
Georgi Gerganov
b3964c1e89 metal : optimize FA vec for large sequences and BS <= 8 (#15566)
* metal : optimize FA vec for large heads and sequences

* metal : adjust small-batch mul mv kernels

ggml-ci

* batched-bench : fix total speed computation

ggml-ci

* cont : add comments

ggml-ci
b6286
2025-08-26 14:22:14 +03:00
Xuan-Son Nguyen
79a546220c mtmd : support Kimi VL model (#15458)
* convert : fix tensor naming conflict for llama 4 vision

* convert ok

* support kimi vision model

* clean up

* fix style

* fix calc number of output tokens

* refactor resize_position_embeddings

* add test case

* rename build fn

* correct a small bug
b6285
2025-08-26 12:54:19 +02:00
Georgi Gerganov
85cc1ae998 context : print graph stats for memory-less contexts (#15586)
ggml-ci
b6284
2025-08-26 12:47:00 +03:00
Georgi Gerganov
1d8d83deaa metal : improve MUL_MAT_ID (#15541)
* metal : mul_mm_id remove hdst

* metal : remove mul_mm_id hsrc1

* metal : mul_mm_id simplify + add test

* metal : opt mul_mm_id map0

* metal : optimize mul_mm_id id gathering

* metal : mul/div opt

* metal : optimize mul_mm_id_map0

ggml-ci
b6283
2025-08-26 12:46:15 +03:00
tc-mb
c4e9239064 model : support MiniCPM-V 4.5 (#15575) b6282 2025-08-26 10:05:55 +02:00
Sigbjørn Skjæret
39842a7f73 gguf-py : remove erroneous FFN_GATE entry (#15583) 2025-08-26 09:08:08 +02:00
Sigbjørn Skjæret
0fd90db585 metal : remove contiguous assertion for src0 in IM2COL (#15577)
* remove contiguous assertion for src0 in IM2COL

* add contiguous check in supports_op
b6280
2025-08-26 09:51:43 +03:00
Yoshi_likes_e4
4c37636b3e Add a warning for special devices (#15563)
* Add warning

* Print the device names

* Add newlines

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Fix vector names

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b6279
2025-08-26 08:15:33 +02:00
Jeff Bolz
34bdbbd7c2 vulkan: Remove splitting for mul_mat_id (#15568)
row_ids only needs to hold the BN rows for the current tile.
b6278
2025-08-26 06:42:44 +02:00
Qeeweew
74f52f77f2 CUDA: Accelerate MXFP4 table lookup using __byte_perm (#15451)
* CUDA: optimize get_int_from_table_16

* CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs

* revise documentation

---------

Co-authored-by: xix <xiapc@outlook.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b6277
2025-08-25 23:21:22 +02:00
lhez
f7207b0415 opencl: fix support ops condition for rms_norm (#15560) b6276 2025-08-25 14:18:09 -07:00
Ruben Ortlam
4d917cd4f6 vulkan: fix min subgroup 16 condition for mmid subgroup optimization (#15565) b6275 2025-08-25 17:56:59 +02:00
Jeff Bolz
886b97a5d6 tests: Generate unique input values for count_equal (#15487)
This avoids backend-dependent behavior for argmax that leads to intermittent failures.
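
One way to produce such unique inputs, as a hedged illustration rather than the test suite's actual code:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// fill a test buffer with pairwise-distinct values so argmax has a unique
// winner and count_equal cannot depend on the backend's tie-breaking order
std::vector<float> make_unique_inputs(std::size_t n, std::uint32_t seed) {
    std::vector<float> v(n);
    for (std::size_t i = 0; i < n; ++i) {
        v[i] = (float) i;                  // distinct while n fits in a float's mantissa
    }
    std::mt19937 rng(seed);
    std::shuffle(v.begin(), v.end(), rng); // randomize order, keep values unique
    return v;
}
```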
b6274
2025-08-25 10:47:16 -05:00
Ihar Hrachyshka
111f8d06f0 metal: fix regression when no metal devices are present (#15531) b6273 2025-08-25 18:27:34 +03:00
Johannes Gäßler
5eff6ec9b1 CUDA: MoE helper in device code, better tile sizes (#15525)
* CUDA: MoE helper in device code, better tile sizes

* reduce superfluous CUDA blocks
b6272
2025-08-25 17:23:40 +02:00
Daniel Bevenius
dfd9b5f6c7 model-conversion : set pooling type to none in logits.cpp (#15564)
This commit explicitly sets the pooling type to 'none' in logits.cpp
to support models that have a pooling type specified.

The motivation for this is that some models may have a pooling type set
in the model file (.gguf file), and for this specific case where we only
want to extract logits, we need to ensure that no pooling is used so
that we are comparing raw logits and not pooled embeddings.
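
In terms of the public llama.cpp API this amounts to something like the following sketch (assuming llama.h is on the include path; the actual logits.cpp change may differ in detail):

```cpp
#include "llama.h"

// Hedged sketch: override whatever pooling type the GGUF specifies so the
// context yields raw per-token logits instead of pooled embeddings.
static llama_context_params make_logits_ctx_params() {
    llama_context_params cparams = llama_context_default_params();
    cparams.pooling_type = LLAMA_POOLING_TYPE_NONE;  // no pooling -> raw logits
    return cparams;
}
```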
b6271
2025-08-25 15:00:43 +02:00
Daniel Bevenius
5a6bc6b1a6 model-conversion : add model card template for embeddings [no ci] (#15557)
* model-conversion: add model card template for embeddings [no ci]

This commit adds a separate model card template (model repository
README.md template) for embedding models.

The motivation for this is that the server command for the embedding
model is a little different, and some additional information can be useful
in the model card for embedding models, which might not be directly
relevant for causal models.

* squash! model-conversion: add model card template for embeddings [no ci]

Fix pyright lint error.

* remove --pooling override and clarify embd_normalize usage
2025-08-25 14:25:25 +02:00
Georgi Gerganov
6b64f74b55 batched-bench : fix unified KV cache handling + pp timing (#15562)
* batched-bench : fix unified KV cache handling + pp timing

* cont : run dummy token only with split KV cache
b6269
2025-08-25 13:56:43 +03:00
Weizhao Ouyang
0d5a470223 convert : update Ernie 4.5 dense architecture name (#15555)
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
2025-08-25 11:15:06 +02:00
Georgi Gerganov
b0ba31f525 metal : add FA kernels for HS=40 (#15559)
ggml-ci
b6267
2025-08-25 10:14:48 +03:00
RunningLeon
7da9fed0d6 convert : support interns1-mini (#15412)
* support interns1-mini

* fix comment

* update
2025-08-25 08:32:16 +02:00
Chenguang Li
c247d06f38 CANN: ROPE cache sin/cos repeat (#15501)
Signed-off-by: noemotiovon <757486878@qq.com>
b6265
2025-08-25 10:32:21 +08:00
Ruben Ortlam
043fb27d38 vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices (#15524)
* vulkan: use subgroup function for mul_mat_id shader even without coopmat

* vulkan: fix compile warnings

* vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id

* vulkan: disable subgroup mul_mat_id on devices with subgroups < 16
b6264
2025-08-24 19:36:36 +02:00
Georgi Gerganov
b730706a49 kv-cache : support layer reuse (#15504)
* kv-cache : support layer reuse

ggml-ci

* cont : update comments [no ci]
2025-08-24 13:07:07 +03:00
Jeff Bolz
c9a24fb932 vulkan: Support FA with any multiple of 8 head sizes (#15537)
The scalar FA shader already handled multiples of 8. The coopmat1 FA
shader assumed 16x16x16 and the shared memory allocations need the HSK
dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation
requires multiples of 16 for N and K, and needs the matrix dimensions
padded and loads clamped.

Store the FA pipelines in a map, indexed by the pipeline state.
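
A hedged C++ illustration of the two mechanical pieces, padding head sizes up to a multiple of 16 and caching one pipeline per state, with made-up names:

```cpp
#include <cstdint>
#include <map>
#include <tuple>

struct fa_pipeline { /* stand-in for the real Vulkan pipeline handle */ };

// round a head-size dimension up to the multiple of 16 the coopmat paths need
static uint32_t pad16(uint32_t x) { return (x + 15u) & ~15u; }

// hypothetical cache: one pipeline per distinct (padded HSK, padded HSV, flags)
static std::map<std::tuple<uint32_t, uint32_t, bool>, fa_pipeline> fa_pipelines;

static fa_pipeline & get_fa_pipeline(uint32_t hsk, uint32_t hsv, bool small_rows) {
    return fa_pipelines[{pad16(hsk), pad16(hsv), small_rows}];
}
```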
b6262
2025-08-24 11:24:25 +02:00
Ruben Ortlam
a9c6ffcbfa vulkan: enable Conv2D for Apple after MoltenVK fixed the bug (#15526) b6261 2025-08-24 10:48:53 +02:00
Jeff Bolz
e78cf0d4b1 vulkan: workaround MoltenVK compile failure in multi_add (#15506)
* vulkan: workaround MoltenVK compile failure in multi_add

* Update ggml/src/ggml-vulkan/vulkan-shaders/multi_add.comp

Co-authored-by: 0cc4m <picard12@live.de>
2025-08-24 10:48:21 +02:00
Johannes Gäßler
710dfc465a CUDA: fix half2 -> half conversion for HIP (#15529) 2025-08-23 21:37:06 +02:00
Jeff Bolz
611f419cff vulkan: optimize rms_norm, and allow the work to spread across multiple SMs (#15281)
* vulkan: optimize rms_norm, and allow the work to spread across multiple SMs

There are really two parts to this change:
(1) Some optimizations similar to what we have in soft_max, to unroll with
different numbers of iterations.
(2) A fusion optimization where we detect add followed by rms_norm, and make
the add shader atomically accumulate the values^2 into memory. Then the
rms_norm shader can just load that sum. This allows the rms_norm to be
parallelized across multiple workgroups; it just becomes a simple per-element
multiply.

The fusion optimization is currently only applied when the rms_norm is on a
single vector. This previously always ran on a single SM. It could apply more
broadly, but when there are other dimensions the work can already spread across
SMs, and there would be some complexity to tracking multiple atomic sums.

* Change add+rms_norm optimization to write out an array of partial sums
rather than using atomic add, to make it deterministic. The rms_norm
shader fetches a subgroup's worth in parallel and uses subgroupAdd to
add them up (the partial-sum data flow is sketched below).

* complete rebase against fused adds - multi_add shader can also compute partial sums

* fix validation errors

* disable add_rms_fusion for Intel due to possible driver bug

* resolve against #15489, sync after clearing partial sums
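
A plain C++ sketch of that partial-sum data flow (the real work happens in Vulkan shaders; the "workgroup" size here is illustrative):

```cpp
#include <cmath>
#include <vector>

// Stage 1 (conceptually the fused add shader): while computing c = a + b,
// each workgroup-sized slice also writes the sum of squares of its elements.
static void add_with_partials(const std::vector<float> & a, const std::vector<float> & b,
                              std::vector<float> & c, std::vector<float> & partials,
                              size_t wg_size) {
    partials.assign((c.size() + wg_size - 1) / wg_size, 0.0f);
    for (size_t i = 0; i < c.size(); ++i) {
        c[i] = a[i] + b[i];
        partials[i / wg_size] += c[i] * c[i];   // deterministic per-group partial sum
    }
}

// Stage 2 (conceptually the rms_norm shader): reduce the partials once, then
// the normalization is just an embarrassingly parallel per-element multiply.
static void rms_norm_from_partials(std::vector<float> & c, const std::vector<float> & partials,
                                   float eps = 1e-6f) {
    float sum = 0.0f;
    for (float p : partials) sum += p;
    const float scale = 1.0f / std::sqrt(sum / c.size() + eps);
    for (float & x : c) x *= scale;
}
```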
b6258
2025-08-23 13:16:17 -05:00
Piotr Wilkin (ilintar)
b1afcab804 model : add support for Seed-OSS (#15490)
* First draft

* Fix linter errors

* Added missing sinks nullptr

* Don't forget the llama-arch!

* We're through to the generation stage.

* Fix post-attention norm

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix RoPE type

* Fix tensor name and reorder llm_types

* Update gguf-py/gguf/constants.py

Remove nonexistent FFN_POST_NORM tensor

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Add basic chat template

* Add chat template tests

* Remake chat template test

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-chat.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Reorder llm type descriptions

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b6257
2025-08-23 15:21:52 +02:00
Johannes Gäßler
9ef536907d scripts: fix compare-llama-bench.py (#15521) 2025-08-23 13:58:58 +03:00
LaffeyNyaa
21dc4ddaf2 chat : fix debug build assertion in trim function (#15520) b6255 2025-08-23 10:38:30 +02:00
Jeff Bolz
289bf4113e vulkan: Rewrite synchronization to allow some overlap between nodes (#15489)
Track a list of nodes that need synchronization, and only sync if the new node
depends on them (or overwrites them). This allows some overlap which can
improve performance, and centralizes a big chunk of the synchronization logic.

The remaining synchronization logic involves writes to memory other than the
nodes, e.g. for dequantization or split_k. Each of these allocations has a bool
indicating whether it was in use and needs to be synced. This should be
checked before the allocation is written to, and set to true after it is done
being consumed.
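
A hedged, host-side C++ sketch of that bookkeeping (illustrative names; the real logic lives in the Vulkan backend's graph evaluation):

```cpp
#include <unordered_set>
#include <vector>

// stand-in graph node: an id for the buffer it writes and the ids it reads
struct node { int dst; std::vector<int> srcs; };

// buffers written since the last barrier was issued
static std::unordered_set<int> unsynced;

// Decide whether a barrier is needed before running `n`: sync only when `n`
// reads from, or overwrites, a buffer whose write is still pending.
static bool needs_sync_before(const node & n) {
    bool sync = unsynced.count(n.dst) > 0;            // overwrite of a pending result
    for (int s : n.srcs) sync = sync || unsynced.count(s) > 0;
    if (sync) unsynced.clear();                       // the barrier flushes everything so far
    unsynced.insert(n.dst);                           // n's own result is now pending
    return sync;
}
```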
b6254
2025-08-23 09:33:36 +02:00
R0CKSTAR
b55f06e1aa vulkan.Dockerfile: install vulkan SDK using tarball (#15282)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-23 08:58:57 +02:00
Acly
0a9b43e507 vulkan : support ggml_mean (#15393)
* vulkan : support ggml_mean

* vulkan : support sum, sum_rows and mean with non-contiguous tensors

* vulkan : fix subbuffer size not accounting for misalign offset

* tests : add backend-op tests for non-contiguous sum_rows

* cuda : require contiguous src for SUM_ROWS, MEAN support
* sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support

* require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader
2025-08-23 08:35:21 +02:00
Jeff Bolz
330c3d2d21 vulkan: optimize mul_mat_id loading row ids into shared memory (#15427)
- Spread the work across the whole workgroup. Using more threads seems to
far outweigh the synchronization overhead.
- Specialize the code for when the division is by a power of two.
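
The power-of-two specialization mentioned above boils down to replacing a runtime divide/modulo with a shift and mask, roughly (hedged host-side illustration, not the shader code):

```cpp
#include <cstdint>

// general case: split an index into (row, col) with a runtime divisor
static inline void split_generic(uint32_t idx, uint32_t n, uint32_t & row, uint32_t & col) {
    row = idx / n;
    col = idx % n;
}

// specialized case when n is known to be a power of two: the divide and
// modulo become a shift and a mask, which is much cheaper in a shader
static inline void split_pow2(uint32_t idx, uint32_t log2_n, uint32_t & row, uint32_t & col) {
    row = idx >> log2_n;
    col = idx & ((1u << log2_n) - 1u);
}
```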
b6251
2025-08-23 08:31:54 +02:00
Johannes Gäßler
e92734d51b test-opt: allow slight imprecision (#15503) b6250 2025-08-22 23:47:01 +02:00
Reese Levine
45363632cb ggml WebGPU: add support for quantization types (#15440)
* Begin work on set_rows

* Work on set rows

* Add error buffers for reporting unsupported SET_ROWS indices

* Remove extra comments

* Work on templating for different types in shaders

* Work on shader type generation

* Working q4_0 mul_mat and some templating for different types

* Add q4_0_f16 matmul and fix device init

* Add matmul support for basic quantization types

* Add q2_k and q3_k quantization

* Add rest of k-quants

* Get first i-quant working

* Closer to supporting all i-quants

* Support rest of i-quants

* Cleanup code

* Fix python formatting

* debug

* Bugfix for memset

* Add padding to end of buffers on creation

* Simplify bit-shifting

* Update usage of StringView
b6249
2025-08-22 11:28:03 -07:00
Aldehir Rojas
32732f2459 model : gpt-oss add response_format support (#15494) b6248 2025-08-22 11:04:08 -05:00
rmatif
92f7f0a53c ggml: add conv3d op (#15182)
* add conv3d

* bump GGML_OP_COUNT
b6247
2025-08-22 15:33:15 +02:00
Yavor Ivanov
b1ab91821f cuda : add Pad Reflect 1D support (#14659)
* Add Pad Reflect 1D CUDA support

* Update ggml/src/ggml-cuda/pad_reflect_1d.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b6246
2025-08-22 13:06:29 +02:00
Georgi Gerganov
9ebebef62f llama : remove KV cache defragmentation logic (#15473)
ggml-ci
b6245
2025-08-22 12:22:13 +03:00