Commit Graph

6308 Commits

Author SHA1 Message Date
Aman Gupta
55042b3692 scripts: add sqlite3 check for compare-commits.sh (#15633) 2025-08-28 19:23:22 +08:00
Georgi Gerganov
8a4280ce43 kv-cache : remove LLAMA_SET_ROWS checks (#15505)
ggml-ci
b6307
2025-08-28 12:27:02 +03:00
Aleksei Nikiforov
64387f6e95 gguf-py: byteswapping improvements (#12851)
* gguf-py: implement byteswapping for Q4_0

This is needed to byteswap the Mistral model.

Also restore the original shapes after byteswapping tensors.
This is not needed at the moment, but do it in case
they are used in the future.

* Rework byteswapping code in gguf-py

Move out details from byteswapping tensor blocks code
2025-08-28 16:56:41 +08:00
Joshua Cogliati
d35a1e8c41 cli : change log to warning to explain reason for stopping (#15604)
* Change the log level from debug to warning, to explain the reason for stopping.

* Update tools/main/main.cpp

Fix printing --2

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b6305
2025-08-28 10:48:20 +03:00
Daniel Bevenius
46d9caa27a model-conversion : add mmproj conversion target (#15628)
This commit adds a new target to the Makefile for converting multimodal
models. This target converts the original model and also creates the
mmproj GGUF model.

The motivation for this change is that for multimodal models,
for example those that contain a vision encoder, we will often want to
upload both the quantized model and the vision encoder model to
HuggingFace.

Example usage:
```console
$ make causal-convert-mm-model MODEL_PATH=~/work/ai/models/gemma-3-4b-it-qat-q4_0-unquantized/
...
The environment variable CONVERTED_MODEL can be set to this path using:
export CONVERTED_MODEL=/home/danbev/work/ai/llama.cpp/models/gemma-3-4b-it-qat-q4_0-unquantized.gguf
The mmproj model was created in /home/danbev/work/ai/llama.cpp/models/mmproj-gemma-3-4b-it-qat-q4_0-unquantized.gguf
```
The converted original model can then be quantized, and after that both
the quantized model and the mmproj file can then be uploaded to
HuggingFace.

Refs: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF/tree/main
2025-08-28 09:26:48 +02:00
matiaslin
5a0e3ef6f0 cuda: Add cublasLt_static linking when GGML_STATIC is enabled (#15622)
Prior to this change, we faced undefined cublasLt references when
attempting to compile 'llama-cli' with GGML_STATIC=ON on Linux.

We add linking with CUDA::cublasLt_static when the CUDA version is
greater than 10.1.
b6303
2025-08-28 02:32:36 +02:00
Johannes Gäßler
fbef0fad7a server: higher timeout for tests (#15621) 2025-08-27 20:58:09 +02:00
Georgi Gerganov
da54f9f1a2 presets : add qwen3-30B-a3b FIM (#15616) b6301 2025-08-27 15:48:07 +03:00
uvos
47373271f9 HIP: Enable support for ggml_backend_cuda_register_host_buffer (#15615) b6300 2025-08-27 13:58:54 +02:00
Georgi Gerganov
1bded5a3b3 kv-cache : better estimate of n_kv for multi-sequence batches (#15610)
ggml-ci
b6299
2025-08-27 13:55:12 +03:00
Chenguang Li
1e7489745a CANN: refactor mask handling and improve performance in FA (#15561)
* CANN(flash-attn): refactor mask handling and improve performance

1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode.
2. Optimized performance in non-alibi scenarios by reducing one repeat operation.
3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16.

Signed-off-by: noemotiovon <757486878@qq.com>

* [CANN]: fix review

Signed-off-by: noemotiovon <757486878@qq.com>

* [CANN]: Optimize FA BNSD to BSND

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
b6298
2025-08-27 17:21:41 +08:00
xctan
1cf123a343 ggml-cpu : add basic RVV support for vector f32 ops (#15057)
* ggml-cpu : add basic RVV support for vector f32 ops

* ggml-cpu : add RVV support for f32 softmax
b6297
2025-08-27 16:44:22 +08:00
Daniel Bevenius
fcca2182a1 common : add -m to bash completion for --model [no ci] (#15591)
This commit updates the bash completion script to include the -m
short option for the --model argument.

The motivation for this is that currently tab completion only works for the
full --model option, and it is nice to have it work for the short option
as well.
2025-08-27 10:28:53 +02:00
rmatif
86076f92de OpenCL: add fused group_norm/norm, mul, add (#15314)
* add fused group_norm/norm, mul, add

* fix spacing

* revert rms_norm logic

* fix trailing whitespace
b6295
2025-08-26 23:36:05 -07:00
Diego Devesa
bcbddcd54f tests : fix test-opt with GGML_BACKEND_DL (#15599) b6294 2025-08-26 22:14:38 +02:00
Akarshan Biswas
8b69686136 SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (#15592)
The original implementation unconditionally returned true for this operation, leading to a failure when the tensor's first dimension (ne[0]) was not a multiple of WARP_SIZE. This caused a GGML_ASSERT(ncols % WARP_SIZE == 0) failure in ggml-sycl/norm.cpp.

This change updates the ggml_backend_sycl_device_supports_op check to correctly return true for GGML_OP_RMS_NORM only when the first dimension of the tensor is a multiple of WARP_SIZE, ensuring the operation can be performed without error.
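A minimal sketch of the tightened capability check, assuming a simplified tensor type and a fixed WARP_SIZE of 32 (the real check lives in ggml_backend_sycl_device_supports_op and operates on ggml's tensor struct):
```cpp
#include <cstdint>

constexpr int64_t WARP_SIZE = 32; // assumed sub-group width; the backend queries the device

struct tensor_dims {
    int64_t ne0; // size of the first dimension (row length)
};

// Report support for RMS_NORM only when the row length divides evenly into warps,
// so the kernel's GGML_ASSERT(ncols % WARP_SIZE == 0) can never fire.
bool sycl_supports_rms_norm(const tensor_dims & t) {
    return t.ne0 % WARP_SIZE == 0;
}
```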
b6293
2025-08-27 00:27:49 +05:30
fidoriel
8ce3ff1d91 mtmd : fix mtmd ios build (#15579) b6292 2025-08-26 20:05:50 +02:00
Eve
44b1efa41a tests: add performance test for mul mat id (#15543) b6291 2025-08-26 15:42:49 +00:00
shalinib-ibm
a6a58d6478 llamafile: PowerPC Sgemm Optimization (#15558)
This patch improves GEMM for the FP32 data type on PowerPC.

Implements GEMM on large blocks with configurable block sizes mc, nc, kc
(default: 256, 256, 256); see the loop-structure sketch below.
Packing function optimized to access blocks as per the memory layout.
GEMM optimized to work on larger blocks.
Isolated packing from the GEMM operations for better MMA utilization.

Verified functionality and correctness using llama-cli and a standalone
test case (performs matmul and compares the final matrix C result with the base).

Minor code refactoring changes:
Replaced a macro with an inline function.
Made code indentation consistent at 4 spaces.
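A minimal sketch of the cache-blocked loop structure with mc/nc/kc tiling, assuming plain row-major float matrices (the actual llamafile kernel additionally packs the A/B blocks and dispatches an MMA micro-kernel):
```cpp
#include <algorithm>
#include <cstddef>

// C (MxN) += A (MxK) * B (KxN), row-major, blocked so each tile stays cache-resident.
void sgemm_blocked(const float * A, const float * B, float * C,
                   std::size_t M, std::size_t N, std::size_t K,
                   std::size_t mc = 256, std::size_t nc = 256, std::size_t kc = 256) {
    for (std::size_t i0 = 0; i0 < M; i0 += mc) {
        const std::size_t iend = std::min(i0 + mc, M);
        for (std::size_t k0 = 0; k0 < K; k0 += kc) {
            const std::size_t kend = std::min(k0 + kc, K);
            for (std::size_t j0 = 0; j0 < N; j0 += nc) {
                const std::size_t jend = std::min(j0 + nc, N);
                // In the real patch the A and B blocks are packed into contiguous
                // buffers here, in the layout the MMA micro-kernel expects.
                for (std::size_t i = i0; i < iend; ++i) {
                    for (std::size_t k = k0; k < kend; ++k) {
                        const float a = A[i*K + k];
                        for (std::size_t j = j0; j < jend; ++j) {
                            C[i*N + j] += a * B[k*N + j];
                        }
                    }
                }
            }
        }
    }
}
```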

Performance Testing:

Observed 50% ~ 70% improvement in prompt processing speed measured using
llama-bench with the Meta-Llama3-8B FP32 model. Similar gains observed with
the Mistral-7b-Instruct-v0.3 model.

| model            | Size      | Params | Backend | Threads | Test   | Patch (t/s) | Base (t/s) |
|------------------|-----------|--------|---------|---------|--------|-------------|------------|
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU     | 20      | pp512  | 98.58       | 60.3       |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU     | 20      | pp1024 | 95.88       | 57.36      |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU     | 20      | pp2048 | 85.46       | 53.26      |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU     | 20      | pp4096 | 68.66       | 45.78      |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU     | 20      | pp6144 | 57.35       | 40.44      |

Observed 25% ~ 30% improvement in prompt processing speed with llama-batched-bench
and Meta-Llama3-8B for large prompts (256, 512, 1024, 2048, 4096 tokens) with
various batch sizes (1, 2, 4, 8, 16).

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
b6290
2025-08-26 23:35:25 +08:00
Georgi Gerganov
0373486dbc graph : fix assert in memory-less build_attn (#15590)
ggml-ci
b6289
2025-08-26 17:45:17 +03:00
Daniel Bevenius
62cef26ac5 model-conversion : add qat-q4 quantization targets (#15588)
This commit adds two targets to the Makefile for quantizing
Quantization Aware Trained (QAT) models to the Q4_0 format.

The motivation for this is that these targets set the token embedding and
output tensor data types to Q8_0 instead of the default Q6_K. This is
something that we wish to enforce for QAT Q4_0 models that are to be
uploaded to ggml-org on Huggingface to guarantee the best quality.
2025-08-26 16:12:29 +02:00
Johannes Gäßler
8f5afa94c4 CUDA: return -1 for nonexistent compiled arch (#15587) b6287 2025-08-26 16:01:20 +02:00
Georgi Gerganov
b3964c1e89 metal : optimize FA vec for large sequences and BS <= 8 (#15566)
* metal : optimize FA vec for large heads and sequences

* metal : adjust small-batch mul mv kernels

ggml-ci

* batched-bench : fix total speed computation

ggml-ci

* cont : add comments

ggml-ci
b6286
2025-08-26 14:22:14 +03:00
Xuan-Son Nguyen
79a546220c mtmd : support Kimi VL model (#15458)
* convert : fix tensor naming conflict for llama 4 vision

* convert ok

* support kimi vision model

* clean up

* fix style

* fix calc number of output tokens

* refactor resize_position_embeddings

* add test case

* rename build fn

* correct a small bug
b6285
2025-08-26 12:54:19 +02:00
Georgi Gerganov
85cc1ae998 context : print graph stats for memory-less contexts (#15586)
ggml-ci
b6284
2025-08-26 12:47:00 +03:00
Georgi Gerganov
1d8d83deaa metal : improve MUL_MAT_ID (#15541)
* metal : mul_mm_id remove hdst

* metal : remove mul_mm_id hsrc1

* metal : mul_mm_id simplify + add test

* metal : opt mul_mm_id map0

* metal : optimize mul_mm_id id gathering

* metal : mul/div opt

* metal : optimize mul_mm_id_map0

ggml-ci
b6283
2025-08-26 12:46:15 +03:00
tc-mb
c4e9239064 model : support MiniCPM-V 4.5 (#15575) b6282 2025-08-26 10:05:55 +02:00
Sigbjørn Skjæret
39842a7f73 gguf-py : remove erroneous FFN_GATE entry (#15583) 2025-08-26 09:08:08 +02:00
Sigbjørn Skjæret
0fd90db585 metal : remove contiguous assertion for src0 in IM2COL (#15577)
* remove contiguous assertion for src0 in IM2COL

* add contiguous check in supports_op
b6280
2025-08-26 09:51:43 +03:00
Yoshi_likes_e4
4c37636b3e Add a warning for special devices (#15563)
* Add warning

* Print the device names

* Add newlines

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Fix vector names

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b6279
2025-08-26 08:15:33 +02:00
Jeff Bolz
34bdbbd7c2 vulkan: Remove splitting for mul_mat_id (#15568)
row_ids only needs to hold the BN rows for the current tile.
b6278
2025-08-26 06:42:44 +02:00
Qeeweew
74f52f77f2 CUDA: Accelerate MXFP4 table lookup using __byte_perm (#15451)
* CUDA: optimize get_int_from_table_16

* CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs

* revise documentation

---------

Co-authored-by: xix <xiapc@outlook.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b6277
2025-08-25 23:21:22 +02:00
lhez
f7207b0415 opencl: fix support ops condition for rms_norm (#15560) b6276 2025-08-25 14:18:09 -07:00
Ruben Ortlam
4d917cd4f6 vulkan: fix min subgroup 16 condition for mmid subgroup optimization (#15565) b6275 2025-08-25 17:56:59 +02:00
Jeff Bolz
886b97a5d6 tests: Generate unique input values for count_equal (#15487)
This avoids backend-dependent behavior for argmax that leads to intermittent failures.
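A minimal sketch of the idea, assuming a standalone helper rather than the actual test harness: fill the input with a shuffled permutation so every value is unique and argmax has exactly one correct answer on every backend.
```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Distinct values in random order: 0, 1, ..., n-1, shuffled with a fixed seed
// so the test stays reproducible while never containing ties.
std::vector<float> unique_test_values(std::size_t n, std::uint32_t seed = 42) {
    std::vector<float> v(n);
    std::iota(v.begin(), v.end(), 0.0f);
    std::mt19937 rng(seed);
    std::shuffle(v.begin(), v.end(), rng);
    return v;
}
```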
b6274
2025-08-25 10:47:16 -05:00
Ihar Hrachyshka
111f8d06f0 metal: fix regression when no metal devices are present (#15531) b6273 2025-08-25 18:27:34 +03:00
Johannes Gäßler
5eff6ec9b1 CUDA: MoE helper in device code, better tile sizes (#15525)
* CUDA: MoE helper in device code, better tile sizes

* reduce superfluous CUDA blocks
b6272
2025-08-25 17:23:40 +02:00
Daniel Bevenius
dfd9b5f6c7 model-conversion : set pooling type to none in logits.cpp (#15564)
This commit explicitly sets the pooling type to 'none' in logits.cpp
to support models that have a pooling type specified.

The motivation for this is that some models may have a pooling type set
in the model file (.gguf file), and for this specific case where we only
want to extract logits, we need to ensure that no pooling is used, so
that we are comparing raw logits and not pooled embeddings.
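A minimal sketch of the change's intent, assuming the public llama.h context API (the exact code in logits.cpp may differ):
```cpp
#include "llama.h"

// Create a context that ignores any pooling type stored in the GGUF, so the
// example extracts raw per-token logits instead of pooled embeddings.
static llama_context * make_logits_context(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.pooling_type = LLAMA_POOLING_TYPE_NONE;
    return llama_init_from_model(model, cparams);
}
```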
b6271
2025-08-25 15:00:43 +02:00
Daniel Bevenius
5a6bc6b1a6 model-conversion : add model card template for embeddings [no ci] (#15557)
* model-conversion: add model card template for embeddings [no ci]

This commit adds a separate model card template (model repository
README.md template) for embedding models.

The motivation for this is that the server command for the embedding
model is a little different, and some additional information can be useful
in the model card for embedding models which might not be directly
relevant for causal models.

* squash! model-conversion: add model card template for embeddings [no ci]

Fix pyright lint error.

* remove --pooling override and clarify embd_normalize usage
2025-08-25 14:25:25 +02:00
Georgi Gerganov
6b64f74b55 batched-bench : fix unified KV cache handling + pp timing (#15562)
* batched-bench : fix unified KV cache handling + pp timing

* cont : run dummy token only with split KV cache
b6269
2025-08-25 13:56:43 +03:00
Weizhao Ouyang
0d5a470223 convert : update Ernie 4.5 dense architecture name (#15555)
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
2025-08-25 11:15:06 +02:00
Georgi Gerganov
b0ba31f525 metal : add FA kernels for HS=40 (#15559)
ggml-ci
b6267
2025-08-25 10:14:48 +03:00
RunningLeon
7da9fed0d6 convert : support interns1-mini (#15412)
* support interns1-mini

* fix comment

* update
2025-08-25 08:32:16 +02:00
Chenguang Li
c247d06f38 CANN: ROPE cache sin/cos repeat (#15501)
Signed-off-by: noemotiovon <757486878@qq.com>
b6265
2025-08-25 10:32:21 +08:00
Ruben Ortlam
043fb27d38 vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices (#15524)
* vulkan: use subgroup function for mul_mat_id shader even without coopmat

* vulkan: fix compile warnings

* vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id

* vulkan: disable subgroup mul_mat_id on devices with subgroups < 16
b6264
2025-08-24 19:36:36 +02:00
Georgi Gerganov
b730706a49 kv-cache : support layer reuse (#15504)
* kv-cache : support layer reuse

ggml-ci

* cont : update comments [no ci]
2025-08-24 13:07:07 +03:00
Jeff Bolz
c9a24fb932 vulkan: Support FA with any multiple of 8 head sizes (#15537)
The scalar FA shader already handled multiples of 8. The coopmat1 FA
shader assumed 16x16x16 and the shared memory allocations need the HSK
dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation
requires multiples of 16 for N and K, and needs the matrix dimensions
padded and loads clamped.

Store the FA pipelines in a map, indexed by the pipeline state.
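A minimal sketch of the padding rule described above, assuming a standalone helper (the real change pads the shared-memory allocation sizes inside the Vulkan backend):
```cpp
#include <cstdint>

// Round a dimension up to the granularity the cooperative-matrix path requires.
constexpr uint32_t round_up(uint32_t x, uint32_t granularity) {
    return ((x + granularity - 1) / granularity) * granularity;
}

// e.g. a head size of 40 gets its shared-memory tile padded to round_up(40, 16) == 48
static_assert(round_up(40, 16) == 48, "HSK padding example");
```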
b6262
2025-08-24 11:24:25 +02:00
Ruben Ortlam
a9c6ffcbfa vulkan: enable Conv2D for Apple after MoltenVK fixed the bug (#15526) b6261 2025-08-24 10:48:53 +02:00
Jeff Bolz
e78cf0d4b1 vulkan: workaround MoltenVK compile failure in multi_add (#15506)
* vulkan: workaround MoltenVK compile failure in multi_add

* Update ggml/src/ggml-vulkan/vulkan-shaders/multi_add.comp

Co-authored-by: 0cc4m <picard12@live.de>
2025-08-24 10:48:21 +02:00
Johannes Gäßler
710dfc465a CUDA: fix half2 -> half conversion for HIP (#15529) 2025-08-23 21:37:06 +02:00