Commit Graph

6123 Commits

Author SHA1 Message Date
R0CKSTAR
9b8f3c6c77 musa: fix build warnings (unused variable) (#14869)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
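A minimal sketch of the usual pattern for this class of warning: explicitly discard the value. `GGML_UNUSED` is ggml's macro for this; whether this commit uses it for these particular variables is an assumption.

```cpp
// GGML_UNUSED as defined in ggml; reproduced here for a self-contained sketch
#define GGML_UNUSED(x) (void)(x)

static void launch_sketch(int device_id) {
    const int warp_size = 32;  // referenced only in some build configurations
    GGML_UNUSED(warp_size);    // silences -Wunused-variable in the others
    GGML_UNUSED(device_id);
}
```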
b5995
2025-07-26 10:36:02 +08:00
Aaron Teo
c7f3169cd5 ggml-cpu : disable GGML_NNPA by default due to instability (#14880)
* docs: update s390x document for sentencepiece

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit e086c5e3a7)

* docs: update huggingface links + reword

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 8410b085ea)

* ggml-cpu: disable ggml-nnpa compile flag by default

fixes #14877

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 412f4c7c88)

* docs: update s390x build docs to reflect nnpa disable

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit c1eeae1d0c)

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
b5994
2025-07-25 19:09:03 +02:00
Gabe Goodhart
793c0d7f46 metal: SSM_SCAN performance (#14743)
* feat: Add s_off as a parameter in the args struct

This may not be necessary, but it more closely mirrors the CUDA kernel

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* perf: Parallelize mamba2 SSM_SCAN metal kernel over d_state

This is a first attempt at optimizing the metal kernel. The changes here
are:

- Launch the kernel with a thread group of size d_state
- Use simd groups and shared memory to do the summation for the y
  computation

When tested with the G4 tiny preview, this shows roughly a 3x speedup on
prefill and a 15% speedup on decode.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Update logic to correctly do the multi-layer parallel sum

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Correctly size the shared memory buffer and assert expected size relationships

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Compute block offsets once rather than once per token

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Use local variable for state recursion

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
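The two refactors above (hoisting block offsets out of the per-token loop and keeping the recursive state in a local variable) amount to the following pattern, sketched here in plain C++ with an illustrative generic linear recurrence rather than the actual kernel code; all names are made up for the sketch.

```cpp
#include <cstdint>

void ssm_scan_row_sketch(const float * x, const float * dA, const float * dB,
                         float * state, float * y,
                         int64_t row, int64_t n_tokens, int64_t stride_row) {
    // offset computed once per row, not recomputed inside the token loop
    const float * x_row = x + row * stride_row;

    float s = *state;                      // state recursion kept in a register
    for (int64_t t = 0; t < n_tokens; ++t) {
        s = s * dA[t] + x_row[t] * dB[t];  // recurrence reads the local s
        y[t] = s;
    }
    *state = s;                            // written back to memory once
}
```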

* feat: Use a secondary simd_sum instead of a for loop

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add assertion and comment about relationship between simd size and num simd groups

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
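Taken together, the simd-group summation, the secondary `simd_sum`, and the simd-size assertion above describe a two-stage reduction. A minimal sketch of that pattern in Metal Shading Language (a C++ dialect), assuming a simd width of 32; the buffer layout and all names are illustrative, not the actual SSM_SCAN kernel's.

```cpp
#include <metal_stdlib>
using namespace metal;

// one threadgroup reduces one row; launched with threads_per_threadgroup == d_state
kernel void reduce_sketch(
        device const float * x     [[buffer(0)]],
        device       float * y     [[buffer(1)]],
        threadgroup  float * shmem [[threadgroup(0)]],
        uint tid   [[thread_index_in_threadgroup]],
        uint ntg   [[threads_per_threadgroup]],
        uint sgitg [[simdgroup_index_in_threadgroup]],
        uint tiisg [[thread_index_in_simdgroup]]) {
    // stage 1: each simdgroup reduces its lanes in registers
    const float partial = simd_sum(x[tid]);

    // one partial per simdgroup is staged in threadgroup (shared) memory
    if (tiisg == 0) {
        shmem[sgitg] = partial;
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // stage 2: a second simd_sum over the partials replaces a scalar loop;
    // this only works while the number of simdgroups does not exceed the
    // simd width, which is what the assertion above guards
    if (sgitg == 0) {
        const uint nsg = (ntg + 31) / 32;
        const float v  = tiisg < nsg ? shmem[tiisg] : 0.0f;
        const float total = simd_sum(v);
        if (tiisg == 0) {
            y[0] = total;
        }
    }
}
```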

* feat: Parallelize over d_state for mamba-1

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parallel sum in SSM_CONV

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Revert "feat: Parallel sum in SSM_CONV"

After discussion with @compilade, the amount of parallelism available here
is not worth the added complexity or the overhead of the parallel for.

https://github.com/ggml-org/llama.cpp/pull/14743#discussion_r2223395357

This reverts commit 16bc059660.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Simplify shared memory sizing

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-Authored-By: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5993
2025-07-25 10:47:39 -06:00
lhez
ce111d39d6 opencl: add fused rms_norm_mul (#14841)
* opencl: add fused `rms_norm` + `mul`

* opencl: improve workgroup size for `rms_norm_mul`
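For context, a scalar C++ sketch of what the fused op computes per row: fusing the two kernels means the normalized intermediate is never materialized in global memory, only the final product is stored. The `eps` value and all names are illustrative, not the actual OpenCL kernel.

```cpp
#include <cmath>

void rms_norm_mul_row(const float * x, const float * w, float * y,
                      int n, float eps = 1e-6f) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += x[i] * x[i];
    }
    const float scale = 1.0f / std::sqrt(sum / n + eps);
    for (int i = 0; i < n; ++i) {
        y[i] = x[i] * scale * w[i]; // norm and mul fused into one store
    }
}
```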
b5992
2025-07-25 17:12:13 +02:00
wooksong
e7fecba934 docs : update HOWTO-add-model.md for ModelBase and new model classes (#14874)
This patch updates the example in docs/development/HOWTO-add-model.md to
reflect recent changes after `TextModel` and `MmprojModel` were introduced.

It replaces the outdated `Model` base class with `TextModel` or `MmprojModel`
and updates the registration example accordingly.

Signed-off-by: Wook Song <wook16.song@samsung.com>
2025-07-25 16:25:05 +02:00
Oliver Simons
e2b7621e7c ggml : remove invalid portPos specifiers from dot files (#14838)
Neither "g" nor "x" are valid portPos specifiers per the official
[graphviz documents](https://graphviz.org/docs/attr-types/portPos/):

> If a compass point is used, it must have the form "n","ne","e","se","s","sw","w","nw","c","_".

I tested locally and confirmed that graphviz falls back to the default
portPos specifier when an invalid one is given. As a consequence, we can
remove the associated code.
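A hypothetical emitter (not ggml's actual dot-dump code) illustrating the constraint: after the second colon in a node reference, only the compass points quoted above are valid.

```cpp
#include <cstdio>

void dump_edge(FILE * fp, const char * src, const char * dst) {
    // valid: tail leaves the south side, head enters the north side
    std::fprintf(fp, "\"%s\":s -> \"%s\":n;\n", src, dst);
    // a specifier like ":g" or ":x" is not in the compass-point set,
    // and graphviz falls back to the default port position
}
```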
b5990
2025-07-25 14:29:57 +03:00
Georgi Gerganov
c1dbea752a context : restore preemptive sched reset when LLAMA_SET_ROWS=0 (#14870)
ggml-ci
b5989
2025-07-25 14:28:06 +03:00
kiwi
749e0d27f0 mtmd : fix 32-bit narrowing issue in export-lora and mtmd clip (#14503)
* [fix] Fix 32-bit narrowing issue in export-lora and mtmd clip

* Update export-lora.cpp

* Update clip.cpp

* Update export-lora.cpp

* format: replace tabs with spaces
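A sketch of the class of bug being fixed: ggml tensor dimensions are 64-bit (`int64_t`), so folding them into an `int` silently narrows on large tensors. The function and names below are illustrative, not the actual export-lora or clip.cpp code.

```cpp
#include <cstdint>

int64_t count_elements(const int64_t ne[4]) {
    // before: int n = ne[0] * ne[1] * ne[2] * ne[3];  // 32-bit narrowing
    // after: keep the arithmetic and the result in 64 bits
    return ne[0] * ne[1] * ne[2] * ne[3];
}
```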
b5988
2025-07-25 13:08:04 +02:00
Chris Rohlf
64bf1c3744 rpc : check for null buffers in get/set/copy tensor endpoints (#14868) b5987 2025-07-25 12:17:02 +02:00
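A minimal sketch of the hardening this describes: an RPC endpoint that looks up a buffer from a client-supplied id should reject a null result before dereferencing it. The registry and names here are hypothetical, not the actual rpc server code.

```cpp
#include <cstdint>
#include <map>

struct buffer_t { /* backing storage elided */ };

// hypothetical registry of buffers keyed by client-supplied id
static std::map<uint64_t, buffer_t *> buffers;

bool rpc_set_tensor_sketch(uint64_t buffer_id) {
    auto it = buffers.find(buffer_id);
    if (it == buffers.end() || it->second == nullptr) {
        return false; // malformed request: fail instead of crashing
    }
    // ... proceed with the tensor write using it->second ...
    return true;
}
```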
Diego Devesa
c12bbde372 sched : fix multiple evaluations of the same graph with pipeline parallelism (#14855)
ggml-ci
b5986
2025-07-25 11:07:26 +03:00
R0CKSTAR
3f4fc97f1d musa: upgrade musa sdk to rc4.2.0 (#14498)
* musa: apply mublas API changes

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: update musa version to 4.2.0

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore MUSA graph settings in CMakeLists.txt

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: disable mudnnMemcpyAsync by default

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: switch back to non-mudnn images

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* minor changes

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore rc in docker image tag

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5985
2025-07-24 20:05:37 +01:00
Georgi Gerganov
2df255da3c sync : ggml
ggml-ci
b5984
2025-07-24 20:27:23 +03:00
Kai Pastor
60f816a79d cmake : fix usage issues (ggml/1257)
* CMake config: Create target only once

Fix error on repeated find_package(ggml).
For simplicity, check only for the top-level ggml::ggml.

* CMake config: Add CUDA link libs

* CMake config: Add OpenCL link libs

* CMake config: Use canonical find_dependency

Use set and append to control link lib variables.
Apply more $<LINK_ONLY...>.

* CMake config: Wire OpenMP dependency
2025-07-24 20:27:23 +03:00
Daniel Bevenius
5592f278b6 ggml-cpu : remove stdlib include from repack.cpp (ggml/1276)
This commit removes the inclusion of `<cstdlib>`.

The motivation for this change is that this source file does not seem to
use any functions from this header, and the comment about `qsort` is a
little misleading/confusing.
2025-07-24 20:27:23 +03:00
Georgi Gerganov
e4868d16d2 context : perform output reorder lazily upon access after sync (#14853)
* context : perform output reorder lazily upon access after sync

ggml-ci

* cont : add TODO
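A sketch of the lazy-reorder idea: instead of reordering outputs eagerly after every synchronization, set a flag and pay the cost only when an accessor actually needs the data. The member names are hypothetical, not llama.cpp's actual ones.

```cpp
struct context_sketch {
    bool output_reorder_pending = false;

    void synchronize() {
        // previously: reorder_outputs() was called here unconditionally
        output_reorder_pending = true;
    }

    float * get_logits() {
        if (output_reorder_pending) {
            reorder_outputs();           // deferred until first access
            output_reorder_pending = false;
        }
        return logits;
    }

    void reorder_outputs() { /* permute rows back into batch order */ }
    float * logits = nullptr;
};
```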
b5981
2025-07-24 16:31:48 +03:00
Xuan-Son Nguyen
820de57d4f chat : fix kimi-k2 chat template (#14852) b5980 2025-07-24 13:59:56 +02:00
Aaron Teo
f263f5d9ae ggml-zdnn: fix missing data transform call
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 18:30:10 +08:00
Aaron Teo
1c75ed63e5 ggml-zdnn: fix compiler error missing type
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 18:22:34 +08:00
Aaron Teo
a1d8568c14 ggml-zdnn: impl matmul
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 18:13:07 +08:00
Alberto Cabrera Pérez
cb4a63aad6 sycl: fixed semantics of block offset calculation (#14814) b5979 2025-07-24 11:09:57 +01:00
yummy
86f5623d90 llama : fix MiniCPM inference after Granite Four changes (#14850)
MiniCPM models use the llm_build_granite constructor, which was changed
in the Granite Four PR to use hparams.rope_finetuned instead of a
use_rope parameter. MiniCPM models need rope enabled by default.

This fixes inference, turning the previous gibberish output back into
correct responses.
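A simplified, hypothetical illustration of the regression's shape: the builder now derives rope usage from the hparams field rather than an explicit parameter, so models that expect rope on must have that field set.

```cpp
struct hparams_t { bool rope_finetuned = false; };

void build_granite_sketch(const hparams_t & hparams) {
    // before: an explicit use_rope parameter, true for MiniCPM
    // after:  derived from hparams; MiniCPM must enable rope_finetuned
    const bool use_rope = hparams.rope_finetuned;
    if (use_rope) {
        // ... apply rotary position embeddings to Q and K ...
    }
}
```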
b5978
2025-07-24 11:50:51 +02:00
Pouya
39cffdf188 docs: add libcurl-dev install hint for Linux distros (#14801)
* docs: add libcurl-dev install hint for Linux distros

Signed-off-by: PouyaGhahramanian <PooyaGhahramanian@gmail.com>

* Update docs/build.md

---------

Signed-off-by: PouyaGhahramanian <PooyaGhahramanian@gmail.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-07-24 11:26:44 +02:00
Aaron Teo
59e9805ab0 ggml-zdnn: code clean up
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 16:26:29 +08:00
Aaron Teo
c1653ab639 ggml-zdnn: fix incorrect ztensor shape, reduce memory padding
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 16:22:06 +08:00
Aaron Teo
828519659b ggml-zdnn: update supports_op matmul matrix
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 16:11:37 +08:00
Georgi Gerganov
065908cb09 metal : fix fusion across different encoders (#14849)
* metal : fix fusion across different encoders

ggml-ci

* cont : add assertion

ggml-ci
b5976
2025-07-24 10:24:05 +03:00
Donghyeon Jeong
4ec6291a24 sycl: fix undefined variable in work group size check (#14843) b5975 2025-07-24 12:50:41 +08:00
Aaron Teo
18658b8607 ggml-zdnn: impl init_tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 12:02:20 +08:00
jacekpoplawski
a12363bbf0 convert : text-only support for GLM-4.1V-9B-Thinking (#14823)
* use language_model part only, ignore visual layers

* fix rope_dim calculation
2025-07-23 23:23:57 +02:00
Johannes Gäßler
a86f52b285 CUDA: fix overflow in FA, tune performance (#14840) b5973 2025-07-23 21:43:25 +02:00
Aaron Teo
da2e0e70ba ggml-zdnn: switch buffers back and set to arbitrary number
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 02:31:22 +08:00
Aaron Teo
63fbc45ed6 ggml-zdnn: switch to std vector instead of array
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 01:09:01 +08:00
Aaron Teo
b7f4b6fde3 ggml-zdnn: rework init_tensor to create new buffers
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 01:03:53 +08:00
Aaron Teo
ee0ed78d54 ggml-zdnn: add check for view tensors to prevent init_tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:56:32 +08:00
Aaron Teo
13c64448bd ggml-zdnn: assign tensor->extra to buffer
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:48:32 +08:00
Aaron Teo
13c05872f2 ggml-zdnn: implement at least 1 op to test
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:44:05 +08:00
Aaron Teo
9e84742e72 ggml-zdnn: test ztensor finding in init_tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:40:22 +08:00
Aaron Teo
af9f4f0039 ggml-zdnn: fix compiler warnings and bugfixes
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:25:41 +08:00
Johannes Gäßler
b284197df4 CUDA: fix compilation with GGML_CUDA_F16 (#14837) b5972 2025-07-23 18:22:30 +02:00
Aaron Teo
ae2f656d7e ggml-zdnn: bugfix new impl
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:18:53 +08:00
Aaron Teo
7c6395f826 ggml-zdnn: rewrite the backend implementation
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-24 00:14:45 +08:00
Sigbjørn Skjæret
221c0e0c58 ci : correct label refactor->refactoring (#14832) 2025-07-23 14:27:54 +02:00
Aaron Teo
04ddb2ac95 ggml-zdnn: update op out_prod to use tensor->extra
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-23 19:51:37 +08:00
Aaron Teo
77a753297b ggml-zdnn: support op out_prod
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-23 19:28:51 +08:00
Johannes Gäßler
07a19e27a2 CUDA: fix quantized KV cache + multiple sequences (#14822)
* CUDA: fix quantized KV cache + multiple sequences

* Update ggml/src/ggml-cuda/fattn-common.cuh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5970
2025-07-23 14:08:09 +03:00
Georgi Gerganov
18f3b5ff9e tests : add non-cont K,V FA tests
ggml-ci
2025-07-23 14:08:09 +03:00
l3utterfly
7233358d29 memory : handle saving/loading null layers in recurrent memory (#14675)
* Update llama-memory-recurrent.cpp

handle saving/loading null layers in recurrent memory

* fixed styling issues and updated comments

* fix styling issue

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
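A minimal sketch of one way to save possibly-null layers, assuming a flat file stream: write a presence flag per layer so the loader knows whether state follows. The flag-then-payload layout and all names are hypothetical, not llama.cpp's actual session format.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct layer_state { std::vector<float> s; };

void save_layers(FILE * f, const std::vector<layer_state *> & layers) {
    for (const layer_state * l : layers) {
        const uint8_t present = l != nullptr ? 1 : 0;
        std::fwrite(&present, sizeof(present), 1, f); // flag first
        if (present) {
            const uint64_t n = l->s.size();
            std::fwrite(&n, sizeof(n), 1, f);
            std::fwrite(l->s.data(), sizeof(float), n, f);
        }
        // null layers contribute only the flag; the loader skips them
    }
}
```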
b5968
2025-07-23 11:16:41 +03:00
Aaron Teo
11d58d29de ggml-zdnn: add comments to prevent accidentally deleting lines
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-23 14:54:44 +08:00
lixing-star
6c88b3bb25 ggml: fix loongarch quantize_row_q8_1 error (#14827) b5967 2025-07-23 09:39:51 +03:00
chen fan
14c28dfc50 CANN: weight format to NZ for Ascend310P3 (#14407)
* weight format to nz for 310p

* remove quant weight format to nz

* clean code

* fix

* make the conditions for converting weights to NZ format consistent

* clean code
b5966
2025-07-23 11:58:00 +08:00