Commit Graph

6830 Commits

Author SHA1 Message Date
junchao-zhao
aa719c2f88 ggml : fix loongarch lsx compilation error (#15864) b6580 2025-09-25 12:22:55 +03:00
Johannes Gäßler
4cdd0bb453 docs: fix typo [no ci] (#16244) 2025-09-25 12:12:27 +03:00
Douglas Hanley
b5bd037832 llama : add support for qwen3 reranker (#15824) b6578 2025-09-25 11:53:09 +03:00
Georgi Gerganov
dfcd53f7ec metal : fuse NORM + MUL + ADD, support non-multiples of 4 (#16220)
* metal : fuse NORM + MUL + ADD

* metal : support norms of non-multiple of 4

* cont : fix comment [no ci]
2025-09-25 11:30:16 +03:00
Georgi Gerganov
4ea00794b8 metal : relax reorder conditions (#16216) b6576 2025-09-25 11:29:42 +03:00
Georgi Gerganov
02a6a82ae7 metal : restore im2col perf (#16219) b6575 2025-09-25 11:29:08 +03:00
Radoslav Gerganov
c498fc82fe rpc : use ggml logging facilities
Use RPC_DEBUG environment variable to enable debug messages.
Add helper macro LOG_DBG() which does an early
check of the env var before calling GGML_LOG_DEBUG().
Make sure we log a debug message for every server function.
b6574
2025-09-25 07:20:02 +00:00
Aaron Teo
e7a5130a20 codeowners: add ownership of zdnn backend [no ci] (#16232)
add @Andreas-Krebbel to owners of zDNN backend

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-25 08:06:30 +03:00
Eve
bee378e098 ci: run the x64 and arm ci on the github machines instead (#16183)
* run the x64 ci on regular machines

* set up the same thing for arm

fix test-quantize-perf just like #12306

* try to disable sve

* add another sve run
b6572
2025-09-25 08:06:06 +03:00
Aaron Teo
5fb557653b devops: fix s390x docker release failure (#16231) 2025-09-25 11:36:30 +08:00
Aaron Teo
4ae88d07d0 codeowners: add ownership of zdnn backend [no ci] (#16229)
add @AlekseiNikiforovIBM to owners of zDNN backend

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-25 00:25:04 +08:00
Johannes Gäßler
e789095502 llama: print memory breakdown on exit (#15860)
* llama: print memory breakdown on exit
b6569
2025-09-24 16:53:48 +02:00
Acly
f2a789e334 ggml : split graph allocations according to backend max buffer size (#15815)
* ggml : make gallocr respect the backend's max buffer size

* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max  allocation size in buffer type  interface

* fix missing newline, apple-clang warning

* track size of individual chunks in ggml_dyn_tallocr and raise max chunks.
revert to use suballocation_block_size as max chunk size for vulkan.

* track (chunk, offset) pairs instead of "global" offsets through gallocr.

* simpler, don't need loops to map between local/global offsets
* touches more code

* fix dyn_tallocr_max_size and initialization

* fix memory leak when buffers are reused due to same buffer type appearing multiple times

* make vbuffer allocation follow the same logic as backend_buffer did before

* continue to use leftover unallocated space of previous chunks after a new one has been created

* treat free blocks of each chunk as separate list
* they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size

* refactor: move adding new free block and new chunk into separate functions

* allocate chunks individually with a separate free-blocks list for each one

* needs a bit more memory/allocations/indirections, but code is simpler

* fix warnings (missing static) & debug checks
b6568
2025-09-24 16:17:49 +02:00
Tarek Dakhran
3a59971967 model : add label for LiquidAI LFM2-2.6B model (#16204)
* model : add label for LiquidAI LFM2-2.6B model

HF link: [LiquidAI/LFM2-2.6B](https://huggingface.co/LiquidAI/LFM2-2.6B).

Support for GGUF conversion and inference is added in #14620.

However, due to similar `n_embd`, it identifies as a 1.2B model.
Fix the label by using `n_ff` to identify the model instead.

Output of `llama-bench`:
```
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| lfm2 1.2B F16                  |   2.18 GiB |     1.17 B | CPU        |      10 |           pp512 |        223.97 ± 5.32 |
| lfm2 2.6B F16                  |   4.79 GiB |     2.57 B | CPU        |      10 |           pp512 |         92.53 ± 4.14 |
| lfm2 350M F16                  | 676.25 MiB |   354.48 M | CPU        |      10 |           pp512 |       725.52 ± 11.70 |
| lfm2 700M F16                  |   1.38 GiB |   742.49 M | CPU        |      10 |           pp512 |       336.22 ± 12.93 |
```

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b6567
2025-09-24 13:42:26 +02:00
Jie Fu (傅杰)
63b54c81a6 model-conversion : make causal-verify-logits fails with model names containing "." (#16215)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-24 10:25:26 +02:00
Uilian Ries
152729f884 common : add missing chrono header for common.cpp (#16211)
Signed-off-by: Uilian Ries <uilianries@gmail.com>
b6565
2025-09-24 09:53:47 +03:00
Sigbjørn Skjæret
c0c59c1157 codeowners : match all requirements files (#16214) 2025-09-24 08:53:20 +02:00
Jie Fu (傅杰)
7735706b93 model-conversion : run-org-model.py fails to run on mac m1 (#16213)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-24 08:46:52 +02:00
Daniel Bevenius
4d9ea03d17 codeowners : use slash prefix for root files [no ci] (#16210)
This commit adds a leading slash to the paths of root-level files
in the CODEOWNERS file.

The motivation for this is that these might otherwise match files
in subdirectories that have other/additional owners will override them.

Refs: https://github.com/ggml-org/llama.cpp/pull/16209#issuecomment-3326434274
2025-09-24 08:10:09 +02:00
Jie Fu (傅杰)
8ba548dae2 model-conversion : fix the make targets in the README.md (#16209)
Fix two incorrect make targets in the readme.

Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-24 06:19:23 +02:00
Georgi Gerganov
f505bd83ca ci : disable AMD workflows + update NVIDIA workflows (#16200)
* ci : disable AMD workflows + update NVIDIA workflows

* cont : fixes

* cont : update nvidia vulkan workflows
2025-09-23 20:41:40 +03:00
Georgi Gerganov
0889589dbe ci : enable Vulkan workflow on Mac (#16194) 2025-09-23 13:44:25 +03:00
Xiangyan Sun
4e29084ba4 ggml-cpu: Respect cpumask settings (#16164) b6558 2025-09-23 11:58:12 +03:00
Sigbjørn Skjæret
f6b4af3d04 ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (#15928)
* fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl

* change initialization to true
b6557
2025-09-23 10:25:20 +02:00
Aaron Teo
264f1b5187 zdnn: refactor codebase + add docs (#16178)
* zdnn: initial matmul refactor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rm static from funcs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: update ggml-zdnn.h

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: change header files to hpp

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: switch to common.hpp

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: move mulmat forward around

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rm inline from utils

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: code cleanup

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: add zDNN docs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
b6556
2025-09-23 14:53:05 +08:00
Daniel Bevenius
0bc7cc7154 codeowners : add @danbev to model-conversion example [no ci] (#16190)
This commit adds examples/model-conversion/ to the CODEOWNERS file and
assigns myself (@danbev) as the code owner for this directory.
2025-09-23 09:13:22 +03:00
Aaron Teo
4b9f4cb0f8 devops: add s390x containers (#15915)
* devops: add s390x dockerfile

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add missing ninja

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: move s390x docker into cpu docker

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: rework s390x docker

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: copy more tools

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add server build step

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: remove apt clean steps as distroless misses it

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: remove apt commands from distroless

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix shared libs in distroless

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: use correct libs path

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix shared libs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add collector stage

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix missing stage ref

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix permission issue

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix unknown model loading failures

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: attempt at fixing model loading failure

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix missing ggml shared object

failure to load model

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: remove move shared objects

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: move libggml-cpu and blas into bin

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: finalise hardened server stage

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add cli target

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix typos

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix missing shared libraries in base

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: update debian target

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: formalise llama.cpp loc

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "devops: formalise llama.cpp loc"

This reverts commit 0a7664af84.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: formalise llama.cpp loc

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 0a7664af84)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: attempt at fixing missing dir

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: attempt at making it cache the build

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix copying process

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: make build dir an argument

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "devops: make build dir an argument"

This reverts commit 438698976b.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add build stage for gguf-py

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: move gguf-py installation into build stage

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: break system packages?

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add rust compiler installer

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix rustc not found

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: remove cache mount to allow rustc to persist

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: move rustc installation to another layer

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: move gguf-py installation to full stage, fix copying

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: remove rustc installation in build

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: disable full target for now

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: attempting static build

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: merge s390x dockerfile into cpu for now

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: switch to gcc image for build step

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: remove build essentials

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: install openblas into base target

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: go back to s390x dockerfile

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: remove libggml and libblas

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add full target

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add break system packages

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add libjpeg

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add missing cmake dep

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: finalise docker images for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add custom openblas patch

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: use libopenblas-dev instead of libopenblas-openmp-dev

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add s390x docker build

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-23 13:59:34 +08:00
Daniel Bevenius
85e72271ba ggml-cpu : fix typo in gemm comments [no ci] (#16189) 2025-09-23 05:59:03 +02:00
Gabe Goodhart
1d0125bcf1 feat: Add conversion support in GraniteHybrid for non-hybrid (all attn) (#16177)
This is a configuration of the hparams in the GraniteHybrid architecture
that devolves to the Granite (or GraniteMoe) architecture (ie Granite 3.x).
It may be used for some models in the Granite 4 family with the
GraniteHybrid architecture acting as a superset arch. Rather than support
it directly in the c++ graph, we simply coerce the architecture flag back
to the correct "granite" or "granitemoe" architecture.

Branch: gabe-l-hart/GraniteNonHybridConversion

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-22 20:40:10 +02:00
Haiyue Wang
351f3da39c clang-tidy : disable warning about performance enum size (#16127)
Disable 'performance-enum-size' checking:

Enum 'llama_token_type' uses a larger base type ('unsigned int', size: 4 bytes)
than necessary for its value set, consider using 'std::uint8_t' (1 byte) as the
base type to reduce its size.
2025-09-22 19:57:46 +02:00
Sigbjørn Skjæret
3ecb2f671a ggml : implement set_rows with i32 index (#16159)
* implement set_rows with i32 index

* template fix

* test quantized path

warnings--

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* forgotten name change

* deduplicate cuda/sycl and test-fix

* indent++

* vulkan: support set_rows with i32 index type (#16162)

* disable i32 index for webgpu for now

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
b6550
2025-09-22 19:13:00 +02:00
Georgi Gerganov
432cf4304c codeowners : update + cleanup (#16174)
---------

Co-authored-by: slaren <slarengh@gmail.com>
b6549
2025-09-22 18:20:21 +03:00
Adrien Gallouët
37a23c17bd common : enable --offline mode without curl support (#16137)
* common : use the json parser

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* common : enable --offline mode without CURL support

This change refactors the download logic to properly support offline mode
even when the project is built without CURL.

Without this commit, using `--offline` would give the following error:

    error: built without CURL, cannot download model from the internet

even if all the files are already cached.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b6548
2025-09-22 15:13:51 +03:00
Quentin Bramas
138c87ce8b webui : fix handling incomplete chunks (#16107) 2025-09-22 11:53:13 +03:00
GideonSerf
c6db9a1027 embedding : fix typos in README (#16171) 2025-09-22 11:49:58 +03:00
Haiyue Wang
d05affbab7 common : remove unused local variables (#16140)
These two local variables 'arg' and 'arg_prefix' have been overriden by:

  1. for (const auto & arg : opt.args)

  2. for (int i = 1; i < argc; i++) {
        const std::string arg_prefix = "--";

        std::string arg = argv[i];
b6545
2025-09-22 11:48:42 +03:00
Georgi Gerganov
4f324a556c ggml : extend ggml_can_fuse to work with non-sequential nodes (#16123)
* ggml : extend ggml_can_fuse to work with non-sequential nodes in the graph

* cont : fix wrong bounds check condition

* cont : remove unnecessary overload
b6544
2025-09-22 11:12:37 +03:00
Georgi Gerganov
a71ae3ba7a ggml : add ggml_op_is_empty (#16122)
* ggml : add ggml_op_is_empty

* ggml : move to ggml-impl.h
b6543
2025-09-22 11:12:09 +03:00
Xuan-Son Nguyen
05a2458121 codeowners : update ownership for @ngxson and @allozuar (#16128) 2025-09-22 11:10:58 +03:00
Shin-myoung-serp
96fdca043b Vulkan: add conv_transpose_2d operation (#16022)
* Vulkan: add conv_transpose_2d operation

* Vulkan: fix typo in conv_transpose_2d shader(s0mp, s0L, s1mp, s1L)

* Vulkan: fix incorrect indentation in conv_transpose_2d shader

* Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation

* Vulkan: revert the order of the index calculation and bound check in conv_2d shader

* Vulkan: explicity check push constants limit in supports_op() for conv_transpose_2d operation.

* Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.
b6541
2025-09-22 10:04:01 +02:00
Sigbjørn Skjæret
b2d980fce0 codeowners : claim responsibility for ci, models, gguf-py and convert (#16124)
* claim responsibility for ci, gguf-py and convert

* add myself to various src/llama- files
2025-09-22 10:59:05 +03:00
Georgi Gerganov
5c6106a696 contrib : update roles (#16113)
* contrib : update roles

* contrib : merge PR sections + add link to CI instructions

Updated pull request guidelines for contributors and collaborators, and clarified merging practices for maintainers.
2025-09-22 10:58:02 +03:00
Georgi Gerganov
ec65fb52f0 ci : remove vulkaninfo calls (#16169) 2025-09-22 10:16:05 +03:00
Georgi Gerganov
1d660d2fae ci : use smaller model (#16168)
* ci : switch from gemma to qwen3 0.6b

* ci : use smaller model for some tests
2025-09-22 09:11:39 +03:00
Jeff Bolz
a20d810d79 vulkan: add RTE variants of exp shader (#16165)
This fixes some failures on Turing where "round to zero" rounds to the max f16
value but the CPU reference value is infinite.
b6536
2025-09-22 07:37:17 +02:00
Georgi Gerganov
4d0a7cbc61 ci : adjust params for less runtime (#16167)
* ci : adjust params for less runtime

* ci : gate BF16 on some hardware

* ci : move extra tests to Arm runner
b6535
2025-09-22 08:31:40 +03:00
Ruben Ortlam
9073a73d82 vulkan: vec dot matrix multiplication fix (#16151)
* vulkan: fix matrix multiplication index calculation for odd m/n and odd k in combination with batching

* add odd m/n + odd k test with batching
b6534
2025-09-22 07:22:43 +02:00
lhez
51f5a45fbe opencl: fix concat crash on win arm64 with Adreno (#15944) b6533 2025-09-21 16:42:10 -07:00
lhez
c4510dc937 opencl: initial q8_0 mv support (#15732) b6532 2025-09-21 14:48:44 -07:00
Georgi Gerganov
da30ab5f86 ci : add label for the RISC-V runner (#16150) 2025-09-21 19:00:27 +03:00