Commit Graph

6424 Commits

Xuan-Son Nguyen
945e1f12a6 ggml : fix condition of im2col on Metal backend (#15460) 2025-08-21 08:32:26 +03:00
stduhpf
1b0db8f6e0 server : fix webui (#15462)
* Fix webui crash after streaming

* build webui
2025-08-21 08:19:22 +03:00
Daniel Bevenius
29f538ac63 examples : remove references to make in examples [no ci] (#15457)
This commit removes references to `make` in the examples, as the build
system has been updated to use CMake directly and using `make` will now
generate an error since commit 37f10f955f
("make : remove make in favor of CMake (#15449)").
2025-08-21 06:12:28 +02:00
R0CKSTAR
8ad038c0fd musa: add GGML_UNUSED_VARS (#15446)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-21 11:06:05 +08:00
Diego Devesa
5682a3745f sched : copy only the used experts when offloading prompt processing (#15346) 2025-08-21 01:35:28 +02:00
teo
1bc664a26a server: fix OpenAI API compatibility for usage statistics in chat streams (#15444) 2025-08-21 00:10:08 +02:00
Johannes Gäßler
13aeb7aef2 CUDA: refactor FA support/selection code (#15454) b6218 2025-08-20 23:14:14 +02:00
Johannes Gäßler
7a6e91ad26 CUDA: replace GGML_CUDA_F16 with CUDA arch checks (#15433) 2025-08-20 16:58:49 +02:00
Jeff Bolz
fec9519802 vulkan: shorten pipeline name strings (#15431)
These detailed strings were causing increased build time on gcc.
2025-08-20 16:33:14 +02:00
Daniel Bevenius
657b8a77bd chat: handle gpt-oss return/end token inconsistency (#15421)
This commit addresses an inconsistency during inference by adding a new
member to the `templates_params` struct to indicate whether the chat is
in inference mode. This allows the gpt-oss specific function
`common_chat_params_init_gpt_oss` to check this flag and the
`add_generation_prompt` flag to determine if it should replace the
`<|return|>` token with the `<|end|>` token in the prompt.

The motivation for this change is to ensure that the formatted prompt of
past messages in `common_chat_format_single` matches the output of the
formatted new message. The issue is that the gpt-oss template returns
different end tags: `<|return|>` when `add_generation_prompt` is false,
and `<|end|>` when `add_generation_prompt` is true. This causes the
substring function to start at an incorrect position, resulting in
tokenization starting with 'tart|>' instead of '<|start|>'.

Resolves: https://github.com/ggml-org/llama.cpp/issues/15417
b6215
2025-08-20 14:26:01 +02:00
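For illustration, a minimal C++ sketch of the end-tag substitution described above, assuming a hypothetical helper (the actual logic lives in `common_chat_params_init_gpt_oss` and reads these flags from `templates_params`):

```cpp
#include <string>

// Hedged sketch, not the actual implementation: rewrite the trailing
// <|return|> tag to <|end|> so that formatting past messages matches the
// formatting of a new message. `is_inference` and `add_generation_prompt`
// mirror the flags described in the commit message.
static void normalize_gpt_oss_end_tag(std::string & prompt,
                                      bool is_inference,
                                      bool add_generation_prompt) {
    if (!is_inference || add_generation_prompt) {
        return; // the template already emits <|end|> in this case
    }
    const std::string ret_tag = "<|return|>";
    const size_t pos = prompt.rfind(ret_tag);
    if (pos != std::string::npos) {
        prompt.replace(pos, ret_tag.size(), "<|end|>");
    }
}
```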
Jie Fu (傅杰)
ec5ab1a36c common : fix context shift help message (#15448)
Signed-off-by: Jie Fu <jiefu@tencent.com>
b6214
2025-08-20 13:33:30 +03:00
xiaobing318
1a99c2d948 cmake : fix target include directories (#15450)
* Update docker.yml

Modify docker.yml so that the workflow no longer runs periodically; if you want to run the workflow, you can trigger it manually.

* feat: Modify the header file include path

1. There's no llava directory in the tools directory.
2. Because the `mtmd` CMakeLists.txt file uses the command `target_include_directories(mtmd PUBLIC .)`, any target that links against `mtmd` automatically gets the `mtmd` directory as a header search path. Therefore, you can remove `target_include_directories(${TARGET} PRIVATE ../llava)`, or use `target_include_directories(${TARGET} PRIVATE ../mtmd)` to explicitly require the `llama-server` target to use header files from `mtmd`.

* Restore the docker.yml file
b6213
2025-08-20 13:32:05 +03:00
Daniel Bevenius
37f10f955f make : remove make in favor of CMake (#15449)
This commit removes the content from the Makefile and updates the
deprecation message to state that `make` has been replaced by CMake.

The message when `make` is invoked will now be the following:
```console
$ make
Makefile:6: *** Build system changed:
 The Makefile build has been replaced by CMake.

 For build instructions see:
 https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md

.  Stop.
```

The motivation for this is that many, if not all, targets now fail to
build after the recent changes to the build system, and `make` has also
been deprecated for some time.
2025-08-20 13:31:16 +03:00
Georgi Gerganov
2f37014073 lookahead : add sample command to readme (#15447)
* lookahead : add sample command to readme

* cont : build-agnostic command
2025-08-20 13:30:46 +03:00
R0CKSTAR
a094f38143 musa: fix build warnings (#15258)
* musa: fix build warnings

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* fix warning: comparison of integers of different signs: 'const int' and 'unsigned int' [-Wsign-compare]

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b6210
2025-08-20 10:17:37 +08:00
lhez
fb22dd07a6 opencl: mark argsort unsupported if cols exceed workgroup limit (#15375) b6209 2025-08-19 11:25:51 -07:00
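A hedged sketch of the rule this commit enforces (names illustrative; the real check sits in the OpenCL backend's op-support query, with the workgroup limit queried from the device):

```cpp
#include "ggml.h"

// Sketch: the OpenCL argsort kernel sorts one row per workgroup, so a row
// longer than the device's maximum workgroup size cannot be handled and the
// op must be reported as unsupported.
static bool argsort_supported(const struct ggml_tensor * op, size_t max_wg_size) {
    return (size_t) op->src[0]->ne[0] <= max_wg_size;
}
```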
Georgi Gerganov
9ef6b0b835 model : add gpt-oss type strings (#15424) b6208 2025-08-19 19:58:28 +03:00
Gian-Carlo Pascutto
1e19f5d462 common : Add top-nsigma sampler to help globally (#15428)
Fixes #15423.
b6207
2025-08-19 19:58:14 +03:00
Georgi Gerganov
d2fcd91cf9 server : disable context shift by default (#15416)
* server : disable context shift by default

ggml-ci

* server : make scope of test parameters local
2025-08-19 16:46:37 +03:00
SHUAI YANG
a6d3cfe7fa CANN: optimize rope operator (#15335)
* optimize rope ops

* amendment

* delete trailing whitespace

* change the variable name
b6205
2025-08-19 21:28:22 +08:00
R0CKSTAR
67f09a3a27 musa: handle __hgt2_mask, available starting from MUSA SDK rc4.3.0 (#15413)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b6204
2025-08-19 12:33:47 +02:00
Marvin Gießing
6424594c56 ggml-cpu: add mxfp4 VSX intrinsics for Power9+ (ppc64le) hardware (#15385)
* Added VSX intrinsics for Power9+ systems

Signed-off-by: mgiessing <marvin.giessing@gmail.com>

* Manual unrolling for minor perf improvement

Signed-off-by: mgiessing <marvin.giessing@gmail.com>

* Update ggml/src/ggml-cpu/arch/powerpc/quants.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: mgiessing <marvin.giessing@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-08-19 11:54:31 +03:00
Xuan-Son Nguyen
e9288e8869 chat : clarify the meaning of reasoning_format (#15408)
* chat : clarify the meaning of reasoning_format

* add link to this PR
b6202
2025-08-19 10:29:36 +02:00
Georgi Gerganov
9d262f4bad server : remove swa_full warning (#15399) b6201 2025-08-19 08:45:26 +03:00
Georgi Gerganov
f0d3c7405c batched-bench : use rand tokens (#15398) 2025-08-19 08:45:12 +03:00
Xuan-Son Nguyen
f08c4c0d8d mtmd : clean up clip_n_output_tokens (#15391) b6199 2025-08-18 22:53:52 +02:00
Georgi Gerganov
6d7f1117e3 codeowners : remove mmv.* 2025-08-18 22:06:44 +03:00
Georgi Gerganov
60212f1ead sync : ggml 2025-08-18 22:06:44 +03:00
Georgi Gerganov
f0c541d315 scripts : update sync scripts 2025-08-18 22:06:44 +03:00
Sigbjørn Skjæret
baa9255a45 llama : merge conts and reshapes and remove unnecessary cont (#15380)
* remove unnecessary conts and merge reshapes

* restore necessary conts

* merge more conts and reshapes

* merge even more conts and reshapes
b6195
2025-08-18 19:30:17 +02:00
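As an illustration of the kind of simplification involved (a generic sketch, not code from the PR): a `ggml_cont` feeding a reshape can often be merged into a single `ggml_cont_2d` node.

```cpp
#include "ggml.h"

// before: two graph nodes, a contiguous copy followed by a reshape
static struct ggml_tensor * reshape_before(struct ggml_context * ctx,
                                           struct ggml_tensor * x,
                                           int64_t ne0, int64_t ne1) {
    return ggml_reshape_2d(ctx, ggml_cont(ctx, x), ne0, ne1);
}

// after: one node that makes the tensor contiguous with the new shape
static struct ggml_tensor * reshape_after(struct ggml_context * ctx,
                                          struct ggml_tensor * x,
                                          int64_t ne0, int64_t ne1) {
    return ggml_cont_2d(ctx, x, ne0, ne1);
}
```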
Georgi Gerganov
3007baf201 readme : update hot topics (#15397) 2025-08-18 18:11:44 +03:00
davidef
d1d8241600 server : fix incoming tasks not processed in order (#15395) b6193 2025-08-18 17:51:42 +03:00
Dobri Danchev
618575c582 Fix broken build: require updated pip to support --break-system-packages (#15357)
* Revert "devops : fix compile bug when the BASE_CUDA_DEV_CONTAINER is based on Ubuntu 24.04 (#15005)"

This reverts commit e4e915912c.

* devops: Allow pip to modify externally-managed python environment (system installation)

- Updated pip install commands to include the --break-system-packages
  flag, ensuring compatibility when working with system-managed Python
  environments (PEP 668).

- Note: The --break-system-packages option was introduced in 2023.
  Ensure pip is updated to a recent version before using this flag.

fixes [#15004](https://github.com/danchev/llama.cpp/issues/15004)
2025-08-18 12:50:48 +02:00
compilade
f44f793172 ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors (#15379)
* ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors

* ggml-quants : avoid division by zero in make_q3_quants
b6191
2025-08-18 09:23:56 +02:00
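A generic, hedged sketch of the failure mode being fixed (illustrative names, not code from ggml-quants.c): when a block is all zeros, a naive scale computation divides by zero and the resulting inf/NaN propagates into the quantized weights.

```cpp
#include <math.h>

// Sketch: derive an inverse quantization scale from the max-magnitude weight
// in a block, guarding the all-zero case that would otherwise divide by zero.
static float block_inv_scale(const float * x, int n, int nmax) {
    float amax = 0.0f, max = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (fabsf(x[i]) > amax) { amax = fabsf(x[i]); max = x[i]; }
    }
    if (amax == 0.0f) {
        return 0.0f; // all-zero block: skip the division entirely
    }
    return (float) nmax / max; // the naive version divides unconditionally
}
```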
Jeff Bolz
ae532eac2c vulkan: disable spirv-opt for bfloat16 shaders (#15352) b6190 2025-08-18 07:56:29 +02:00
Oleksandr Kuvshynov
e5155e6986 server : export max observed n_past value (#15361)
Add tracking for high-watermark cache usage and make it available in the /metrics endpoint.

Use case: track the largest cache usage needed under a realistic workload,
to better understand memory requirements and be able to adjust the
cache size/quantization for the model accordingly.
b6189
2025-08-18 00:28:58 +02:00
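A minimal sketch of this kind of high-watermark tracking, assuming a hypothetical `observe` hook called as slots process tokens (the actual counter is exported through the server's /metrics endpoint):

```cpp
#include <atomic>

// Sketch: keep the maximum n_past value ever observed so it can be exported
// as a gauge; the CAS loop keeps the update safe under concurrent slots.
struct n_past_watermark {
    std::atomic<int> max_value{0};

    void observe(int n_past) {
        int cur = max_value.load(std::memory_order_relaxed);
        while (n_past > cur &&
               !max_value.compare_exchange_weak(cur, n_past,
                                                std::memory_order_relaxed)) {
            // on failure, cur is refreshed with the current maximum; retry
        }
    }
};
```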
Jeff Bolz
21c17b5bef vulkan: Use larger workgroups for mul_mat_vec when M is small (#15355)
* vulkan: Use larger workgroups for mul_mat_vec when M is small

Also use subgroup instructions for (part of) the reduction when supported.
Without this, the more expensive reductions would eat into the benefits of
the larger workgroups.

* update heuristic for amd/intel

Co-authored-by: 0cc4m <picard12@live.de>

---------

Co-authored-by: 0cc4m <picard12@live.de>
b6188
2025-08-17 18:08:57 +02:00
Dong Won Kim
19f4decae0 vulkan: support sqrt (#15370) b6187 2025-08-17 16:03:09 +02:00
Sigbjørn Skjæret
4d196981d4 convert : force patch_embd weights to F16 or F32 to avoid broken GGUFs (#15367)
* force patch_embd weights to f32

* use MmprojModel base tensor_force_quant instead
2025-08-17 14:47:42 +02:00
Sigbjørn Skjæret
b143fbc87a ci : fix hang in windows-hip build/release (#15365)
* fix hang in windows-latest-cmake-hip

* apply fix to release as well
b6185
2025-08-17 13:30:23 +02:00
Jeff Bolz
de5627910d vulkan: Optimize argsort (#15354)
- Launch an appropriate number of invocations (next larger power of two).
32 invocations is common and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs not.
- Make the code less branchy and [[unroll]] the loops. In the final code,
I see no branches inside the main loop (only predicated stores) when
needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs descending option when
doing the final stores to memory.
- Copy the values into shared memory, which makes them slightly cheaper to access.
b6184
2025-08-17 10:41:45 +02:00
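For the first point, a small hedged sketch of rounding the row length up to the next power of two to choose the invocation count (illustrative; the actual selection happens when the pipeline is dispatched):

```cpp
#include <cstdint>

// Sketch: smallest power of two >= x, so the bitonic-style sorting passes
// line up with the number of invocations launched.
static uint32_t next_pow2(uint32_t x) {
    uint32_t n = 1;
    while (n < x) {
        n <<= 1;
    }
    return n;
}
```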
Tarek Dakhran
65349f26f2 model : support vision LiquidAI LFM2-VL family (#15347)
* wip lfm2 vision model

* Fix conv weight

* Implement dynamic resolution

* Fix cuda

* support LFM2-VL-450M

* happy CI

* Remove extra `ggml_conv` and put others into the right place

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b6183
2025-08-16 23:33:54 +02:00
Jeff Bolz
1fe00296f5 vulkan: fuse adds (#15252)
* vulkan: fuse adds

Fuse adds that have the same shape, which are common in MoE models.
It will currently fuse up to 6 adds, because we assume no more than
8 descriptors per dispatch. But this could be changed.

* check runtimeDescriptorArray feature

* disable multi_add for Intel due to likely driver bug
b6182
2025-08-16 11:48:22 -05:00
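A hedged sketch of the fusibility condition described above (simplified; the real check lives in the Vulkan backend and also accounts for how the fused nodes are wired together):

```cpp
#include "ggml.h"

// Sketch: two consecutive adds can be fused when their shapes match; the
// chain is capped at 6 because each fused input consumes a descriptor and
// the backend assumes at most 8 descriptors per dispatch.
static bool can_fuse_adds(const struct ggml_tensor * a,
                          const struct ggml_tensor * b,
                          int n_fused) {
    return a->op == GGML_OP_ADD && b->op == GGML_OP_ADD &&
           ggml_are_same_shape(a, b) && n_fused < 6;
}
```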
Jeff Bolz
de2192794f vulkan: Support mul_mat_id with f32 accumulators (#15337)
* vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id

* vulkan: Support mul_mat_id with f32 accumulators, but they are not hooked up

- There's no explicit way to request f32 precision for mul_mat_id, but there
probably should be, and this gets the code in place for that.
- A couple fixes to check_results.
- Remove casts to fp16 in coopmat1 FA shader (found by inspection).
b6181
2025-08-16 11:18:31 +02:00
Jeff Bolz
2e2b22ba66 vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id (#15334) b6180 2025-08-16 10:58:38 +02:00
rmatif
912ff8c119 OpenCL: add initial FA support (#14987)
* add F16/F16 fa support

* fix kernel init

* use mad instead of fma

* use inline function

* mark FA with sinks as unsupported for now

* add pragma unroll to loops
b6179
2025-08-16 01:05:55 -07:00
Daniel Bevenius
5e6229a840 common : fix double bos, use common_chat_templates for add_bos and add_eos (#15326)
This commit updates common_chat_templates_apply_jinja to use the
add_bos and add_eos parameters from the chat template instead of the
inputs.

The motivation for this is that if the `add_bos` and `add_eos` from the
input parameters are used, there can be a mismatch between the model and
the chat template, which can prevent the removal of duplicate BOS/EOS
tokens in chat.cpp `apply` from happening, leading to two BOS tokens
being added to the template.
b6178
2025-08-15 19:50:52 +02:00
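As a hedged illustration of the precedence change (names assumed, not the actual code): the flags defined by the chat template win over the caller-supplied inputs, so the template and the model cannot disagree.

```cpp
// Sketch: resolve the effective BOS/EOS flags from the template when it
// defines them, falling back to the caller's inputs otherwise.
struct bos_eos_flags {
    bool add_bos;
    bool add_eos;
};

static bos_eos_flags resolve_flags(bool tmpl_defines_flags,
                                   bool tmpl_add_bos, bool tmpl_add_eos,
                                   bool input_add_bos, bool input_add_eos) {
    if (tmpl_defines_flags) {
        return { tmpl_add_bos, tmpl_add_eos };
    }
    return { input_add_bos, input_add_eos };
}
```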
lhez
e2c1bfff53 opencl: add initial mxfp4 support via mv (#15270)
* opencl: add reference `mul_mv_mxfp4_f32`

* opencl: add reference `mul_mv_id` for mxfp4

* Q4_0 transpose fix for Adreno

---------

Co-authored-by: shawngu-quic <shawngu@qti.qualcomm.com>
b6177
2025-08-15 09:52:14 -07:00
Georgi Gerganov
5edf1592fd vulkan : fix out-of-bounds access in argmax kernel (#15342)
ggml-ci
b6176
2025-08-15 16:16:36 +02:00
Georgi Gerganov
db3010bd23 vulkan : fix compile warnings on macos (#15340)
ggml-ci
b6175
2025-08-15 15:28:28 +02:00