Commit Graph

6196 Commits

Author SHA1 Message Date
Georgi Gerganov
f0c541d315 scripts : update sync scripts 2025-08-18 22:06:44 +03:00
Sigbjørn Skjæret
baa9255a45 llama : merge conts and reshapes and remove unnecessary cont (#15380)
* remove unnecessary conts and merge reshapes

* restore necessary conts

* merge more conts and reshapes

* merge even more conts and reshapes
b6195
2025-08-18 19:30:17 +02:00
Georgi Gerganov
3007baf201 readme : update hot topics (#15397) 2025-08-18 18:11:44 +03:00
davidef
d1d8241600 server : fix incoming tasks not process in order (#15395) b6193 2025-08-18 17:51:42 +03:00
Dobri Danchev
618575c582 Fix broken build: require updated pip to support --break-system-packages (#15357)
* Revert "devops : fix compile bug when the BASE_CUDA_DEV_CONTAINER is based on Ubuntu 24.04 (#15005)"

This reverts commit e4e915912c.

* devops: Allow pip to modify externally-managed python environment (system installation)

- Updated pip install commands to include the --break-system-packages
  flag, ensuring compatibility when working with system-managed Python
  environments (PEP 668).

- Note: The --break-system-packages option was introduced in 2023.
  Ensure pip is updated to a recent version before using this flag.

fixes [#15004](https://github.com/danchev/llama.cpp/issues/15004)
2025-08-18 12:50:48 +02:00
compilade
f44f793172 ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors (#15379)
* ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors

* ggml-quants : avoid division by zero in make_q3_quants
b6191
2025-08-18 09:23:56 +02:00
Jeff Bolz
ae532eac2c vulkan: disable spirv-opt for bfloat16 shaders (#15352) b6190 2025-08-18 07:56:29 +02:00
Oleksandr Kuvshynov
e5155e6986 server : export max observed n_past value (#15361)
Add tracking for high watermark cache usage and make it available in /metrics endpoint.

Use-case: Tracking largest needed cache usage under realistic workload
to better understand memory requirements and be able to adjust
cache size/quantization for model/cache accordingly.
b6189
2025-08-18 00:28:58 +02:00
Jeff Bolz
21c17b5bef vulkan: Use larger workgroups for mul_mat_vec when M is small (#15355)
* vulkan: Use larger workgroups for mul_mat_vec when M is small

Also use subgroup instructions for (part of) the reduction when supported.
Without this, the more expensive reductions would eat into the benefits of
the larger workgroups.

* update heuristic for amd/intel

Co-authored-by: 0cc4m <picard12@live.de>

---------

Co-authored-by: 0cc4m <picard12@live.de>
b6188
2025-08-17 18:08:57 +02:00
Dong Won Kim
19f4decae0 vulkan: support sqrt (#15370) b6187 2025-08-17 16:03:09 +02:00
Sigbjørn Skjæret
4d196981d4 convert : force patch_embd weights to F16 or F32 to avoid broken GGUFs (#15367)
* force patch_embd weights to f32

* use MmprojModel base tensor_force_quant instead
2025-08-17 14:47:42 +02:00
Sigbjørn Skjæret
b143fbc87a ci : fix hang in windows-hip build/release (#15365)
* fix hang in windows-latest-cmake-hip

* apply fix to release as well
b6185
2025-08-17 13:30:23 +02:00
Jeff Bolz
de5627910d vulkan: Optimize argsort (#15354)
- Launch an appropriate number of invocations (next larger power of two).
32 invocations is common and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs not.
- Make the code less branchy and [[unroll]] the loops. In the final code,
I see no branches inside the main loop (only predicated stores) when
needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs descending option when
doing the final stores to memory.
- Copy the values into shared memory, makes them slightly cheaper to access.
b6184
2025-08-17 10:41:45 +02:00
Tarek Dakhran
65349f26f2 model : support vision LiquidAI LFM2-VL family (#15347)
* wip lfm2 vision model

* Fix conv weight

* Implement dynamic resolution

* Fix cuda

* support LFM2-VL-450M

* happy CI

* Remove extra `ggml_conv` and put others into the right place

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b6183
2025-08-16 23:33:54 +02:00
Jeff Bolz
1fe00296f5 vulkan: fuse adds (#15252)
* vulkan: fuse adds

Fuse adds that have the same shape, which are common in MoE models.
It will currently fuse up to 6 adds, because we assume no more than
8 descriptors per dispatch. But this could be changed.

* check runtimeDescriptorArray feature

* disable multi_add for Intel due to likely driver bug
b6182
2025-08-16 11:48:22 -05:00
Jeff Bolz
de2192794f vulkan: Support mul_mat_id with f32 accumulators (#15337)
* vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id

* vulkan: Support mul_mat_id with f32 accumulators, but they are not hooked up

- There's no explicit way to request f32 precision for mul_mat_id, but there
probably should be, and this gets the code in place for that.
- A couple fixes to check_results.
- Remove casts to fp16 in coopmat1 FA shader (found by inspection).
b6181
2025-08-16 11:18:31 +02:00
Jeff Bolz
2e2b22ba66 vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id (#15334) b6180 2025-08-16 10:58:38 +02:00
rmatif
912ff8c119 OpenCL: add initial FA support (#14987)
* add F16/F16 fa support

* fix kernel init

* use mad instead of fma

* use inline function

* mark FA with sinks as unsupported for now

* add pragma unroll to loops
b6179
2025-08-16 01:05:55 -07:00
Daniel Bevenius
5e6229a840 common : fix double bos, use common_chat_templates for add_bos and add_eos (#15326)
This commit updates common_chat_templates_apply_jinja to use the
the add_bos and add_eos parameters from the chat template instead of
the inputs.

The motivation for this is that currently if the `add_bos` and `add_eos`
from the input parameters are used it is possible to there will be a
missmatch between the model and the chat template which can lead to the
the removal of duplicate BOS/EOS tokens in chat.cpp `apply` to not
happen leading to two BOS tokens being added to the template.
b6178
2025-08-15 19:50:52 +02:00
lhez
e2c1bfff53 opencl: add initial mxfp4 support via mv (#15270)
* opencl: add reference `mul_mv_mxfp4_f32`

* opencl: add reference `mul_mv_id` for mxfp4

* Q4_0 tranpose fix for Adreno

---------

Co-authored-by: shawngu-quic <shawngu@qti.qualcomm.com>
b6177
2025-08-15 09:52:14 -07:00
Georgi Gerganov
5edf1592fd vulkan : fix out-of-bounds access in argmax kernel (#15342)
ggml-ci
b6176
2025-08-15 16:16:36 +02:00
Georgi Gerganov
db3010bd23 vulkan : fix compile warnings on macos (#15340)
ggml-ci
b6175
2025-08-15 15:28:28 +02:00
Aaron Teo
ff27f80a74 ggml: initial IBM zDNN backend (#14975)
* ggml-zdnn: inital backend impl

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: temp change z17 to arch15

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: fix build bugs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: tensor->extra logging check

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: add layout name mapping, ztensor information

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: separate logging into its own line

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: add shape comparison

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: add ggml_tensor shape log

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: fix incorrect shape logging

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add output buffer check

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: run compute and store into tensor->extra

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add set_tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add more loggers

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: update set_tensor logging to check only for matmul

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: last working matmul version

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add comments to prevent accidentally deleting lines

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: support op out_prod

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: update op out_prod to use tensor->extra

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rewrite the backend implementation

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: bugfix new impl

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix compiler warnings and bugfixes

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: test ztensor finding in init_tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: implement at least 1 op to test

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: assign tensor->extra to buffer

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add check for view tensors to prevent init_tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rework init_tensor to create new buffers

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: switch to std vector instead of array

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: switch buffers back and set to arbitrary number

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: impl init_tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: update supports_op matmul matrix

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix incorrect ztensor shape, reduce memory padding

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: code clean up

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: impl matmul

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix compiler error missing type

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix missing data transform call

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add bias init_tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: tighten memory usage, change string allocation

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add bias ztensor and data free

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add bias data transform

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add more debug info for extra buffer transform

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add logger to check if mat mul ops go through set_tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: activate bias transform in matmul

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: move weights transform into mulmat

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add more safeguards in matmul

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix sequencing of transforms

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: bugfix transform ztensor vs origtensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: figure out why sigtrap is happening

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix sigsegv

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: move everything back to local declaration

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: move bias data to local also

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: bring back working matmul

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rewrite into mre

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix missing vector import

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix missing vector import in header

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: attempt to fix sigsegv

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix missing load tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix invalid ztensor buffer release

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add logging to debug free buffer

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: remove free_buffer debug info

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add parmblkformat detections

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add nnpa installed detection

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add zdnn_init call for static libs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add init_tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: attempt at fixing invalid buffer

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: switch to using deque to fix pointer deref problem

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add weights logging to check

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: attempt to use unique ptr

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add tensor to pre_tfm_desc logging

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add inputs logging

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: disable op_none initialisation for testing

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix missing return from init_tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: load ztensors in cgraph exec

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: work on moving output ztensor as well

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: disable logging and breakpoints for full test

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: attempt at manually changing the layout

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: attempt at using default nwhc format instead

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: disable global load ztensor for now

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix errorenous output load tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add guards to prevent loading ztensor if transformed

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: code cleanup

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: bring load ztensor back to init routine

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: code clean up

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix ztensor deallocation abort

stabilise ggml <-> zdnn api

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: clean up matmul selection

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: clean up project structure

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: update documentation, prepare for upstream

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* chore: add codeowners

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: disable batched matmul

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: attempt at fixing tensor views during matmul

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: deny all view tensors directly

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix pr comments

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: update ops docs for zdnn

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: redo test-backend-ops for ops.md

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix typo in build-s390x.md

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* codeowners: remove taronaeo for now

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "codeowners: remove taronaeo for now"

This reverts commit 411ea4ed78.

* ggml-zdnn: remove unused ggml_zdnn macro

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
b6174
2025-08-15 21:11:22 +08:00
Sigbjørn Skjæret
d3248d9b65 ci : fix ios-xcode-build (#15324)
* fix ios-xcode-build

* use xcode-select with fixed version

* switch to macos-15 to get xcode 16.4
b6173
2025-08-15 14:02:39 +02:00
Diego Devesa
7aeee88cfe ci : move ccache action to ggml-org fork (#15328) 2025-08-15 12:27:02 +02:00
Johannes Gäßler
b07791aa1d test-opt: fix backend support check (#15317)
* test-opt: fix backend support check

* Update tests/test-opt.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-08-15 11:23:17 +02:00
Johannes Gäßler
4227c9be42 CUDA: fix negative KV_max values in FA (#15321) 2025-08-14 23:21:24 +02:00
Georgi Gerganov
df36bce667 eval-callback : stop on first NaN (#15320)
* eval-callback : stop on first NaN

* cont : log error
2025-08-14 22:10:51 +03:00
Diego Devesa
f75b830647 chat : include kwargs in template example (#15309) 2025-08-14 10:28:29 -07:00
Daniel Bevenius
7a0de96045 llama : add 18-layer model type for Gemma 3-270m (#15319)
This commit adds support for the 18-layer model type in the Gemma3
series, which is the size of the Gemma3-270m model.

The motivation for this commit is was the only change required for
Gemma3-270m to be converted to GGUF format and used with llama.cpp.

Once the model has been converted and uploaded to Huggingface it can be
used like this:
```console
$ ./build/bin/llama-cli -hf ggml-org/gemma-3-270m-GGUF:Q8_0
```
2025-08-14 17:56:26 +02:00
simevo
e4e915912c devops : fix compile bug when the BASE_CUDA_DEV_CONTAINER is based on Ubuntu 24.04 (#15005)
fixes #15004

Co-authored-by: Paolo Greppi <paolo.greppi@libpf.com>
2025-08-14 18:45:27 +03:00
uvos
5ba36f6103 HIP: Cleanup hipification header (#15285)
add expicit conversion operator to support older versions of rocm
Switch over to hip_bf16 from legacy hip_bfloat16
Simplify RDNA3 define
Reduce swap over of new hipblas api to rocm 6.5 as this version is used for rocm 7.0 previews

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-14 16:23:56 +02:00
Aldehir Rojas
b204a5a234 gpt-oss: implement harmony parsing (#15181)
* model : add harmony parser for gpt-oss

* gpt-oss : fix grammar trigger from causing empty stack

* gpt-oss: tweak the grammar trigger again

* gpt-oss : add support for recipient in role header

* gpt-oss : fix ungrouped tool calls in grammar

* gpt-oss : loosen function name matching during parse

* gpt-oss : clean up workarounds

* gpt-oss : add template tests

* gpt-oss : simulate thinking and tool call tags

* gpt-oss : undo think tags when reasoning_format is none

* gpt-oss : set special tokens back to user defined

* gpt-oss : update openai-gpt-oss template

* server : filter out harmony thought messages

* gpt-oss : simplify parsing
2025-08-14 17:23:11 +03:00
Christian Kastner
646944cfa8 docker : Enable GGML_CPU_ALL_VARIANTS for ARM (#15267) 2025-08-14 16:22:58 +02:00
Georgi Gerganov
1a01899b61 readme : update hot topics (#15315) 2025-08-14 17:16:03 +03:00
Jeff Bolz
863d341eeb vulkan: perf_logger improvements (#15246)
* vulkan: perf_logger improvements

- Account for batch dimension in flops calculation.
- Fix how "_VEC" is detected for mat_mul_id.
- Fix "n" dimension for mat_mul_id (in case of broadcasting).
- Include a->type in name.

* use <=mul_mat_vec_max_cols rather than ==1
2025-08-14 08:38:10 -05:00
Georgi Gerganov
d32e03f449 server : add SWA checkpoints (#15293)
* server : add SWA checkpoints

ggml-ci

* cont : server clean-up

* server : handle state restore fails

* llama : add extended llama_state_seq_ API

* server : do not make checkpoints if --swa-full

ggml-ci

* llama : remove flags value for NONE

* server : configure number of SWA checkpoints with CLI arg

ggml-ci

* args : fix scope of new argument
2025-08-14 14:59:50 +03:00
Georgi Gerganov
3973163bff sync : ggml
ggml-ci
2025-08-14 14:59:27 +03:00
Jason Ni
5ade3000bd ggml: fix ggml_conv_1d_dw bug (ggml/1323)
* ggml: fix ggml_conv_1d_dw bug

* Fixed conv1d_dw weight tensor dimension.
2025-08-14 14:59:27 +03:00
Georgi Gerganov
8b2483730f tests : remove unused includes (ggml/0) 2025-08-14 14:59:27 +03:00
kallewoof
810b9fc8b9 perplexity : provide a helpful hint for has_cpl case in split_equal error. (#15304)
When attempting to do llama-perplexity on certain tasks which have coupled sequences there is a cryptic error that does not tell you what to do, which is to set the -kvu flag. This adds a hint about that fact.
2025-08-14 14:03:30 +03:00
Sigbjørn Skjæret
4ebd0c125b cuda : fix GGML_CUDA_GRAPHS=OFF (#15300)
* fix USE_CUDA_GRAPH=OFF

ggml-ci

* check capture status

* completely disable capturing check instead
2025-08-14 13:22:07 +03:00
Jonathan Graehl
5cdb27e091 finetune: SGD optimizer, more CLI args (#13873)
* examples/finetune -opt SGD (stochastic gradient descent) memory opt

add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating
m, v tensors.

support finetune.cpp arg -opt SGD (or sgd). (default adamw as before)

llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch)
when using SGD instead of 19gb (55 sec/epoch) using adamw.
(wikipedia 100 lines finetune)

(
using the same GPU memory, adamw can only do before OOM 512
batch/context, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val:   [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00

SGD is superior, though it converges slower, with max before OOM 1728
batch/context (esp see the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val:   [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)

note: when finetuning long enough (or w/ enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting')

-lr-half (halflife) option useful for SGD to avoid oscillation or
super slow underdamped learning (makes setting -lr more forgiving).
terminal -lr for now is set by lr-halvings i.e. if you want at most
1/8 the inital -lr you set -lr-halvings 3.

note: objective loss not directly comparable between adamw, sgd? -
check perplexity or accuracy or consider relative improvements
for convergence

new finetune args -wd 1e-9 to enable weight decay in sgd or adamw,
and max -epochs N (default 2 as before)

cache (1 - wd*alpha) in 'adamw' opt struct -
no noticeable perf benefit, disabled (still done
for new SGD though)

since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params
would probably be able to change between SGD and AdamW with each epoch
but would need to use adamw for the first (unconfirmed - no cmdline arg
to set such a policy yet)

test-opt checks adamw as before and now sgd (except for a few disabled
tests for sgd only; probably just needs logging values and adding
alternate reference values);  tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)

* Vulkan: Implement GGML_OP_OPT_STEP_SGD

* tests: Fix OPT_STEP_SGD test-backend-ops

* SGD op param store weight-decay and not 1-alpha*wd

* minor + cosmetic changes

* fix vulkan sgd

* try CI fix

---------

Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-14 12:03:57 +02:00
kallewoof
3ea913f1ce perplexity: give more information about constraints on failure (#15303)
* perplexity: give more information about constraints on failure

This checks whether -np is insufficient vs context, and provides clues as to how much is needed for each.

* log formatting

* log error and return instead of storing max_seq_exceeded int

* check if s0 is zero for -np check
b6153
2025-08-14 09:16:32 +03:00
uvos
29c8fbe4e0 HIP: bump requirement to rocm 6.1 (#15296) b6152 2025-08-13 20:44:30 +02:00
Bas Nijholt
1adc9812bd fix(nix): remove non-functional llama-cpp cachix cache from flake.nix (#15295)
The flake.nix included references to llama-cpp.cachix.org cache with a comment
claiming it's 'Populated by the CI in ggml-org/llama.cpp', but:

1. No visible CI workflow populates this cache
2. The cache is empty for recent builds (tested b6150, etc.)
3. This misleads users into expecting pre-built binaries that don't exist

This change removes the non-functional cache references entirely, leaving only
the working cuda-maintainers cache that actually provides CUDA dependencies.

Users can still manually add the llama-cpp cache if it becomes functional in the future.
2025-08-13 11:21:31 -07:00
Sigbjørn Skjæret
b3e16665e1 server : enable -td and -tbd parameters (#15172) b6150 2025-08-13 15:43:00 +02:00
Judd
c24f4e2688 ggml : update ggml_rope_multi (#12665)
* update `rope_multi`:

1. add `ggml_rope_multi_inplace`;
1. use `GGML_MROPE_SECTIONS` instead of 4.

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b6149
2025-08-13 13:45:15 +03:00
Copilot
d8914fc47e common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters (#15191)
* Checkpoint from VS Code for coding agent session

* Initial plan

* Fix typo in --override-tensor-draft flag implementation

* Add null termination for speculative tensor buffer overrides

* Apply suggestions from code review

* Apply suggestions from code review

* Extract tensor override parsing logic to common function (addresses @slaren's feedback)

* Apply suggestions from code review

* Apply suggestions

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
b6148
2025-08-13 12:44:40 +02:00
Aldehir Rojas
e885445bc1 server : filter out harmony thought messages (#15278) 2025-08-13 12:28:21 +02:00