Commit Graph

7039 Commits

Author SHA1 Message Date
Georgi Gerganov
374fe09cdd ggml : use std::sort in ggml_argsort CPU implementation (#17211)
* ggml : use std::sort in ggml_argsort CPU implementation

* cont : add missing header
b7039
2025-11-12 20:43:38 +02:00
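For readers unfamiliar with the term, the idea behind an argsort built on `std::sort` (#17211) can be sketched as follows. This is a minimal illustration, not the code from the commit: the function name `argsort_row` and the `sort_order` enum are hypothetical stand-ins, and the real implementation operates on ggml tensors rather than raw pointers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the sort direction; not the ggml enum.
enum class sort_order { asc, desc };

// Return the indices that would sort data[0..n) in the requested order.
// std::sort is not stable, so the relative order of equal values is unspecified.
std::vector<int32_t> argsort_row(const float * data, int32_t n, sort_order order) {
    assert(n >= 0 && (data != nullptr || n == 0));

    std::vector<int32_t> idx(static_cast<size_t>(n));
    std::iota(idx.begin(), idx.end(), 0);  // 0, 1, 2, ..., n-1

    if (order == sort_order::asc) {
        std::sort(idx.begin(), idx.end(),
                  [data](int32_t a, int32_t b) { return data[a] < data[b]; });
    } else {
        std::sort(idx.begin(), idx.end(),
                  [data](int32_t a, int32_t b) { return data[a] > data[b]; });
    }
    return idx;
}
```

Sorting an index array with a comparator that looks up the values leaves the input untouched, which is what an argsort needs.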
Aleksander Grygier
8e878f0cb4 Update packages + upgrade Storybook to v10 (#17201)
* chore: Update packages + upgrade Storybook to v10

* fix: Increase timeout for UI tests
2025-11-12 19:01:48 +01:00
Xuan-Son Nguyen
00c94083b3 server: (refactor) implement generator-based API for task results (#17174)
* server: (refactor) implement generator-based API for task results

* improve

* moving some code

* fix "Response ended prematurely"

* add sink.done before return false

* rm redundant check

* rm unused var

* rename generator --> reader
b7037
2025-11-12 18:50:52 +01:00
Xuan-Son Nguyen
017eceed61 ci: add check vendor job (#17179)
* ci: add check vendor job

* use dev version of miniaudio

* move to dedicated workflow, only run on related files changed
2025-11-12 14:56:02 +01:00
Xuan-Son Nguyen
ee8dd5c658 server: move res_error/res_ok to static function (#17167) b7035 2025-11-12 14:17:24 +01:00
Alberto Cabrera Pérez
1c398dc9ec ggml-cpu: handle 3d tensors in repack mul_mat (#17030)
* ggml-cpu: handle 3d tensors in repack mul_mat

* Removed unnecessary branch, removed need for <algorithm>

* Fixed dst_ptr pointer in chunk + clang_format

* GGML_ASSERT to check wdata within bounds

* Accidental ggml.h inclusion

* Improved GGML_ASSERT on wdata boundaries
b7034
2025-11-12 14:52:19 +02:00
Adrien Gallouët
52cf111b31 cmake : cleanup (#17199) b7033 2025-11-12 14:48:30 +02:00
Adrien Gallouët
78010a0d52 cmake : move OpenSSL linking to vendor/cpp-httplib (#17177)
* cmake : move OpenSSL linking to vendor/cpp-httplib

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* bring back httplib 0.27.0

* add -DLLAMA_HTTPLIB

* update cmake config for visionos

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b7032
2025-11-12 12:32:50 +01:00
TecJesh
655cddd174 CANN: Add L2_NORM op support (#16856)
* update L2_NORM op support

* remove extra whitespace
b7031
2025-11-12 15:11:42 +08:00
Neo Zhang Jianyu
5da7664960 [SYCL] fix CI crash about SSM_CONV (#17169)
* fix ci crash

* Update ggml-sycl.cpp

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b7030
2025-11-12 14:44:29 +08:00
Raul Torres
23a46ce972 CANN: GGML_CANN_ACL_GRAPH works only when USE_ACL_GRAPH is enabled (#16861)
The documentation should state that `GGML_CANN_ACL_GRAPH` is only effective if `USE_ACL_GRAPH` was enabled at compilation time.
2025-11-12 14:37:52 +08:00
Max Krasnyansky
c273d75375 hexagon: various Op fixes (#17135)
* hexagon: explicitly check for ops with zero nrows

llm_graph_context::build_inp_out_ids() can generate tensors with zero nrows.
Other backends seem to handle this without obvious explicit checks.
In the hexagon case we need to check explicitly and skip them.

* hexagon: introduce fastdiv, fix test-backend-ops for ADD/SUB/MUL

Co-authored-by: chraac <chraac@gmail.com>

* hexagon: use fastdiv in ADD_ID

* hexagon: use ggml_op_is_empty and ggml_is_empty to check for NOPs

---------

Co-authored-by: chraac <chraac@gmail.com>
b7028
2025-11-11 15:25:04 -08:00
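The "fastdiv" mentioned in the hexagon commit above (and in the OpenCL commit further down) refers to replacing integer division by a runtime-constant divisor with a multiply and a shift. The sketch below shows the general technique in portable C++. It is an assumption-laden illustration, not the backend code: the names `fastdiv_vals`, `fastdiv_init`, `fastdiv` and `fastmod` are hypothetical, and the 64-bit intermediate is used here for simplicity.

```cpp
#include <cassert>
#include <cstdint>

// Precomputed constants for one divisor (hypothetical layout).
struct fastdiv_vals {
    uint32_t mp;     // magic multiplier
    uint32_t shift;  // ceil(log2(d))
};

// Computed once per divisor, e.g. on the host before launching a kernel.
fastdiv_vals fastdiv_init(uint32_t d) {
    assert(d != 0);
    uint32_t shift = 0;
    while (shift < 32 && (uint64_t(1) << shift) < d) {
        ++shift;
    }
    // mp = floor(2^32 * (2^shift - d) / d) + 1
    const uint64_t mp = ((uint64_t(1) << 32) * ((uint64_t(1) << shift) - d)) / d + 1;
    return { static_cast<uint32_t>(mp), shift };
}

// n / d using one multiply, one add and one shift.
uint32_t fastdiv(uint32_t n, fastdiv_vals v) {
    const uint64_t hi = (uint64_t(n) * v.mp) >> 32;  // high 32 bits of n * mp
    return static_cast<uint32_t>((hi + n) >> v.shift);
}

// n % d derived from the fast quotient.
uint32_t fastmod(uint32_t n, uint32_t d, fastdiv_vals v) {
    return n - fastdiv(n, v) * d;
}
```

The payoff is in index arithmetic that divides by the same tensor dimension for every element: the constants are computed once, and each element pays only for the multiply and shift.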
Eve
7d019cff74 disable rms norm mul rope for chips with no fp16 rte (#17134) b7027 2025-11-11 12:53:30 -06:00
sudhiarm
3fe36c3238 ci: add Arm-hosted Graviton4 runner (#17021)
* ci: add Arm-hosted Graviton4 runner

* ci: add missing dependencies for graviton4 build

* ci: enable LFS checkout on graviton4

* ci: move git-lfs install to dependencies in Graviton4 workflow
2025-11-11 17:58:05 +02:00
Xuan-Son Nguyen
1d45b4228f vendor: split httplib to cpp/h files (#17150)
* vendor: split httplib to cpp/h files

* move defines

* include httplib if curl is not used

* add TODO

* fix build ios

* fix build visionos instead
b7025
2025-11-11 13:32:58 +01:00
ixgbe
ca4844062b ggml-cpu : add RISC-V RVV (Zvfh) optimization for FP16 to FP32 conversion (#17161)
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
b7024
2025-11-11 13:41:51 +02:00
duduta
73460f6278 ggml-cpu: templateify ggml_compute_forward_rope_f32 and _f16 (#16805)
* extract rotate_pairs logic from ggml_compute_forward_rope_f32

* templateify ggml_compute_forward_rope_f32 and _f16

* abort when rope type not supported, remove GLM from test-rope

* add imrope branch to switch

* add rope tests for perf

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b7023
2025-11-11 13:33:24 +02:00
Charles Xu
8c583242ad kleidiai: add optimized per-channel kernels for Q8_0 (#16993) b7022 2025-11-11 13:20:31 +02:00
Mike Abbott
4a5b8aff40 cmake : add version to all shared object files (#17091)
When compiling llama.cpp in Yocto, it fails QA checks because the generated .so files aren't versioned. This applies a version to all generated .so files, allowing the package to build without errors.
b7021
2025-11-11 13:19:50 +02:00
Nicolas B. Pierron
d2d626938a Install rpc-server when GGML_RPC is ON. (#17149) b7020 2025-11-11 10:53:59 +00:00
levkropp
2fc392ce35 convert : register UMT5Model architecture for T5 conversion (#17160)
Register UMT5Model as a supported architecture variant for T5 model conversion.
This allows the conversion to work for models downloaded with AutoModel.
2025-11-11 09:38:30 +01:00
lhez
ece0f5c177 opencl: add fastdiv and use it in set_rows, ported from cuda (#17090)
* opencl: add fastdiv for mm q8_0

* opencl: use uint4 for fastdiv vals

* opencl: use fastdiv for set_rows

* opencl: do not use fastdiv for q8_0 mm
b7018
2025-11-10 15:00:13 -08:00
Sigbjørn Skjæret
7bef684118 models : move build_inp_out_ids outside loop (#17151)
* move build_inp_out_ids outside loop

* realign
b7017
2025-11-10 22:55:30 +01:00
Max Krasnyansky
395e286bc9 cpu: skip NOPs to avoid barriers (#17133)
* cpu: skip NOPs to avoid barriers

* cpu: use ggml_op_is_empty
b7016
2025-11-10 12:44:49 -08:00
Georgi Gerganov
13730c183b metal : cap threadgroups size of set_rows (#17146) b7015 2025-11-10 21:33:35 +02:00
Adrien Gallouët
967eb4b2bf ggml-cpu : inspect -march and -mcpu to find the CPU (#16333)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b7014
2025-11-10 21:03:36 +02:00
Ruben Ortlam
f117be185e vulkan: check glslc executable string (#17144) b7013 2025-11-10 16:59:26 +01:00
Ruben Ortlam
85234a4b3a vulkan: fix validation issue introduced by #16868 (#17145) b7012 2025-11-10 16:59:10 +01:00
Gabe Goodhart
0c74f32632 memory: Hybrid context shift (#17009)
* feat(memory): Only fail partial erasure of recurrent tail

The recurrent state is always assumed to be the state as of the last update
from the final token in the sequence. When doing a partial erasure, if the
range does not include the final token, the erasure can be considered a
success since any memory used for the sequence prior to the final token
(which is no memory) has been successfully removed.

There is one potential case that this doesn't address which is the pruning
of cache to remove sensitive data from the context. This wouldn't work for
attention cache partial removal (in the middle) either since the KV state
is linearly-dependent and states in later sequence positions would still be
based on the state from the sensitive data, even if that data is no longer
cached, so I don't think this is relevant, but it is worth noting that the
semantics of this change for a partial erasure in the middle of the cache
are essentially "my context is already compressed" and not "all trace of
the removed tokens has been removed."

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(main): Check the output of seq_rm for prefix matching

This prefix matching is explicitly attempting to remove the tokens at the
end of the sequence that don't match. This is the operation that can't be
performed on a recurrent cache due to the state being updated in place, so
if this removal fails, we need to clear the whole cache.

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(memory): Fix condition for partial erasure failure if p0 > pos

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: compilade <git@compilade.net>

* style: Fix extra parens

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix(main.cpp): Set n_matching_session_tokens to 0 on cache clear

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b7011
2025-11-10 17:14:23 +02:00
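The erasure rule described in the hybrid context shift commit above (#17009) can be read as a small decision function. The sketch below is one interpretation of the commit message, not the code from llama.cpp: the names `pos_t`, `recurrent_seq_rm_ok` and `tokens_reusable` are hypothetical, the range is assumed to be half-open `[p0, p1)`, and negative bounds are assumed to mean "open-ended".

```cpp
#include <cstdint>
#include <limits>

using pos_t = int32_t;  // hypothetical position type

// Can the range [p0, p1) be erased from a recurrent state whose last
// update came from the token at position last_pos?
bool recurrent_seq_rm_ok(pos_t p0, pos_t p1, pos_t last_pos) {
    if (p0 < 0) p0 = 0;
    if (p1 < 0) p1 = std::numeric_limits<pos_t>::max();

    const bool covers_last = p0 <= last_pos && last_pos < p1;
    if (!covers_last) {
        // Nothing is stored for earlier (or later) positions, so there is
        // nothing to remove and the erasure counts as a success.
        return true;
    }
    // Erasing the whole sequence is fine; erasing only a tail that includes
    // the final token is not, because the state was updated in place.
    return p0 == 0;
}

// Prefix-matching caller: keep the matching prefix if the mismatched tail
// can be removed, otherwise report 0 so the caller clears the whole cache.
int32_t tokens_reusable(int32_t n_matching, pos_t last_pos) {
    return recurrent_seq_rm_ok(n_matching, -1, last_pos) ? n_matching : 0;
}
```

Under these assumptions, removing positions 2 to 5 of a sequence ending at position 9 succeeds as a no-op, while removing everything from position 5 onward fails and forces a full cache clear.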
Georgi Gerganov
c27efd2bd1 metal : enable tensor API for A19 (#17087) b7010 2025-11-10 15:38:42 +02:00
fj-y-saito
df70bedda7 arm64: add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_… (#15277)
* add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_q8_K

* Surround SVE function with compiler directive

* fix compile switch

* fix coding style

* ggml : fix indent

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b7009
2025-11-10 15:12:59 +02:00
Georgi Gerganov
f914544b16 batched-bench : add "separate text gen" mode (#17103) b7008 2025-11-10 12:59:29 +02:00
Xuan-Son Nguyen
4b13a684c5 mtmd: fix patch_size initialized to random value in audio models (#17128)
* mtmd: fix patch_size initialized to random value in audio models

* add default hparams
b7007
2025-11-10 11:41:05 +01:00
Georgi Gerganov
9898b57cbe editorconfig : ignore benches/ (#17140)
[no ci]
2025-11-10 12:17:19 +02:00
Acly
1032256ec9 cuda/vulkan : bicubic interpolation (#17022)
* vulkan : implement upscale with bicubic interpolation

* cuda : implement upscale with bicubic interpolation

* tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests

* adapt OpenCL backend to not support the OP in that case so tests don't fail

* print scale mode & flags in test-backend-ops
b7005
2025-11-10 10:19:39 +01:00
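As background for the bicubic upscale commit above (#17022), here is a minimal 1-D sketch of cubic convolution interpolation. Bicubic interpolation applies the same weights along both image axes. The kernel parameter `a = -0.75`, the border clamping, and the names `cubic_weight` and `sample_cubic` are assumptions made for illustration; the commit message does not state which variant the backends use.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Cubic convolution kernel. `a` is the free parameter of the kernel family;
// -0.75 is an assumed value, not taken from the commit.
float cubic_weight(float x, float a = -0.75f) {
    x = std::fabs(x);
    if (x <= 1.0f) {
        return ((a + 2.0f) * x - (a + 3.0f)) * x * x + 1.0f;
    }
    if (x < 2.0f) {
        return ((a * x - 5.0f * a) * x + 8.0f * a) * x - 4.0f * a;
    }
    return 0.0f;
}

// Sample a 1-D signal at fractional position pos using four neighbouring
// taps, clamping tap indices at the borders.
float sample_cubic(const std::vector<float> & src, float pos) {
    assert(!src.empty());
    const int   n  = static_cast<int>(src.size());
    const int   i0 = static_cast<int>(std::floor(pos));
    const float t  = pos - static_cast<float>(i0);

    float acc = 0.0f;
    for (int k = -1; k <= 2; ++k) {
        const int idx = std::min(std::max(i0 + k, 0), n - 1);
        acc += cubic_weight(t - static_cast<float>(k)) * src[idx];
    }
    return acc;
}
```

At integer positions the kernel gives weight 1 to the sample itself and 0 to its neighbours, so existing samples are reproduced and only the in-between values are interpolated.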
Georgi Gerganov
15274c0c50 benches : add eval results (#17139)
[no ci]
2025-11-10 10:44:10 +02:00
Georgi Gerganov
b8595b16e6 mtmd : fix embedding size for image input (#17123) b7003 2025-11-09 18:31:02 +02:00
Ruben Ortlam
392e09a608 vulkan: fix memory allocations (#17122) b7002 2025-11-09 16:14:41 +01:00
compilade
802cef44bf convert : parse safetensors directly (#15667)
* convert : parse safetensors directly

* gguf-py : order safetensors tensors by name

Applies to both local and remote safetensors custom parsing.
This matches the behavior of the official safetensors implementation.

* convert : rename from_safetensors_meta to from_local_tensor

For consistency with from_remote_tensor

* convert : fix no-lazy dtypes from direct safetensors
2025-11-09 09:49:40 -05:00
compilade
1c07c0c68c convert : handle compressed-tensors quant method (#17069)
* convert : handle compressed-tensors quant method

* convert : handle int-quantized models

* convert : handle naive-quantized models

* gguf-py : __pos__ is also unary

* convert : fix flake8 lint

* convert : use F32 for dequant of pack-quantized tensors
2025-11-09 09:45:50 -05:00
Georgi Gerganov
cb1adf8851 server : handle failures to restore host cache (#17078)
* server : handle failures to restore host cache

* server : add tests for the prompt cache
b6999
2025-11-09 14:27:05 +02:00
Georgi Gerganov
ef1d826997 benches : add folder with benchmarks (#16931)
* benches : add folder with benchmarks

* benches : update dgx-spark bench
2025-11-09 12:53:29 +02:00
Eric Curtin
86fde91e62 Switch to using Ubuntu 25.10 vulkan/mesa (#16497)
Because "Ubuntu packages to be discontinued in Vulkan SDK"

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
2025-11-09 10:25:38 +01:00
Ruben Ortlam
7f3e9d339c vulkan: iGPU memory reporting fix (#17110)
* vulkan: use all device-local heaps for memory availability reporting

Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>

* use all available heaps for iGPU memory reporting

* Allow multiple memory types per buffer request for devices with split heaps

---------

Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>
b6996
2025-11-09 09:54:47 +01:00
Ruben Ortlam
8a3519b708 vulkan: fix mmq out of bounds reads (#17108)
* vulkan: fix mmq out of bounds reads, streamline outdated matmul host code

* fix mul_mat_id quantization call

* Fix compiler warnings
b6995
2025-11-09 09:52:57 +01:00
Jeff Bolz
80a6cf6347 vulkan: fuse mul_mat_id + mul (#17095)
* vulkan: fuse mul_mat_id + mul

This comes up in qwen3 moe.

* split mul_mat_id fusion tests into a separate class
b6994
2025-11-09 09:48:42 +01:00
Georgi Gerganov
0750a59903 metal : retain src and dst buffers during async ops (#17101) b6993 2025-11-09 08:28:51 +02:00
Xuan-Son Nguyen
aa3b7a90b4 arg: add --cache-list argument to list cached models (#17073)
* arg: add --cache-list argument to list cached models

* new manifest naming format

* improve naming

* Update common/arg.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b6992
2025-11-08 21:54:14 +01:00
chansikpark
333f2595a3 webui: fix keyboard shortcuts for new chat & edit chat title (#17007) 2025-11-08 20:52:35 +01:00
Jeff Bolz
53d7d21e61 vulkan: Use spec constants for conv2d s/d/p and kernel W/H (#16978)
* vulkan: Use spec constants for conv2d s/d/p and kernel W/H

Also add some additional unroll hints, which seem to help.

* lock around map lookup
b6990
2025-11-08 13:24:29 -06:00