Commit Graph

7003 Commits

Author SHA1 Message Date
Georgi Gerganov
b8595b16e6 mtmd : fix embedding size for image input (#17123) b7003 2025-11-09 18:31:02 +02:00
Ruben Ortlam
392e09a608 vulkan: fix memory allocations (#17122) b7002 2025-11-09 16:14:41 +01:00
compilade
802cef44bf convert : parse safetensors directly (#15667)
* convert : parse safetensors directly

* gguf-py : order safetensors tensors by name

Applies to both local and remote safetensors custom parsing.
This matches the behavior of the official safetensors implementation.

* convert : rename from_safetensors_meta to from_local_tensor

For consistency with from_remote_tensor

* convert : fix no-lazy dtypes from direct safetensors
2025-11-09 09:49:40 -05:00
compilade
1c07c0c68c convert : handle compressed-tensors quant method (#17069)
* convert : handle compressed-tensors quant method

* convert : handle int-quantized models

* convert : handle naive-quantized models

* gguf-py : __pos__ is also unary

* convert : fix flake8 lint

* convert : use F32 for dequant of pack-quantized tensors
2025-11-09 09:45:50 -05:00
Georgi Gerganov
cb1adf8851 server : handle failures to restore host cache (#17078)
* server : handle failures to restore host cache

* server : add tests for the prompt cache
b6999
2025-11-09 14:27:05 +02:00
Georgi Gerganov
ef1d826997 benches : add folder with benchmarks (#16931)
* benches : add folder with benchmarks

* benches : update dgx-spark bench
2025-11-09 12:53:29 +02:00
Eric Curtin
86fde91e62 Switch to using Ubuntu 25.10 vulkan/mesa (#16497)
Because "Ubuntu packages to be discontinued in Vulkan SDK"

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
2025-11-09 10:25:38 +01:00
Ruben Ortlam
7f3e9d339c vulkan: iGPU memory reporting fix (#17110)
* vulkan: use all device-local heaps for memory availability reporting

Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>

* use all available heaps for iGPU memory reporting

* Allow multiple memory types per buffer request for devices with split heaps

---------

Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>
b6996
2025-11-09 09:54:47 +01:00
Ruben Ortlam
8a3519b708 vulkan: fix mmq out of bounds reads (#17108)
* vulkan: fix mmq out of bounds reads, streamline outdated matmul host code

* fix mul_mat_id quantization call

* Fix compiler warnings
b6995
2025-11-09 09:52:57 +01:00
Jeff Bolz
80a6cf6347 vulkan: fuse mul_mat_id + mul (#17095)
* vulkan: fuse mul_mat_id + mul

This comes up in qwen3 moe.

* split mul_mat_id fusion tests into a separate class
b6994
2025-11-09 09:48:42 +01:00
Georgi Gerganov
0750a59903 metal : retain src and dst buffers during async ops (#17101) b6993 2025-11-09 08:28:51 +02:00
Xuan-Son Nguyen
aa3b7a90b4 arg: add --cache-list argument to list cached models (#17073)
* arg: add --cache-list argument to list cached models

* new manifest naming format

* improve naming

* Update common/arg.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b6992
2025-11-08 21:54:14 +01:00
chansikpark
333f2595a3 webui: fix keyboard shortcuts for new chat & edit chat title (#17007) 2025-11-08 20:52:35 +01:00
Jeff Bolz
53d7d21e61 vulkan: Use spec constants for conv2d s/d/p and kernel W/H (#16978)
* vulkan: Use spec constants for conv2d s/d/p and kernel W/H

Also add some additional unroll hints, which seems to help.

* lock around map lookup
b6990
2025-11-08 13:24:29 -06:00
Aidan
eeee367de5 server: fix correct time_ms calculation in prompt_progress (#17093)
* fix: correct time_ms calculation in send_partial_response

The time_ms field was incorrectly calculated. The division was happening
before the subtraction leading to incorrect values.

Before: (ggml_time_us() - slot.t_start_process_prompt / 1000) After:
(ggml_time_us() - slot.t_start_process_prompt) / 1000

* docs : document time_ms field in prompt_progress
b6989
2025-11-08 15:12:11 +02:00
Aman Gupta
64fe17fbb8 Revert "CUDA: add expert reduce kernel (#16857)" (#17100) b6988 2025-11-08 21:05:19 +08:00
Aman Gupta
c1b187688d CUDA: skip fusion for repeating adds in bias (#17080) b6987 2025-11-08 16:58:05 +08:00
SavicStefan
b8a5cfd11a vulkan: Increase BK to 32; use BK/4 for non-CM mul_mm.comp (#16636)
Signed-off-by: Stefan Savic <stefan.savic@huawei.com>
Co-authored-by: Stefan Savic <stefan.savic@huawei.com>
b6986
2025-11-08 09:28:22 +01:00
Aleksei Nikiforov
08416ebe7f ggml: disable vxe for cross-compilation by default (#16966)
Otherwise compilation will fail due to enabling -mvx -mzvector
and not setting corresponding -march options.
b6985
2025-11-08 16:00:20 +08:00
Jeff Bolz
b4e335d8dc vulkan: fuse rms_norm + mul + rope (+ view + set_rows) (#16977)
This change combines the rms_norm+mul and rope+view+set_rows fusions to
allow fusing the whole sequence together. This comes up in Qwen3, Bailing,
and some other models.
b6984
2025-11-08 08:52:15 +01:00
Jeff Bolz
d6fe40fa00 vulkan: Fix test-thread-safety crashes (#17024)
The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the
same time, which needs to hold the lock. To be safe, hold the lock for all of
ggml_vk_load_shaders.
b6983
2025-11-08 08:39:45 +01:00
Johannes Gäßler
e14e842e87 CUDA: fix MMQ stream-k fixup ne1 indices (#17089) b6982 2025-11-08 08:26:18 +01:00
Reese Levine
647b960bd8 ggml webgpu: faster matrix multiplication/matrix-vector multiplication (#17031)
* Faster tensors (#8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings
b6981
2025-11-07 19:27:20 -08:00
bssrdf
299f5d782c CUDA: properly handle nb00=nb02 case for cpy (#17081) b6980 2025-11-07 23:41:58 +01:00
Acly
ac76d36201 vulkan : refactor buffer handling in vk_op_f32 (#16840)
* vulkan : refactor/simplify buffer handling in vk_op_* functions

* Combine UMA handling into ggml_vk_tensor_subbuffer
b6979
2025-11-07 21:08:50 +01:00
Johannes Gäßler
6515610506 CUDA: fix should_use_mmvf for ne11 == 1 (#17085)
* CUDA: fix should_use_mmvf for ne11 == 1

* Apply suggestion from @am17an

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
b6978
2025-11-07 20:53:14 +01:00
Georgi Gerganov
7956bb4d7f bench : cache the llama_context state at computed depth (#16944)
* bench : cache llama_context state at depth

* cont : handle failures to restore the old state

* cont : print information when the state is being reused
b6977
2025-11-07 21:23:11 +02:00
Sigbjørn Skjæret
9008027aa3 hparams : add n_embd_inp() to support extended embed (#16928)
* add n_embd_full to support extended embed

* don't change output

* rename to n_embd_inp

* restore n_embd where applicable
b6976
2025-11-07 19:27:58 +01:00
Georgi Gerganov
16bcc1259d kv-cache : pad the cache size to 256 for performance (#17046)
* kv-cache : pad the size of the small SWA cache for performance

* context : pad the total context to 256

* cont : future-proof the swa pad

* server : adjust test params to new logic
b6975
2025-11-07 20:03:25 +02:00
Adrien Gallouët
9eb9a1331d Revert "ggml-cpu: detect correct cpu flags for arm64 (#16229) (#16239)" (#17084)
This reverts commit 7c23f3f0d4.
b6974
2025-11-07 18:34:05 +02:00
iron
7c23f3f0d4 ggml-cpu: detect correct cpu flags for arm64 (#16229) (#16239)
When using GCC 9 and GCC 12 on the arm64 platform of ubuntu 2004,
the command "gcc -mcpu=native -E -v -" fails to detect the correct CPU flags,
which results in compilation failures for certain extended instructions,
but the correct CPU flags can be obtained by using gcc -march.

Signed-off-by: lizhenneng <lizhenneng@kylinos.cn>
Co-authored-by: lizhenneng <lizhenneng@kylinos.cn>
b6973
2025-11-07 08:18:14 -08:00
Georgi Gerganov
8c0d6bb455 server : print the samplers chain for each request (#17070) b6972 2025-11-07 12:24:47 +02:00
Xuan-Son Nguyen
5c9a18e674 common: move download functions to download.(cpp|h) (#17059)
* common: move download functions to download.(cpp|h)

* rm unused includes

* minor cleanup

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b6971
2025-11-07 11:23:34 +01:00
xctan
7f09a680af ggml-cpu : optimize RVV q2_k and q3_k kernels (#16887) b6970 2025-11-06 18:12:45 +02:00
Johannes Gäßler
aa374175c3 CUDA: fix crash on uneven context without FA (#16988) b6969 2025-11-06 14:05:47 +01:00
Georgi Gerganov
5b180c3d60 metal : initial Metal4 tensor API support (#16634)
* metal : rework mat-mat multiplication

* metal : initial Metal4 support

* cont

* metal : detect tensor support

* cont : better ifdefs

* metal : support tensors in mul_mm_id

* metal : add env for disabling tensor API

* tests : restore

* metal : remove unused constants

* metal : fix check for bfloat tensor support

* cont : handle API incompatibilities

* cont : handle even more incompatibilities

* metal : use tensor API only on M5 and later
b6968
2025-11-06 14:45:10 +02:00
Georgi Gerganov
b7f9010d24 server : disable checkpoints with mtmd (#17045) b6967 2025-11-06 12:09:29 +02:00
Xuan-Son Nguyen
4882f0ff78 clip: implement minicpm-v sinusoidal embd using GGML (#17036)
* clip: implement minicpm-v sinusoidal embd using GGML

* fix repeat op
b6966
2025-11-06 11:02:54 +01:00
YehuditE
9d7c518d64 sycl: add CONCAT operator support (#16047)
* sycl: add CONCAT operator support

* cleanup: remove stray lines added by mistake

* fix: code format issues in concat.cpp and tests/test-backend-ops.cpp

* chore: fix editorconfig violations

* cleanup: drop unnecessary i16 type support

* docs: update sycl-csv and regenerate ops.md

* update docs/ops.md

* fix: adapt to upstream master changes after rebase

* fix: remove empty files

* fix: drop whitespace

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b6965
2025-11-06 11:02:33 +01:00
Johannes Gäßler
22c8c3c6ad docs: explain CUDA 11 compilation [no ci] (#16824) 2025-11-06 08:14:35 +01:00
l3utterfly
6db3d1ffe6 ggml-hexagon: graceful fallback for older socs where rpcmem_alloc2 and FASTRPC_GET_URI is unsupported (#16987)
* support older socs where FASTRPC_GET_URI is unsupported

* added graceful fallback when FASTRPC_GET_URI call fails

* use weak symbols instead of loading libcdsprpc.so dynamically

* Add weak pragma for rpcmem_alloc2

* Remove weak declaration for rpcmem_alloc2 in ggml-hexagon.cpp

Removed weak declaration for rpcmem_alloc2.

* Enforce ndev to 1 for archs below v75

Force ndev to 1 for SoCs architectures lower than v75.
b6963
2025-11-05 21:46:38 -08:00
bssrdf
230d1169e5 improve CUDA cpy memory bandwidth when copying transposed tensor (#16841)
* WIP

* added a cpy kernel specific to transposed tensor which uses smem to avoid uncoalesced access; test cases also added shwoing improved memory bandwidth

* added BF16 support

* more strict check to make sure src0 is a transpose

* reformulated to handle more complicated transpose cases

* bring back 2D transpose for higher performance

* allow build on windows

* tranpose copy more shapes

* minor tweak

* final clean up

* restore some test cases

* keep only the kernel for true tranposed case; updated with review suggestions

* make CI happy

* remove headers not needed

* reduced bank conflicts for fp16 and bf16

* add missing const*

* now bank conflicts free

* use padding instead of swizzling

---------

Co-authored-by: bssrdf <bssrdf@gmail.com>
b6962
2025-11-05 21:55:04 +01:00
Jeff Bolz
a44d77126c vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (#16919) b6961 2025-11-05 19:51:03 +01:00
Gabe Goodhart
5886f4f545 examples(gguf): GGUF example outputs (#17025)
* feat(llama-gguf): Print out the tensor type in llama-gguf r

Branch: Mamba2Perf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(off-topic): print the number of elements in tensors with llama-gguf

Branch: Mamba2SSD

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: valign

Branch: GGUFToolOutputs

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Update examples/gguf/gguf.cpp

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b6960
2025-11-05 19:58:16 +02:00
Xuan-Son Nguyen
92bb84f775 mtmd: allow QwenVL to process larger image by default (#17020) b6959 2025-11-05 14:26:49 +01:00
Georgi Gerganov
13b339bcd9 server : do not default to multiple slots with speculative decoding (#17017)
* server : do not default to multiple slots with speculative decoding

* cont : fix
b6958
2025-11-05 14:32:55 +02:00
Xuan-Son Nguyen
2f0c2db43e mtmd: improve struct initialization (#16981) b6957 2025-11-05 11:26:37 +01:00
손희준
fd2f84f468 docs: Clarify the endpoint that webui uses (#17001) 2025-11-05 11:20:28 +01:00
Li Pengzhan
9f052478c2 model : add openPangu-Embedded (#16941)
* Model: add openPangu-Embedded

* fixed according to reviewer's comments

* fixed the chat template check condition

* Apply suggestions from code review

change the chat-template check condition and some formatting issue

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* whitespace cleanup

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b6955
2025-11-05 10:28:58 +01:00
Reese Levine
03ea04175d ggml webgpu: minor set rows optimization (#16810)
* Add buffer label and enable dawn-specific toggles to turn off some checks

* Minor set_rows optimization (#4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles

* Remove some comments

* Implement overlap binary operators

* Revert "Implement overlap binary operators"

This reverts commit ed710b36f5.

* Disable support for non-contiguous binary_op tensors and leave note for future support

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
b6954
2025-11-05 10:27:42 +01:00