Commit Graph

7095 Commits

Author SHA1 Message Date
Georgi Gerganov
6fd5f0ec4e graph : reuse hybrid graphs 2025-11-18 11:27:11 +02:00
Chenguang Li
bc4064cfea CANN: fix acl_tensor_ptr usage in ASCEND_310P ROPE (#17347)
* cann: fix acl_tensor_ptr usage in ASCEND_310P ROPE implementation

Fix compilation errors in the ASCEND_310P-specific ROPE operation code
by adding .get() calls when passing acl_tensor_ptr smart pointers to
functions expecting raw aclTensor* pointers.

This fixes the code that was missed in the previous refactoring commit
(8981848) which changed ggml_cann_create_tensor() return type from
aclTensor* to acl_tensor_ptr.

* cann: format code
2025-11-18 16:41:52 +08:00
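The fix described in this commit (adding .get() where a raw pointer is expected) can be illustrated with a minimal sketch. It assumes the smart pointer behaves like a std::unique_ptr with a custom deleter; fake_tensor, fake_tensor_op and the other names below are hypothetical stand-ins, not the CANN or ggml API.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for an opaque C handle such as aclTensor.
struct fake_tensor { int id; };

// Counts destroy calls so that cleanup can be observed.
static int g_destroyed = 0;

// Hypothetical C-style API: create, destroy, and an operation taking a raw pointer.
fake_tensor * fake_tensor_create(int id) { return new fake_tensor{id}; }
void fake_tensor_destroy(fake_tensor * t) { ++g_destroyed; delete t; }
int fake_tensor_op(const fake_tensor * t) { assert(t != nullptr); return t->id * 2; }

// Deleter so that std::unique_ptr calls the C destroy function.
struct fake_tensor_deleter {
    void operator()(fake_tensor * t) const { fake_tensor_destroy(t); }
};
using fake_tensor_ptr = std::unique_ptr<fake_tensor, fake_tensor_deleter>;

// Factory returning an owning smart pointer instead of a raw pointer.
fake_tensor_ptr make_tensor(int id) { return fake_tensor_ptr(fake_tensor_create(id)); }

// The caller keeps ownership; the raw-pointer function only borrows via .get().
int run_op(int id) {
    fake_tensor_ptr t = make_tensor(id);
    return fake_tensor_op(t.get()); // passing `t` itself would not compile
}
```

Calling run_op(21) borrows the handle for the duration of the call, and the deleter runs once when the smart pointer goes out of scope.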
o7si
97cb3fd5ae fix: resolve undefined variable 'svr' compilation error (#17348) 2025-11-18 10:10:47 +02:00
jiahao su
ffa277a54c CANN: Add openEuler-cann in build and release (#17192)
Update openEuler version

Remove variable ASCEND_SOC_TYPE

Modify the chip type

Fix case in zip filename

Change "device" to "chip_type"

Modify the value of chip_type
2025-11-18 16:08:55 +08:00
Jeff Bolz
da95bf2a85 vulkan: support noncontig i32 copy (#17328) b7091 2025-11-18 07:41:24 +01:00
Xuan-Son Nguyen
0de8878c96 server: split HTTP into its own interface (#17216)
* server: split HTTP into its own interface

* move server-http and httplib to its own file

* add the remaining endpoints

* fix exception/error handling

* renaming

* missing header

* fix missing windows header

* fix error responses from http layer

* fix slot save/restore handler

* fix case where only one stream chunk is returned

* add NOMINMAX

* do not call sink.write on empty data

* use safe_json_to_str for SSE

* clean up

* add some comments

* improve usage of next()

* bring back the "server is listening on" message

* more generic handler

* add req.headers

* move the chat template print to init()

* add req.path

* cont : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b7090
2025-11-17 22:05:44 +01:00
Ruben Ortlam
38e2c1b412 vulkan: add log RTE support to fix Nvidia CI (#17320)
* vulkan: add log RTE support to fix Nvidia CI

* actually use the rte shader
b7089
2025-11-17 14:37:49 -06:00
Adrien Gallouët
cb44fc84e8 cmake : fix ARM feature verification (#17170)
* cmake : fix ARM feature verification

Use check_cxx_source_compiles to prevent conflicts with
the existing GGML_NATIVE detection code.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* cmake : unset __ARM_FEATURE when feature is disabled

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* cmake : fix scope, this is really a macro

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* arm_neon.h is useless

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b7088
2025-11-17 21:37:29 +01:00
Adrien Gallouët
cb623de3fc ggml : add missing AVX512 feature checks (#17270)
_mm512_cvtepu8_epi16        requires  __AVX512BW__
_mm512_srli_epi16           requires  __AVX512BW__
__builtin_ia32_inserti32x8  requires  __AVX512DQ__

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b7087
2025-11-17 12:12:00 +01:00
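The requirement listed in this commit can be sketched as a preprocessor guard: the AVX512BW intrinsic is only compiled when the compiler defines __AVX512BW__, with a scalar fallback otherwise. The function widen_u8_to_u16 is a hypothetical example and not code from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX512F__) && defined(__AVX512BW__)
#include <immintrin.h>
#endif

// Widen n bytes to 16-bit values. The vector path is compiled only when the
// compiler reports both AVX512F and AVX512BW; otherwise the scalar loop runs.
void widen_u8_to_u16(const uint8_t * src, uint16_t * dst, std::size_t n) {
    assert(src != nullptr && dst != nullptr);
    std::size_t i = 0;
#if defined(__AVX512F__) && defined(__AVX512BW__)
    for (; i + 32 <= n; i += 32) {
        // Load 32 bytes and zero-extend each to 16 bits (requires AVX512BW).
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
        __m512i w = _mm512_cvtepu8_epi16(v);
        _mm512_storeu_si512((void *)(dst + i), w);
    }
#endif
    // Scalar tail, and the full fallback when the feature is unavailable.
    for (; i < n; ++i) {
        dst[i] = src[i];
    }
}
```

Guarding on the specific feature macro rather than only on __AVX512F__ avoids build failures on targets that have the foundation subset but lack BW or DQ.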
Georgi Gerganov
7aaeedc098 metal : support I32 -> I32 copy (#17317) b7086 2025-11-17 11:52:00 +02:00
Georgi Gerganov
3347e6d904 metal : faster argsort (#17315)
* metal : faster argsort

* cont : keep data in registers
b7085
2025-11-17 11:51:48 +02:00
Georgi Gerganov
1a139644a8 metal : add cumsum (#17305) b7084 2025-11-17 11:51:13 +02:00
hipudding
2376b7758c CANN: Use smart pointers to manage ACL objects (#17238)
* CANN: Use smart pointers to manage ACL objects

Previously, ACL objects were managed via manual destruction, which
led to multiple memory-leak issues during runtime. This patch replaces
manual memory management with smart pointers so that ACL objects
are properly released and ownership is clearly defined.

Note that the ownership of an ACL object belongs to the function
that creates it. Other internal functions should operate on these ACL
objects using raw pointers to avoid unintended ownership transfers.

Additionally, since aclTensorList automatically frees its contained
aclTensor objects, any aclTensor added to a tensor list must release
ownership to avoid double free operations.

This PR also removes the asynchronous task submission mechanism.
Due to changes in recent CANN versions, tiling time has significantly
decreased. Even with a dual-thread submission model, the dispatch
overhead still falls on the critical path, making async submission
less beneficial. Moreover, aclGraph support provides a much better
path to reducing operator dispatch latency.

* CANN: resolve review comments
b7083
2025-11-17 08:43:59 +08:00
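The ownership rule in this commit message (a tensor added to a list must give up ownership, because the list frees its elements) can be sketched as follows. This assumes std::unique_ptr-style smart pointers; demo_tensor, demo_list and their functions are hypothetical stand-ins and not the ACL API.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical element type and a counter to observe how often it is freed.
struct demo_tensor { int id; };
static int g_demo_frees = 0;
void demo_tensor_destroy(demo_tensor * t) { ++g_demo_frees; delete t; }

// Hypothetical list that frees its contained tensors when destroyed.
struct demo_list { std::vector<demo_tensor *> items; };
demo_list * demo_list_create(demo_tensor * const * ts, std::size_t n) {
    assert(ts != nullptr);
    return new demo_list{ std::vector<demo_tensor *>(ts, ts + n) };
}
void demo_list_destroy(demo_list * l) {
    for (demo_tensor * t : l->items) demo_tensor_destroy(t);
    delete l;
}

struct demo_tensor_deleter { void operator()(demo_tensor * t) const { demo_tensor_destroy(t); } };
struct demo_list_deleter   { void operator()(demo_list * l) const { demo_list_destroy(l); } };
using demo_tensor_ptr = std::unique_ptr<demo_tensor, demo_tensor_deleter>;
using demo_list_ptr   = std::unique_ptr<demo_list, demo_list_deleter>;

int sum_ids_via_list() {
    demo_tensor_ptr a(new demo_tensor{1});
    demo_tensor_ptr b(new demo_tensor{2});
    // release() hands ownership to the list, so each tensor is freed exactly once.
    demo_tensor * raw[2] = { a.release(), b.release() };
    demo_list_ptr list(demo_list_create(raw, 2));
    int sum = 0;
    for (const demo_tensor * t : list->items) sum += t->id;
    return sum;
}
```

Using .get() instead of .release() here would leave both the smart pointers and the list responsible for the same objects, which is the double free the commit message warns about.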
Pavels Zaicenkovs
dbed61294a vulkan: add LOG operation support for F32 and F16 (#17183)
* vulkan: add LOG operation support for F32 and F16

Part of #14909.

* vulkan: Fix LOG operation types

* docs: Update operation support documentation for Vulkan LOG operation

* vulkan: fix log_f16 shader

* docs: restore missing LOG test cases and regenerate ops.md
b7082
2025-11-16 22:50:09 +01:00
Ruben Ortlam
80deff3648 vulkan: fix MMQ quantize_y condition (#17301) b7081 2025-11-16 19:38:17 +01:00
Eve
8b1c339bd2 ci : revert #16249 (#17303)
* Delete .github/workflows/build-amd.yml

* Update build.yml
2025-11-16 19:09:17 +01:00
Georgi Gerganov
416e7c7f47 metal : remove obsolete asserts (#17295) b7079 2025-11-16 09:50:26 +02:00
Georgi Gerganov
5b2093becc server : handle context overflow during decode (#17267)
* server : handle context overflow during decode

* server : minor refactor
b7078
2025-11-16 09:23:37 +02:00
lhez
52e5d421f1 opencl: fix rms_norm_mul (#17250)
* opencl: use subgroup reduce for reduction in rms_norm_mul

* opencl: add comment about workgroup size
b7077
2025-11-15 17:40:14 -08:00
shaofeiqi
4db5641210 opencl: add kernel to handle mat mul in attention to improve encoding speed (#17181)
* Add mul_mm_f16_f32_kq_kqv kernel

* Add ggml_cl_mul_mat_kq_kqv_adreno func

* fix whitespace

* remove unused variable

* remove redundant

* refactor and clean up

* remove trailing whitespace
b7076
2025-11-15 17:33:10 -08:00
shani-f
72bd7321a7 sycl : unify unary kernels with a generic implementation and enable wide operator support (#17213)
* SYCL: add generic unary op implementation for multiple ops (ABS/SGN/…); unify non-contiguous access

* SYCL: update documentation and sycl.csv to reflect new unary op support

* update ops.md after syncing SYCL.csv changes

* Fix SYCL.csv merge conflict

* Update ops.md after fixing SYCL.csv conflicts

* Fix SYCL.csv tail after merge conflict and regenerate ops.md

* Fix line endings and final newline in SYCL.csv

* Remove TOPK_MOE entries from SYCL.csv as requested

* Update ops.md after removing TOPK_MOE from SYCL.csv

* Regenerated SYCL.csv and synced ops.md with upstream

* Update ops.md using create_ops_docs.py
b7075
2025-11-16 00:52:42 +01:00
Aleksander Grygier
22e1ce2f81 webui: Fix clickability around chat processing statistics UI (#17278)
* fix: Better pointer events handling in chat processing info elements

* chore: update webui build output
2025-11-15 22:41:41 +01:00
Pascal
1411d9275a webui: add OAI-Compat Harmony tool-call streaming visualization and persistence in chat UI (#16618)
* webui: add OAI-Compat Harmony tool-call live streaming visualization and persistence in chat UI

- Purely visual and diagnostic change, no effect on model context, prompt
  construction, or inference behavior

- Captured assistant tool call payloads during streaming and non-streaming
  completions, and persisted them in chat state and storage for downstream use

- Exposed parsed tool call labels beneath the assistant's model info line
  with graceful fallback when parsing fails

- Added tool call badges beneath assistant responses that expose JSON tooltips
  and copy their payloads when clicked, matching the existing model badge styling

- Added a user-facing setting to toggle tool call visibility to the Developer
  settings section directly under the model selector option

* webui: remove scroll listener causing unnecessary layout updates (model selector)

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* chore: npm run format & update webui build output

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2025-11-15 21:09:32 +01:00
Sigbjørn Skjæret
662192e1dc convert : remove unnecessary chat template patching (#17289) 2025-11-15 20:58:59 +01:00
Jeff Bolz
24dc769f1b vulkan: Fuse mul_mat_id+add_id+mul and mul_mat+add+add. (#17287)
These both show up in gpt-oss. Also, clean up the mul_mat_vec fusion code a bit.
b7071
2025-11-15 19:54:23 +01:00
Ruben Ortlam
4dca015b7e vulkan: Replace 16-bit unpack8 calls to work around legacy Windows AMD driver bug (#17285) b7070 2025-11-15 15:18:58 +01:00
Sigbjørn Skjæret
9a8860cf5d convert : use all parts in safetensors index (#17286) 2025-11-15 14:12:39 +01:00
Sigbjørn Skjæret
9d3ef4809f convert : set expert gating func in base class (#17279) 2025-11-15 14:06:24 +01:00
Ankur Verma
c7b7db0445 mtmd-cli: Avoid logging to stdout for model loading messages in mtmd-cli (#17277) b7067 2025-11-15 12:41:16 +01:00
Giuseppe Scrivano
1568d13c2c vulkan: implement ABS and NEG (#17245)
* docs: update Vulkan ops

* vulkan: add NEG op

* vulkan: add ABS op

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
b7066
2025-11-15 12:00:29 +01:00
Jeff Bolz
439342ea0b vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths (#17244)
* vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths

* set allow_misalign
b7065
2025-11-15 11:56:15 +01:00
Jeff Bolz
234ae7d7bd vulkan: skip all-negative-inf blocks in FA (#17186) b7064 2025-11-15 10:37:25 +01:00
Jeff Bolz
38eaf32af1 vulkan: change graph_compute to be async and enable get_tensor_async (#17158)
* vulkan: change graph_compute to be async and enable get_tensor_async

This allows some additional CPU/GPU overlap for large pp workloads. Also seems
to help a bit for token gen, maybe getting rid of a small bubble between
graph_compute and get_tensor.

Async set and copy functions seem to be very rarely used, so I didn't enable
them because I didn't have a good way to test them.

The async commands need to be ordered against each other, so put them all on
the compute queue. The non-async commands still use the transfer queue.

The fence for graph_compute/get_tensor_async is submitted and waited on in
ggml_vk_synchronize.

* fix thread safety errors

* teardown context cleanly

* Handle async read to non-pinned dst
b7063
2025-11-15 09:06:41 +01:00
Xuan-Son Nguyen
9b17d74ab7 mtmd: add mtmd_log_set (#17268) b7062 2025-11-14 15:56:19 +01:00
Bartowski
e1fcf8b09b model : add AfmoeForCausalLM support (#16477)
* Add AFMOE model support

* Update to vocab

* Add model sizing

* Undo Rope change for ARCEE model

* Address review comments

* Update modeling code is_sliding -> use_rope, replace hard-coded logic

* Fix AFMOE tokenizer

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update AFMoE tokenizer class identification to be more unique

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b7061
2025-11-14 13:54:10 +01:00
Marek Hradil jr.
6cd0cf72ce fix : Dangling pointer for non-empty trigger words in lazy grammar construction (#17048)
* fix : Dangling pointer for non-empty trigger words in llama_sampler_init_grammar_impl (#17047)

* Replace 'static' workaround, with keeping variable in scope for longer

* Create std::array directly and pass into llama_grammar_init_impl

* Add back the trigger pattern

* Missed array include
b7060
2025-11-14 14:35:26 +02:00
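The class of bug fixed here can be illustrated with a minimal sketch: a C-string pointer stays valid only while the owning std::string is alive, so the owner is kept in the enclosing scope and the pointer is placed in a std::array passed to the consumer. The functions join_words, total_length and init_from_words are hypothetical and do not reproduce the actual llama.cpp code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical consumer that reads the C strings during the call.
std::size_t total_length(const char * const * patterns, std::size_t n) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        assert(patterns[i] != nullptr);
        total += std::strlen(patterns[i]);
    }
    return total;
}

// Joins words into one alternation pattern, e.g. {"a", "b"} -> "(a|b)".
std::string join_words(const std::vector<std::string> & words) {
    std::string out = "(";
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (i > 0) out += "|";
        out += words[i];
    }
    out += ")";
    return out;
}

std::size_t init_from_words(const std::vector<std::string> & words) {
    // Dangling variant (deliberately not compiled): the temporary string dies
    // at the end of the statement, leaving `p` pointing at freed memory.
    //   const char * p = join_words(words).c_str();
    // Instead, keep the owning string in scope for as long as the pointer is used.
    const std::string pattern = join_words(words);
    const std::array<const char *, 1> patterns = { pattern.c_str() };
    return total_length(patterns.data(), patterns.size());
}
```

Compared with a static variable, keeping the owner in the calling scope avoids shared state between calls while still guaranteeing the pointer's lifetime.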
Georgi Gerganov
d396b43748 server : fix "can batch with" bug (#17263) b7059 2025-11-14 14:03:45 +02:00
Georgi Gerganov
45c6ef7307 metal : support argsort for ne00 > 1024 (#17247)
* metal : refactor argsort

* cont : sort chunks

* cont : merge sorted buckets

* cont : cleanup
b7058
2025-11-14 09:36:06 +02:00
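The two steps named in this commit (sort chunks, then merge the sorted pieces) can be sketched on the CPU as below. This is only an illustration of the general technique under the assumption that rows longer than one chunk are handled by merging; it is not the Metal kernel, and argsort_chunked is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Ascending argsort: sort fixed-size chunks of indices independently,
// then merge neighbouring sorted runs until one run covers the whole row.
std::vector<int> argsort_chunked(const std::vector<float> & x, std::size_t chunk) {
    assert(chunk > 0);
    std::vector<int> idx(x.size());
    std::iota(idx.begin(), idx.end(), 0);
    auto less = [&x](int a, int b) { return x[a] < x[b]; };

    // Step 1: each chunk is sorted on its own (one chunk per workgroup on a GPU).
    for (std::size_t lo = 0; lo < idx.size(); lo += chunk) {
        const std::size_t hi = std::min(lo + chunk, idx.size());
        std::stable_sort(idx.begin() + lo, idx.begin() + hi, less);
    }

    // Step 2: merge pairs of sorted runs, doubling the run width each pass.
    for (std::size_t w = chunk; w < idx.size(); w *= 2) {
        for (std::size_t lo = 0; lo + w < idx.size(); lo += 2 * w) {
            const std::size_t hi = std::min(lo + 2 * w, idx.size());
            std::inplace_merge(idx.begin() + lo, idx.begin() + lo + w, idx.begin() + hi, less);
        }
    }
    return idx;
}
```

With a chunk size of 1024, a row of any length needs roughly log2(n / 1024) merge passes after the initial per-chunk sort.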
Georgi Gerganov
2606b0adab metal : make the FA extra sizes consistent (#17143) b7057 2025-11-14 09:13:34 +02:00
ixgbe
307772fcda readme : add RVV,ZVFH,ZFH,ZICBOP support for RISC-V (#17259)
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-11-14 09:12:56 +02:00
Aleksander Grygier
f1bad23f88 Better UX for handling multiple attachments in WebUI (#17246) b7055 2025-11-14 01:19:08 +01:00
Alberto Cabrera Pérez
becc4816dd ggml-cpu: handle 3d tensors in repack mat_mul (#17241)
* ggml-cpu: handle 3d tensors in repack mul_mat

* Removed unnecessary branch, removed need for <algorithm>

* Fixed dst_ptr pointer in chunk + clang_format

* GGML_ASSERT to check wdata within bounds

* Accidental ggml.h inclusion

* Improved GGML_ASSERT on wdata boundaries

* Address performance regression in Qwen and llama.cpp due to chunking
b7054
2025-11-13 12:53:00 -08:00
Xuan-Son Nguyen
c4abcb2457 server: fixing naming conflict res_error (#17243) b7053 2025-11-13 20:53:47 +01:00
Piotr Wilkin (ilintar)
389ac78b26 ggml : add ops SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM (#17063)
* Add ops needed for new hybrid models: SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM

* Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Code review

* Whitespace

* Update tests/test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* This is actually sigmoid, duh.

* Add CONST, remove TRI_KEEP, other changes from review

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Remove extra script

* Update ggml/src/ggml.c

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* moving changes from laptop [no ci]

* pre-rebase

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Refactor tests

* ggml : cleanup

* cont : fix ggml_fill srcs

* tests : add note

* ggml : add ggml_fill_inplace

* ggml : add asserts

* ggml : fix ggml_fill constant cast

* cont : ggml_tri minor

* Use TENSOR_LOCALS

* Fix regression from #14596, regenerate

* Don't make commits at night...

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b7052
2025-11-13 20:54:47 +02:00
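For orientation, the mathematical operations behind some of the op names in this commit can be sketched as scalar reference functions. These follow the usual textbook definitions (softplus as log(1 + e^x), cumulative sum, lower-triangular solve by forward substitution); the exact tensor semantics, argument order and overflow threshold used by ggml are not stated in the log and are assumptions here, as are all function names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// SOFTPLUS: log(1 + exp(x)); for large x the result is x to float precision
// (the cutoff of 20 is an assumption chosen for this sketch).
float softplus_ref(float x) {
    return x > 20.0f ? x : std::log1p(std::exp(x));
}

// EXPM1: exp(x) - 1, computed accurately for small x.
float expm1_ref(float x) {
    return std::expm1(x);
}

// CUMSUM: running sum along a 1-D row.
std::vector<float> cumsum_ref(const std::vector<float> & x) {
    std::vector<float> out(x.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        acc += x[i];
        out[i] = acc;
    }
    return out;
}

// SOLVE_TRI (lower-triangular case): solve L * y = b by forward substitution,
// with L stored row-major as an n x n matrix.
std::vector<float> solve_lower_ref(const std::vector<float> & l,
                                   const std::vector<float> & b, std::size_t n) {
    assert(l.size() == n * n && b.size() == n);
    std::vector<float> y(n);
    for (std::size_t i = 0; i < n; ++i) {
        float s = b[i];
        for (std::size_t j = 0; j < i; ++j) s -= l[i * n + j] * y[j];
        y[i] = s / l[i * n + i];
    }
    return y;
}
```

Reference functions of this kind are mainly useful for checking a backend kernel against a known-good scalar result.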
Ruben Ortlam
a19bd6f7ce vulkan: remove shell call from vulkan-shaders-gen tool, revert file check (#17219)
* vulkan: remove shell call from vulkan-shaders-gen tool

* use string vector for command execution

* Fix condition

* use string, remove const_cast

* Fix dependency file quotation on Windows

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
b7051
2025-11-13 14:51:21 +01:00
Diego Devesa
dd091e52f8 sched : fix reserve ignoring user tensor assignments (#17232) b7050 2025-11-13 13:14:02 +01:00
ixgbe
1215dde7b0 ggml-cpu : add RISC-V vector intrinsic support for silu and cvar operations (#17227)
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
b7049
2025-11-13 13:13:32 +01:00
bagheera
0cfb19166b metal: accelerated conv2d (#17175)
* metal: accelerated conv2d

* cont : cleanup

---------

Co-authored-by: bghira <bghira@users.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b7048
2025-11-13 13:32:44 +02:00
Georgi Gerganov
2776db6c81 Revert "ggml-cpu: handle 3d tensors in repack mat_mul (#17030)" (#17233)
This reverts commit 1c398dc9ec.
b7047
2025-11-13 12:59:37 +02:00
Diego Devesa
879dec341a ggml-cpu : use template for argsort (#17222) b7046 2025-11-13 10:59:05 +02:00