* Add buffer label and enable dawn-specific toggles to turn off some checks
* Minor set_rows optimization (#4)
* updated optimization, fixed errors
* non vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines
---------
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Comment on dawn toggles
* Remove some comments
* Implement overlap binary operators
* Revert "Implement overlap binary operators"
This reverts commit ed710b36f5.
* Disable support for non-contiguous binary_op tensors and leave note for future support
---------
Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
* Fix test-conv2d-dw failure on ARM SVE by using runtime vector length
The ggml_compute_forward_conv_2d_dw_cwhn function was using a hardcoded GGML_F32_EPR (8) for SIMD vectorization, but on ARM SVE the actual vector length varies by hardware. This caused incorrect computation when processing CWHN layout tensors on ARM machines.
Fix by using svcntw() to get the runtime SVE vector length instead of the compile-time constant.
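
A minimal sketch of the idea, assuming a helper that picks the f32 lane count at runtime on SVE builds and falls back to a fixed width elsewhere. The names `f32_lanes`, `channel_split` and `split_channels` are hypothetical and are not taken from the ggml sources; the fallback value 8 only mirrors the constant mentioned above.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Hypothetical helper: number of f32 lanes per SIMD register.
// On SVE builds the width is only known at runtime, so it is queried with svcntw().
inline int64_t f32_lanes() {
#if defined(__ARM_FEATURE_SVE)
    return (int64_t) svcntw();
#else
    return 8; // assumed compile-time width, for illustration only
#endif
}

// Hypothetical helper: split a channel count into a vectorizable prefix
// [0, vec_end) and a scalar tail of `tail` elements.
struct channel_split {
    int64_t vec_end;
    int64_t tail;
};

inline channel_split split_channels(int64_t n_channels, int64_t lanes) {
    assert(lanes > 0);
    const int64_t vec_end = (n_channels / lanes) * lanes;
    return { vec_end, n_channels - vec_end };
}
```

If the loop bound were derived from a compile-time width that differs from the hardware width, the vectorized prefix and the scalar tail would no longer match the lanes actually loaded and stored per iteration.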
Co-authored-by: ggerganov <1991296+ggerganov@users.noreply.github.com>
* ci : reduce sam score threshold
* ci : update bbox checks for sam test
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: ggerganov <1991296+ggerganov@users.noreply.github.com>
* vulkan: remove the need for the dryrun
Allocate pipelines and descriptor sets when requested.
Reallocate the prealloc buffers when needed, and flush any pending work
before reallocating.
For rms_partials and total_mul_mat_bytes, use the sizes computed the last time
the graph was executed.
* remove dryrun parameters
* Fix garbled output with REPACK at high thread counts
Fixed a race condition in the REPACK matrix multiplication code that caused garbled output when using 26+ threads (the exact threshold is model-dependent). With high thread counts, the code forced the chunk count to equal the thread count, which created many small chunks. After these chunks were aligned to NB_COLS boundaries, adjacent chunks could overlap, causing data corruption and race conditions. The fix enforces a minimum chunk size based on NB_COLS and caps the maximum chunk count, so that chunks stay aligned without overlapping.
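
The constraint can be illustrated with a small sketch. This is not the code in repack.cpp; `make_aligned_chunks` and its parameters are hypothetical names. The sketch assumes that work is split in units of `nb_cols` and that the chunk count is capped at the number of such units, so every chunk boundary is already aligned and no two chunks can overlap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: split [0, n_rows) into at most n_threads half-open
// chunks [start, end) whose start offsets are multiples of nb_cols.
std::vector<std::pair<int64_t, int64_t>> make_aligned_chunks(
        int64_t n_rows, int64_t n_threads, int64_t nb_cols) {
    assert(n_rows > 0 && n_threads > 0 && nb_cols > 0);

    // Number of nb_cols-wide blocks; the last block may be partial.
    const int64_t n_blocks = (n_rows + nb_cols - 1) / nb_cols;

    // Cap the chunk count so that each chunk owns at least one block.
    const int64_t n_chunks = std::min(n_threads, n_blocks);

    std::vector<std::pair<int64_t, int64_t>> chunks;
    for (int64_t i = 0; i < n_chunks; ++i) {
        // Distribute whole blocks across chunks, then convert to row offsets.
        const int64_t b0 = i * n_blocks / n_chunks;
        const int64_t b1 = (i + 1) * n_blocks / n_chunks;
        chunks.emplace_back(b0 * nb_cols, std::min(b1 * nb_cols, n_rows));
    }
    return chunks;
}
```

Because boundaries are computed in block units before being scaled by `nb_cols`, there is no separate alignment step that could push one chunk's end past the next chunk's start.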
* Update ggml/src/ggml-cpu/repack.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/src/ggml-cpu/repack.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit modifies the script `run-org-model.py` to ensure that the
model configuration is explicitly passed to the `from_pretrained` method
when loading the model. It also removes a duplicate configuration
load that was added by mistake.
The motivation for this change is that it enables the config object to be
modified and then passed to the model loading function, which can be
useful when testing new models.
* SYCL repeat_back v1 — add core op + switch case
* Implement repeat_back SYCL operation and minor fixes
* SYCL: optimize repeat_back kernel
* Remove Hebrew comment from repeat_back.cpp
* Remove comments for code clarity
Removed comments to clean up the code.
* Fix formatting in ggml-sycl.cpp
* Formatted lambda according to legacy style. No logic changes
* Remove blank line in repeat_back.cpp
Remove unnecessary blank line before assigning acc to dst_dd.
* tests: fix segfault in moe-expert-reduce test in support mode and --show-coverage
* tests: init gf and filter out fusion tests for support mode
* tests: filter out fusion cases before calling eval_support
* tests: filter out fusion cases from show_test_coverage as well, fix lint
* clip : use FA
* cont : add warning about unsupported ops
* implement "auto" mode for clip flash attn
* clip : print more detailed op support info during warmup
* cont : remove obsolete comment [no ci]
* improve debugging message
* trailing space
* metal : remove stray return
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* server : support unified context across slots
* cont : fix speculative decoding initialization
* context : fix n_ctx_per_seq computation
* server : purge slots one by one
* tests : add unified cache server tests
* llama : update per-seq context computation
* test-thread-safety : handle tiny training context of the input model
* server : fix server_tokens clear()
* server : use 4 slots + unified KV by default
* llama : add note about context size queries
* cont : update todos [no ci]
* context : do not cap the size of the context
* tests : adjust parameters to be CI friendlier
* context : add warning
Commit 5fb5e24811 (llama : minor
sampling refactor (2) (#9386)) moved the llama_sampler_accept call
into llama_sampler_sample, but the sample usage for sampling in llama.h
was not updated accordingly.
* webui: auto-refresh /props on inference start to resync model metadata
- Add no-cache headers to /props and /slots
- Throttle slot checks to 30s
- Prevent concurrent fetches with promise guard
- Trigger refresh from chat streaming for legacy and ModelSelector
- Show dynamic serverWarning when using cached data
* fix: restore proper legacy behavior in webui by using unified /props refresh
Updated assistant message bubbles to show each message's stored model when available,
falling back to the current server model only when the per-message value is missing.
When the model selector is disabled, the webui now fetches /props and prioritizes that model name
over chunk metadata, then persists it with the streamed message so that legacy mode properly
reflects the backend configuration.
* fix: detect first valid SSE chunk and refresh server props once
* fix: removed the slots availability throttle constant and state
* webui: purge ai-generated cruft
* chore: update webui static build
* webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe dialog
Extended MarkdownContent to flag previewable code languages,
add a preview button alongside copy controls, manage preview
dialog state, and share styling for the new button group.
Introduced CodePreviewDialog.svelte, a sandboxed iframe modal
for rendering HTML/JS previews with consistent dialog controls.
* webui: fullscreen HTML preview dialog using bits-ui
* Update tools/server/webui/src/lib/components/app/misc/CodePreviewDialog.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/components/app/misc/MarkdownContent.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: pedantic style tweak for CodePreviewDialog close button
* webui: remove overengineered preview language logic
* chore: update webui static build
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: recognize AsciiDoc files as valid text files
* webui: add an updated static webui build
* webui: add the updated dependency list
* webui: re-add an updated static webui build
This also reverts commit 742dbb8379.
* vulkan: fuse mul_mat+add and mul_mat_id+add_id
The fusion is only applied for the mat-vec mul paths.
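
As a plain CPU illustration of what fusing a matrix-vector product with a following add means (this is not the Vulkan shader code, and `mat_vec_add_fused` is a hypothetical name), the bias can be folded into the same pass that produces each output element, so no intermediate result vector is written and re-read:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: y = W * x + bias computed in a single pass.
// W is stored row-major with shape rows x cols.
std::vector<float> mat_vec_add_fused(const std::vector<float> & w,
                                     const std::vector<float> & x,
                                     const std::vector<float> & bias,
                                     size_t rows, size_t cols) {
    assert(w.size() == rows * cols);
    assert(x.size() == cols);
    assert(bias.size() == rows);

    std::vector<float> y(rows);
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; ++c) {
            acc += w[r * cols + c] * x[c];
        }
        // The add is applied while the dot product is still in a register.
        y[r] = acc + bias[r];
    }
    return y;
}
```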
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix 32b build
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CUDA: Remove unneeded bias/gate dims in fused mmvq
It was pointed out
[here](https://github.com/ggml-org/llama.cpp/pull/16847#discussion_r2476798989)
that only a single value is needed per target column per thread.
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Fix "Error 991-D: extra braces are nonstandard" during compilation
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>