251 Commits

Author SHA1 Message Date
Georgi Gerganov
df1b612e29 server : add /v1/health endpoint (#16461)
* server : add /v1/health endpoint

* cont : update readme
2025-10-07 15:57:14 +03:00
Sascha Rogmann
4e0388aa8a webui : added download action (#13552) (#16282)
* webui : added download action (#13552)

* webui : import and export (for all conversations)

* webui : fixed download-format, import of one conversation

* webui : add ExportedConversations type for chat import/export

* feat: Update naming & order

* chore: Linting

* webui : Updated static build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2025-10-07 11:11:08 +02:00
Radoslav Gerganov
c61ae20d05 rpc : update documentation (#16441)
Update the README file to match the newly added functionality of
exposing multiple devices from a single server.

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-10-07 06:59:13 +00:00
Gadflyii
3df2244df4 llama : add --no-host to disable host buffers (#16310)
* implement --no-host to disable host buffer

* fix equal_mparams

* move no-host enumeration order together with other model params

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-10-06 19:55:53 +02:00
Gabe Goodhart
c08002a198 chat : Granite Docling stopping (#16438)
* fix: Fix duplicate fake image before token on first slice

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use double-newline before overview image

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove incorrect newline at the end of granite chat template gen prompt

There should not be one, even for the language models.

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* tests: Remove bad newline from granite chat template test (legacy)

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-10-06 18:59:40 +02:00
Oleksandr Kuvshynov
c5fef0fcea server: update readme to mention n_past_max metric (#16436)
https://github.com/ggml-org/llama.cpp/pull/15361 added new metric
exported, but I've missed this doc.
2025-10-06 10:53:31 +03:00
Gabe Goodhart
ca71fb9b36 model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206)
* feat: Add granite-docling conversion using trillion pretokenizer

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add granite-docling vocab pre enum

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use granite-docling pre

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add clip_is_idefics3

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Allow multi-token boundary sequences for image templating

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add tiling support for idefices3 in clip.cpp

This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Partial support for full templating for idefics3 in mtmd

There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Fully working image preprocessing for idefics3 w/ resize and slicing

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use the longest side instead of size * scale_factor

For Granite Docling, these come out to the same value, but that was just a
conicidence.

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Allow batch encoding and remove clip_is_idefics3

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove unnecessary conditionals for empty token vectors

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use image_manipulation util

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* add test model

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-10-05 14:57:47 +02:00
Radoslav Gerganov
898acba681 rpc : add support for multiple devices (#16276)
* rpc : add support for multiple devices

Allow rpc-server to expose multiple devices from a single endpoint.
Change RPC protocol to include device identifier where needed.

closes: #15210

* fixes

* use ggml_backend_reg_t

* address review comments

* fix llama-bench backend report

* address review comments, change device naming

* fix cmd order
2025-10-04 12:49:16 +03:00
ddh0
f6dcda3900 server : context checkpointing for hybrid and recurrent models (#16382)
* initial commit for branch 3

* generalize `swa_checkpoint` to `ctx_checkpoint`

this extends `llama-server`'s SWA checkpointing logic to include
hybrid/recurrent models such as Jamba, Granite

* oops

* disable debug prints

* keep backwards compat with `--swa-checkpoints`

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update prompt re-processing message

* fix off-by-one error per GG

* keep `seq_rm` log per GG

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server : fix checkpoint logic to support recurrent caches

* server : cleanup and fixes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-03 21:34:51 +03:00
Aleksander Grygier
84c8e305e8 Fix missing messages on sibling navigation (#16408)
* fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs

* chore: update webui build output

* chore: update webui build output
2025-10-03 12:51:40 +02:00
Aleksander Grygier
77233277c9 Capture model name only after first token (streaming) or completed request (#16405)
* feat: Capture model name only after first token (streaming) or completed request (non-streaming)

* chore: update webui build output

* chore: update webui build output
2025-10-03 11:30:39 +02:00
Aleksander Grygier
136bda78c5 webui : Fix messages payload sent to chat completions (#16402)
* fix: Include just the currently active message branches instead of all in chat completions request

* chore: Build webui static output

* chore: Formatting

* chore: update webui build output
2025-10-03 10:11:34 +03:00
Pascal
5113efd34c fix: track viewportHeight via window.innerHeight to avoid unwanted scrolling (#16356)
Use <svelte:window bind:innerHeight> instead of manual resize listener

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2025-10-03 08:01:31 +02:00
Piotr Wilkin (ilintar)
34fcc5a4ac model : Apertus model implementation (#15852)
* First attempt

* No permute during convert (fixes qk tensors), proper norm application.

* RoPE = NeoX

* Coherence!

* Migrate xielu params from tensors to hyperparameters

* Simple CUDA kernel

* Revert stupid LLM refactorings

* Chat template support

* configchecker / flake8 errors

* Reorder unary.cu

* I do conclude that LLMs are, in fact, stupid.

* Fix after merge

* Final newline

* Make xIELU an UNARY_OP

* Final newline

* Correctly account for parameter shift

* Argh.

* Update ggml/src/ggml-cpu/unary-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Refactor: remove unused methods, inline and factorize softplus, add const modifiers

* Revert CUDA changes, implement xIELU as a separate OP

* Pesky newline

* Add float2half / half2float for F16 inputs/outputs

* CUDA variants, attempt 2

* Actually, attempt 3

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Missing convert header

* Proper formula and reference for xIELU in the comments.

* Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Add tensor mappings for Apertus to global list instead

* Fix lazy on scalars

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Add comment about the constraints on positive/negative alpha

* Change `softplus` to `ggml_softplus`

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-02 20:43:22 +03:00
Adrien Gallouët
4201deae9c common: introduce http.h for httplib-based client (#16373)
* common: introduce http.h for httplib-based client

This change moves cpp-httplib based URL parsing and client setup into
a new header `common/http.h`, and integrates it in `arg.cpp` and `run.cpp`.

It is an iteration towards removing libcurl, while intentionally
minimizing changes to existing code to guarantee the same behavior when
`LLAMA_CURL` is used.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* tools : add missing WIN32_LEAN_AND_MEAN

Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>
2025-10-01 20:22:18 +03:00
Aleksander Grygier
764799279f Conversation action dialogs as singletons from Chat Sidebar + apply conditional rendering for Actions Dropdown for Chat Conversation Items (#16369)
* fix: Render Conversation action dialogs as singletons from Chat Sidebar level

* chore: update webui build output

* fix: Render Actions Dropdown conditionally only when user hovers conversation item + remove unused markup

* chore: Update webui static build

* fix: Always truncate conversation names

* chore: Update webui static build
2025-10-01 18:18:10 +02:00
Aleksander Grygier
2a9b63383a Improve code block color theming (#16325)
* feat: Improve code block theming

* chore: update webui build output

* chore: Update webui static build
2025-10-01 15:54:42 +02:00
Aleksander Grygier
4f1575921c Add optional setting for showing "Model used:" information (#16337)
* feat: Add a setting to include model name used to generate the message

* feat: UI improvements

* feat: Save model info along with the database message entry creation

* chore: Build webui static output
2025-10-01 12:08:16 +02:00
Aleksander Grygier
aa9538a63a webui: Remove running llama-server within WebUI dev.sh script (#16363) 2025-10-01 08:40:26 +03:00
Pascal
16b0ca0d2e Chatapi ignore empty sampling (#16330)
* fix: skip empty sampling fields instead of coercing to 0 in chat API options

* chore: update webui build output
2025-09-30 19:18:54 +02:00
Pascal
5f7e166cbf Fix thinking blocks with quotes + add handling [THINK]...[/THINK] blocks (#16326)
* fix: prevent reasoning blocks with quotes from being truncated

* chore: update webui build output

* feat: Improve thinking content parsing

* test: Adds ChatMessage component stories for different thinking blocks

* chore: update webui build output

* fix: ChatMessage story fix

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2025-09-29 18:49:47 +02:00
Aleksander Grygier
3a2bdcda0b Improve Mobile UI for dialogs and action dropdowns (#16222)
* fix: Always show conversation item actions

* feat: Improve Alert Dialog and Dialog mobile UI

* feat: Add settings reset to default confirmation

* fix: Close Edit dialog on save

* chore: update webui build output

* webui: implement proper z-index system and scroll management

- Add CSS variable for centralized z-index control
- Fix dropdown positioning with Settings dialog conflicts
- Prevent external scroll interference with proper event handling
- Clean up hardcoded z-index values for maintainable architecture

* webui: ensured the settings dialog enforces dynamic viewport height on mobile while retaining existing desktop sizing overrides

* feat: Use `dvh` instead of computed px height for dialogs max height on mobile

* chore: update webui build output

* feat: Improve Settings fields UI

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2025-09-29 10:37:20 +02:00
Pascal
66bb7985c3 fix: preserved zero values in chat settings inputs and textareas by switching to nullish coalescing for field values and default placeholders (#16312) 2025-09-29 09:08:41 +02:00
Vinkal
2f61c0f5bf llama-cli: prevent spurious assistant token (#16202)
* tools/main: llama-cli: prevent spurious assistant token (#13402)

During prompt ingestion, prompt tokens are accepted into the sampler history (for repetition penalties). The conversation-mode path then appended `common_sampler_last(smpl)` to `assistant_ss` before any new token was sampled. At that point, "last" was a prompt-side token (e.g., an input prefix), so the assistant chat message began with an extra piece.

Fix: append to `assistant_ss` only for a newly sampled (non-EOG) token. This affects only chat message assembly (`assistant_ss` / `chat_msgs` / `common_chat_format_single`); terminal stdout is unchanged. Sampling order/logits are unchanged.

Fixes #13402.

Signed-off-by: Vinkal Chudgar <vinkal.chudgar@gmail.com>

* Update tools/main/main.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* tools/main: remove outdated comment

Signed-off-by: Vinkal Chudgar <vinkal.chudgar@gmail.com>

---------

Signed-off-by: Vinkal Chudgar <vinkal.chudgar@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-29 10:03:12 +03:00
ddh0
3ffd0fae47 perplexity : show more kl-divergence data (#16321)
Adds additional percentile data for displayed in the output of `llama-perplexity --kl-divergence`:
- Added 95 percentile (mirroring existing 5 percentile)
- Added 0.1 percentile (mirroring existing 99.9 percentile)
2025-09-29 09:30:45 +03:00
Imad Saddik
2811c65286 Fixed a few typos in the README of the LLaMA.cpp HTTP Server [no ci] (#16297) 2025-09-28 13:04:46 +02:00
Aleksander Grygier
4807e8f96a Show message actions by default (#16289) 2025-09-27 19:56:40 +02:00
Adrien Gallouët
234e2ff8ed server : remove old LLAMA_SERVER_SSL (#16290)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-27 19:17:08 +03:00
Aleksander Grygier
807e8c6d31 Enhance text file detection logic for file attachments (#16199)
* feat: Enhances text file detection logic

* chore: Build static `webui` output

* chore: update webui build output
2025-09-26 19:25:29 +02:00
Aleksander Grygier
1a18927894 Allow viewing conversations even when llama server is down (#16255)
* webui: allow viewing conversations and sending messages even if llama-server is down

- Cached llama.cpp server properties in browser localStorage on startup, persisting successful fetches and reloading them when refresh attempts fail so the chat UI continues to render while the backend is unavailable.
- Cleared the stored server properties when resetting the store to prevent stale capability data after cache-backed operation.
- Kept the original error-splash behavior when no cached props exist so fresh installs still surface a clear failure state instead of rendering stale data.

* feat: Add UI for `props` endpoint unavailable + cleanup logic

* webui: extend cached props fallback to offline errors

Treat connection failures (refused, DNS, timeout, fetch) the same way as
server 5xx so the warning banner shows up when cache is available, instead
of falling back to a full error screen.

* webui: Left the chat form enabled when a server warning is present so operators can keep sending messages

e.g., to restart the backend over llama-swap, even while cached /props data is in use

* chore: update webui build output

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2025-09-26 18:35:42 +02:00
Isaac McFadyen
e0539eb6ae webui: switch to hash-based routing (alternative of #16079) (#16157)
* Switched web UI to hash-based routing

* Added hash to missed goto function call

* Removed outdated SPA handling code

* Fixed broken sidebar home link
2025-09-26 18:36:48 +03:00
Aleksander Grygier
5d0a40f390 Always show message actions for mobile UI + improvements for user message sizing (#16076) 2025-09-26 15:59:07 +02:00
Aleksei Nikiforov
cc1cfa277b mtmd : fix uninitialized variable in bicubic_resize (#16275)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-26 15:00:44 +02:00
Daniel Bevenius
d0991da39d server : add support for external server for tests (#16243)
This commit adds support for using an externally started llama-server
instance for the server tests. This can be enabled by setting the
DEBUG_EXTERNAL environment variable.

The motivation for this is to allow debugging of the server itself
when investigating a test failure. Instructions for how to do this are
added to the README.md file in the tests directory.
2025-09-25 11:36:47 +02:00
Douglas Hanley
b5bd037832 llama : add support for qwen3 reranker (#15824) 2025-09-25 11:53:09 +03:00
Johannes Gäßler
e789095502 llama: print memory breakdown on exit (#15860)
* llama: print memory breakdown on exit
2025-09-24 16:53:48 +02:00
Quentin Bramas
138c87ce8b webui : fix handling incomplete chunks (#16107) 2025-09-22 11:53:13 +03:00
Georgi Gerganov
1d660d2fae ci : use smaller model (#16168)
* ci : switch from gemma to qwen3 0.6b

* ci : use smaller model for some tests
2025-09-22 09:11:39 +03:00
Georgi Gerganov
4d0a7cbc61 ci : adjust params for less runtime (#16167)
* ci : adjust params for less runtime

* ci : gate BF16 on some hardware

* ci : move extra tests to Arm runner
2025-09-22 08:31:40 +03:00
Benni
459c0c2c1a server: fix SSE and OpenAI compatibility for error messages when streaming (#16109)
* server: fix SSE and OpenAI compatibility for error messages when streaming

* server: remove obsolete event parameter and use required data fieldname instead
2025-09-20 07:56:30 +02:00
ssweens
be79d9fdd9 llama-bench: add --devices and --list-devices support (#16039)
* * llama-bench: add --devices support
- Support --devices same as llama-server
- Provide for benchmarking different device combinations
- Include --list-devices like llama-server for convenience

* fix: field display ordering restored

* fix: integrated the rpc devices
- aimed to mimic the server as much as possible

* cleanup: defaults for list-devices
- handle dup device listing with RPC

* cleanup: remove dup device load calls

* docs: update llama-bench
- added the recently added n-cpu-moe option to the docs while in there

* llama-bench: rpc device simplification
* rpc servers unify with other devices earlier, simplifying code
* --list-devices made stateless and simpler
* various cleanup
2025-09-20 00:15:21 +02:00
Aleksander Grygier
4067f07fc5 feat: Improve mobile UI for Settings Dialog (#16084)
* feat: Improve mobile UI for Settings Dialog

* chore: update webui build output

* fix: Linting errors

* chore: update webui build output
2025-09-19 09:52:27 +02:00
Radoslav Gerganov
2b6b55a59f server : include usage statistics only when user request them (#16052)
* server : include usage statistics only when user request them

When serving the OpenAI compatible API, we should check if
{"stream_options": {"include_usage": true} is set in the request when
deciding whether we should send usage statistics

closes: #16048

* add unit test
2025-09-18 10:36:57 +00:00
Aleksander Grygier
a7a98e0fff SvelteKit-based WebUI (#14839) 2025-09-17 19:29:13 +02:00
jacekpoplawski
8ff206097c llama-bench: add --n-cpu-moe support (#15952)
* llama-bench: add --n-cpu-moe support

Support --n-cpu-moe in llama-bench the same way it is supported by
llama-server.
2025-09-16 16:17:08 +02:00
Yuri Khrustalev
07808ebb07 cmake : Do not install tools on iOS targets (#15903) 2025-09-16 09:54:44 +07:00
Nikolay Popov
28c39da7c6 llama-run: Fix model download on Windows (#15988)
* llama-run: Fix model download on Windows
 * fix SSL error (SSL peer certificate or SSH remote key was not OK)
 * fix program crash on std::filesystem::rename

* llama-run: create a separate method to utilize RAII

* llama-run: handle rename exception
2025-09-15 11:08:30 +01:00
ddh0
a68f31edd7 fix KLD percentile output (#15999)
In `llama-perplexity`, when using `--kl-divergence`, the KL divergence statistics output mistakenly displays the 99th percentile twice. This change fixes that and correctly displays the 90th percentile as originally intended (presumably).
2025-09-15 09:54:57 +02:00
Sigbjørn Skjæret
6c019cb04e server : only attempt to enable thinking if using jinja (#15967) 2025-09-14 21:17:04 +02:00
Radoslav Gerganov
918b26f197 rpc : fix regression when --device is used (#15981)
Fix regression introduced with commit 50f4281a6
2025-09-14 12:28:18 +03:00