llama.cpp/tools/server/utils.hpp at gg/graph-mamba-reuse

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-11-05 09:36:52 +00:00

Georgi Gerganov cd5e3b5754 server : support unified cache across slots (#16736)

* server : support unified context across slots

* cont : fix speculative decoding initialization

* context : fix n_ctx_per_seq computation

* server : purge slots one by one

* tests : add unified cache server tests

* llama : update per-seq context computation

* test-thread-safety : handle tiny training context of the input model

* server : fix server_tokens clear()

* server : use 4 slots + unified KV by default

* llama : add note about context size queries

* cont : update todos [no ci]

* context : do not cap the size of the context

* tests : adjust parameters to be CI friendlier

* context : add warning

2025-11-02 18:14:04 +02:00

57 KiB
