	kv-cache : avoid modifying recurrent cells when setting inputs (#13834)
* kv-cache : avoid modifying recurrent cells when setting inputs

* kv-cache : remove inp_s_mask

  It was replaced with equivalent and simpler functionality with rs_z
  (the first zeroed state) and the already-existing inp_s_copy.
  (A toy illustration of this follows the list.)

* kv-cache : fix non-consecutive token pos warning for recurrent models

  The problem was apparently caused by how the tail cells were swapped.

* graph : simplify logic for recurrent state copies

* kv-cache : use cell without src refs for rs_z in recurrent cache

* llama-graph : fix recurrent state copy

  The `state_copy` shuffle assumes everything is moved at once, which is not
  true when `states_extra` is copied back to the cache before copying the
  range of states between `head` and `head + n_seqs`. This is only a problem
  if any of the cells in [`head`, `head + n_seqs`) have an `src` in
  [`head + n_seqs`, `head + n_kv`), which does happen when `n_ubatch > 1` in
  the `llama-parallel` example. Changing the order of the operations avoids
  the potential overwrite before use, although when copies are avoided (like
  with Mamba2), this will require further changes. (See the second sketch
  after this list for a toy model of the ordering issue.)

* llama-graph : rename n_state to state_size in build_recurrent_state

  This naming should reduce confusion between the state size and the number
  of states.
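A rough sketch of the `inp_s_mask` removal described above. This is not the actual ggml graph code: the cache is modelled as a plain array of integer "states", the gather as a loop, and the cell indices and values are made up for illustration. The idea is that clearing a sequence's state is expressed through the copy sources rather than through a separate mask tensor.

```cpp
// Toy illustration of replacing an explicit state mask with rs_z:
// instead of multiplying each state by 0/1, a cell that must start from a
// cleared state simply takes rs_z (a cell whose state is zero) as its copy
// source, and the normal inp_s_copy-style gather does the rest.
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical cache: 4 cells, cell 3 is known to hold a zeroed state.
    std::vector<int> states = {7, 8, 9, 0};
    const int rs_z = 3; // index of the first zeroed state

    // Cells 0 and 2 keep their own state; cell 1 must be cleared, so its
    // copy source points at rs_z instead of being zeroed through a mask.
    std::vector<int> src = {0, rs_z, 2, rs_z};

    // The already-existing copy input now also performs the clearing.
    std::vector<int> out(src.size());
    for (size_t i = 0; i < src.size(); ++i) {
        out[i] = states[src[i]];
    }

    for (size_t i = 0; i < out.size(); ++i) {
        printf("cell %zu -> %d\n", i, out[i]);
    }
    return 0;
}
```

The ordering issue described under `llama-graph : fix recurrent state copy` is easiest to see with a similar toy model. Again, the sketch below is not the real graph code: gathers and write-backs are plain loops, and the values of `head`, `n_seqs`, `n_kv` and `src` are invented so that a cell in [`head`, `head + n_seqs`) has its `src` in the "extra" range.

```cpp
// Toy model of the recurrent state shuffle (hypothetical helper names;
// the real code builds ggml tensor copies instead of these loops).
#include <cstdio>
#include <vector>

// Read states[src[i]] for cells [first, first + count) into a new buffer,
// all at once (like a get_rows gather).
static std::vector<int> gather(const std::vector<int> & states,
                               const std::vector<int> & src,
                               int first, int count) {
    std::vector<int> out(count);
    for (int i = 0; i < count; ++i) {
        out[i] = states[src[first + i]];
    }
    return out;
}

int main() {
    const int head   = 0; // start of the cells used by this ubatch
    const int n_seqs = 2; // cells [head, head + n_seqs) feed the computation
    const int n_kv   = 4; // cells [head, head + n_kv) take part in the shuffle

    // Cell 0 wants the state of cell 2, which lives in the "extra" range
    // [head + n_seqs, head + n_kv) -- exactly the problematic case above.
    const std::vector<int> src     = {2, 3, 0, 1};
    const std::vector<int> initial = {10, 11, 12, 13};

    // Buggy order: gather the extra states, write them back into the cache,
    // and only then read the sources of the [head, head + n_seqs) range.
    {
        std::vector<int> cache = initial;
        std::vector<int> extra = gather(cache, src, head + n_seqs, n_kv - n_seqs);
        for (int i = 0; i < n_kv - n_seqs; ++i) {
            cache[head + n_seqs + i] = extra[i]; // overwrites cells 2 and 3
        }
        std::vector<int> used = gather(cache, src, head, n_seqs);
        printf("buggy order: cell %d reads %d (expected 12)\n", head, used[0]);
    }

    // Fixed order: read the [head, head + n_seqs) range while its sources
    // are still intact, then write the extra states back.
    {
        std::vector<int> cache = initial;
        std::vector<int> extra = gather(cache, src, head + n_seqs, n_kv - n_seqs);
        std::vector<int> used  = gather(cache, src, head, n_seqs);
        for (int i = 0; i < n_kv - n_seqs; ++i) {
            cache[head + n_seqs + i] = extra[i];
        }
        printf("fixed order: cell %d reads %d (expected 12)\n", head, used[0]);
    }

    return 0;
}
```

As the commit message notes, reordering the copies only sidesteps the problem for the copy-based shuffle; once copies are avoided entirely (as with Mamba2-style states), the ordering constraint will need to be revisited.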
```diff
@@ -512,8 +512,6 @@ int32_t llama_kv_cache_unified::find_slot(const llama_ubatch & ubatch) const {
         head_cur = 0;
     }
 
-    // otherwise, one cell per token.
-
     if (n_tokens > cells.size()) {
         LLAMA_LOG_ERROR("%s: n_tokens = %d > size = %u\n", __func__, n_tokens, cells.size());
         return -1;
```
compilade