Commit Graph

11 Commits

Author SHA1 Message Date
Georgi Gerganov
59fee24c72 recurrent : rework graph inputs + add TODOs
ggml-ci
2025-06-18 09:29:51 +03:00
Gabe Goodhart
5046d412ef fix: Fix initialization of child states
Since initially writing this PR, the logic in the child state types changed
such that using the "init full" signature and keeping the ubatches on the
parent struct no longer worked.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:19 -06:00
Gabe Goodhart
4ec4e6a801 refactor: Use llama_memory_state_ptr for child states in hybrid memory state
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:19 -06:00
Gabe Goodhart
1510016ea4 fix: Remove logits_all after rebase
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:19 -06:00
Gabe Goodhart
d8c929ff5d feat: Allow custom layer filters for hybrid recurrent
This should help support architectures like Falcon H1 where there is
overlap between layers that need attention and recurrent caches.

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:19 -06:00
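
Note on d8c929ff5d: the idea of per-cache layer filters can be sketched as follows. This is a minimal illustration, not the code from the commit. The names `layer_filter_cb`, `hybrid_filters`, `default_filters`, `overlapping_filters` and `select_layers` are hypothetical and chosen only for this sketch. It assumes a filter is a callback that answers "does layer `il` belong in this cache?", so the attention and recurrent caches can each claim their own set of layers, and those sets may overlap.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical alias: returns true if layer `il` should be stored in a cache.
using layer_filter_cb = std::function<bool(int32_t il)>;

// One filter per child cache (assumed layout, for illustration only).
struct hybrid_filters {
    layer_filter_cb attn; // layers that need the attention cache
    layer_filter_cb recr; // layers that need the recurrent cache
};

// Default behaviour: each layer uses exactly one cache, chosen by a per-layer flag.
inline hybrid_filters default_filters(const std::vector<bool> & is_recurrent) {
    return {
        [is_recurrent](int32_t il) { return !is_recurrent[static_cast<std::size_t>(il)]; },
        [is_recurrent](int32_t il) { return  is_recurrent[static_cast<std::size_t>(il)]; },
    };
}

// Custom behaviour: every layer is claimed by both caches, so the sets overlap.
inline hybrid_filters overlapping_filters() {
    return {
        [](int32_t) { return true; },
        [](int32_t) { return true; },
    };
}

// Collect the layer indices in [0, n_layer) that a filter accepts.
inline std::vector<int32_t> select_layers(int32_t n_layer, const layer_filter_cb & filter) {
    std::vector<int32_t> out;
    for (int32_t il = 0; il < n_layer; ++il) {
        if (filter(il)) {
            out.push_back(il);
        }
    }
    return out;
}
```

With the default filters the two selected sets are disjoint; with custom filters a layer may appear in both, which is the overlapping case the commit message mentions for Falcon H1.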
Gabe Goodhart
9c1a604af8 fix: Update clear signature for data argument after rebase
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:18 -06:00
Gabe Goodhart
911e694476 fix: Fix status for init_update sig for recurrent cache state
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:18 -06:00
Gabe Goodhart
d3699366e6 fix: Update recurrent cache for changes to remove intermediate kv_cache interface
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:18 -06:00
Gabe Goodhart
cf03d4ae5c fix: Fix shift logic to defer to unified cache
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:18 -06:00
Gabe Goodhart
6c6ec0003a fix: Fix wrong bool condition for split equal in hybrid cache
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:18 -06:00
Gabe Goodhart
c71eaa37a0 feat: First pass at llama_kv_cache_hybrid_recurrent
This follows the pattern in iswa where the two child caches are held
explicitly to support the case where a model requires a single attention
cache and a single recurrent cache where each layer uses exactly one of the
caches.

This is a rewrite of the more generic approach in the original hybrid cache
PR: https://github.com/ggml-org/llama.cpp/pull/13276

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:18 -06:00
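
Note on c71eaa37a0: the pattern described in the message, a parent cache that explicitly holds one attention child and one recurrent child, with each layer routed to exactly one of them, can be sketched roughly as below. This is a toy model under stated assumptions, not the llama.cpp implementation. `toy_memory`, `toy_cache` and `toy_hybrid_cache` are hypothetical names, and the real interface has many more operations than `clear` and `store`.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a memory interface; not llama.cpp's real one.
struct toy_memory {
    virtual ~toy_memory() = default;
    virtual void clear() = 0;
    virtual void store(int32_t il) = 0;
    virtual std::size_t size() const = 0;
};

// Toy child cache that only records which layers wrote to it.
struct toy_cache : toy_memory {
    std::vector<int32_t> layers;
    void clear() override { layers.clear(); }
    void store(int32_t il) override { layers.push_back(il); }
    std::size_t size() const override { return layers.size(); }
};

// Parent cache: owns both children explicitly and forwards each call.
class toy_hybrid_cache : public toy_memory {
public:
    explicit toy_hybrid_cache(std::vector<bool> is_recurrent)
        : is_recurrent_(std::move(is_recurrent)),
          attn_(std::make_unique<toy_cache>()),
          recr_(std::make_unique<toy_cache>()) {}

    // Whole-cache operations are applied to both children.
    void clear() override { attn_->clear(); recr_->clear(); }

    // Per-layer operations go to exactly one child.
    void store(int32_t il) override { child_for(il).store(il); }

    std::size_t size() const override { return attn_->size() + recr_->size(); }

    toy_cache & attn() { return *attn_; }
    toy_cache & recr() { return *recr_; }

private:
    toy_cache & child_for(int32_t il) {
        return is_recurrent_[static_cast<std::size_t>(il)] ? *recr_ : *attn_;
    }

    std::vector<bool> is_recurrent_;   // per-layer flag: true means recurrent
    std::unique_ptr<toy_cache> attn_;  // single attention child
    std::unique_ptr<toy_cache> recr_;  // single recurrent child
};
```

Holding the two children as named members, rather than in a generic list of caches, keeps the routing logic trivial. The trade-off is that the parent supports only this one attention-plus-recurrent combination.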