ba42794c9e  graph : fix equal_seq() check (#14986)
    Georgi Gerganov, 2025-08-01 06:38:12 +03:00

ca0ef2dddb  llama : clarify comment about pp and tg graphs [no ci] (#14895)
    Daniel Bevenius, 2025-07-27 12:10:51 +02:00
    * llama : clarify comment about pp and tg graphs [no ci]
      This commit clarifies the comment in `llama-context.cpp` regarding the
      prefill prompt (pp) and token generation (tg) graphs. The motivation is
      that I've struggled to remember these and had to look them up more than
      once, so a comment making clear what they stand for should be helpful.
    * squash! llama : clarify comment about pp and tg graphs [no ci]
      Change "pp" to "prompt processing".
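For reference, a minimal sketch of the distinction that comment documents, using hypothetical names and values (not taken from llama-context.cpp): the prompt processing (pp) graph is reserved for a full micro-batch of tokens, while the token generation (tg) graph handles one token per active sequence.

    #include <algorithm>
    #include <cstdint>

    // Hypothetical illustration of the pp/tg worst-case reservation shapes.
    struct reserve_shapes {
        uint32_t pp_tokens; // prompt processing: a full micro-batch of tokens
        uint32_t tg_tokens; // token generation: one token per active sequence
    };

    static reserve_shapes worst_case(uint32_t n_ubatch, uint32_t n_seqs) {
        return { std::max(n_ubatch, n_seqs), n_seqs };
    }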
						 
				 
			
				
					
						
							
							
c1dbea752a  context : restore preemptive sched reset when LLAMA_SET_ROWS=0 (#14870)
    Georgi Gerganov, 2025-07-25 14:28:06 +03:00

e4868d16d2  context : perform output reorder lazily upon access after sync (#14853)
    Georgi Gerganov, 2025-07-24 16:31:48 +03:00
    * context : perform output reorder lazily upon access after sync
    * cont : add TODO

d498af3d5a  graph : avoid huge warm-up graphs for MoE models (#14753)
    Georgi Gerganov, 2025-07-18 14:31:15 +03:00
    * graph : avoid huge warm-up graphs for MoE models
    * cont : bump max nodes to 8x model tensors

8f974bc1e9  graph : refactor context to not pass gf explicitly (#14629)
    Georgi Gerganov, 2025-07-18 08:29:28 +03:00

01612b7409  llama : reuse compute graphs (#14482)
    Georgi Gerganov, 2025-07-17 19:08:33 +03:00
    * llama : reuse compute graphs
    * llama-bench : add graph reuse parameter
    * cont : remove the parameter and the sched resets
    * graph : rename update() to can_reuse()
    * params : remove is_same()
    * graph : set res->params in llm_graph_context constructor
    * graph : avoid set_max_nodes in llm_graph_result
    * kv-cache : reuse llama_context's graph result instance
    * context : reset the previous graph result upon memory updates
    * batch : llama_ubatch now carries its data instead of pointing to balloc
    * merge : fix build
    * graph : fix can_reuse() checks when flash-attention is disabled
    * graph : move llm_graph_result impl in source file + debug env
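A rough sketch of the reuse check described above, with hypothetical types (the actual llm_graph_result::can_reuse() lives in the llama.cpp graph sources): a graph result remembers the parameters it was built with and is reused only when the next micro-batch would produce the same topology.

    #include <cstdint>

    // Hypothetical stand-ins for the graph-reuse bookkeeping described above.
    struct gf_params {
        uint32_t n_tokens;
        uint32_t n_seqs;
        uint32_t n_outputs;
        bool     causal;

        bool operator==(const gf_params & other) const {
            return n_tokens  == other.n_tokens  &&
                   n_seqs    == other.n_seqs    &&
                   n_outputs == other.n_outputs &&
                   causal    == other.causal;
        }
    };

    struct graph_result {
        gf_params params; // parameters the existing graph was built with

        // reuse the graph only if the next ubatch matches them exactly
        bool can_reuse(const gf_params & next) const {
            return params == next;
        }
    };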
						 
				 
			
				
					
						
							
							
225e7a1438  llama : add high-throughput mode (#14363)
    Georgi Gerganov, 2025-07-16 16:35:42 +03:00
    * kv-cache : prepare K/V buffers for separation
    * batched-bench : fix oob write
    * llama : add "virtual sequences"
    * llama : use "stream" vs "virtual sequence"
    * graph : fix stream splitting when KV cache is not used
    * kv-cache : add multi-stream save/load support
    * llama : add "--attn-streams" flag
    * kv-cache : fix handling when find_slot fails
    * kv-cache : restore find_slot impl
    * kv-cache : add comments
    * kv-cache : add bounds checks for sequence id
    * cont : add n_seq_max to batch allocr
    * kv-cache : perform stream copies lazily after llama_synchronize
    * kv-cache : avoid throwing exceptions across the C boundary
    * CUDA: 4D FlashAttention support (#14628)
    * CUDA: fix WMMA FA kernel
    * llama : rename attn_streams -> kv_unified
    * common : rename kv_split -> kv_unified
    Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
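A hedged usage sketch of the renamed option: the context-parameter field names below (n_seq_max, kv_unified) are assumptions based on this commit's rename of attn_streams to kv_unified, so check llama.h for the exact parameter set.

    #include "llama.h"

    // Sketch only: configure a context for many parallel sequences ("streams").
    static llama_context * make_multi_stream_ctx(llama_model * model) {
        llama_context_params cparams = llama_context_default_params();
        cparams.n_seq_max  = 32;     // number of parallel sequences
        cparams.kv_unified = false;  // assumed field: separate K/V streams per sequence
        return llama_init_from_model(model, cparams);
    }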
						 
				 
			
				
					
						
							
							
9c9e4fc635  llama-context: add ability to get logits (#14672)
    Aman Gupta, 2025-07-14 21:01:41 +08:00
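For context, a hedged example of the public side of logits access (the commit above adds an internal accessor; the calls below are the long-standing public C API): after a successful llama_decode(), the logits row of any token that requested output can be read back.

    #include "llama.h"

    // Hedged example: read the logits of output index i and return the arg-max
    // token; n_vocab comes from the model's vocabulary.
    static llama_token greedy_pick(llama_context * ctx, const llama_model * model, int32_t i) {
        const llama_vocab * vocab  = llama_model_get_vocab(model);
        const int32_t      n_vocab = llama_vocab_n_tokens(vocab);
        const float *      logits  = llama_get_logits_ith(ctx, i);

        llama_token best = 0;
        for (llama_token t = 1; t < n_vocab; ++t) {
            if (logits[t] > logits[best]) {
                best = t;
            }
        }
        return best;
    }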
						 
				 
			
				
					
						
							
							
7b50d589a8  kv-cells : fix tracking of seq_pos (#14339)
    Georgi Gerganov, 2025-06-23 12:27:35 +03:00
    * kv-cells : fix tracking of seq_pos during cache reuse
    * cont : improve error message
    * cont : add more comments

692e3cdd0a  memory : rename interface to llama_memory_context_i (#14296)
    Georgi Gerganov, 2025-06-21 08:03:46 +03:00
    * memory : rename interface to llama_memory_context_i
    * cont : fix comments
    * cont : use "mctx" for referencing a memory context

4c9fdfbe15  ubatch : new splitting logic (#14217)
    Georgi Gerganov, 2025-06-20 10:14:14 +03:00

d3e64b9f49  llama : rework embeddings logic (#14208)
    Georgi Gerganov, 2025-06-16 14:14:00 +03:00
    * llama : rework embeddings logic
    * cont : fix rerank
    * cont : engrish [no ci]
    * cont : fix rerank
    * server : support both embeddings and completions with single model
    * cont : avoid embeddings_org

c311ac664d  cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ (#14188)
    Georgi Gerganov, 2025-06-15 10:08:58 +03:00

b9912ac570  batch : auto-gen positions + verify multi-sequence input (#14177)
    Georgi Gerganov, 2025-06-15 09:18:37 +03:00
    * batch : verify multi-sequence input batches
    * cont : auto-gen positions + verify multi-seq input
    * cont : first print debug info, then perform validation
    * cont : fix position auto-gen + add comments
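A hedged example of what the auto-generated positions enable: a llama_batch built with llama_batch_get_one() carries only tokens, and the positions and sequence ids are filled in (and multi-sequence inputs validated) when the batch is processed.

    #include "llama.h"

    // Hedged example: decode a chunk of tokens without setting positions
    // explicitly; they are derived from the sequence's current state.
    static int32_t decode_chunk(llama_context * ctx, llama_token * tokens, int32_t n_tokens) {
        llama_batch batch = llama_batch_get_one(tokens, n_tokens); // pos/seq_id left unset
        return llama_decode(ctx, batch);                           // 0 on success
    }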
						 
				 
			
				
					
						
							
							
60c666347b  batch : rework llama_batch_allocr (#14153)
    Georgi Gerganov, 2025-06-13 13:47:55 +03:00
    * batch : rework llama_batch_allocr
    * cont : move validation inside class
    * cont : move output counting to class
    * cont : minor
    * batch : add TODOs

f6e1a7aa87  context : simplify output counting logic during decode (#14142)
    Georgi Gerganov, 2025-06-12 11:50:01 +03:00
    * batch : remove logits_all flag
    * context : simplify output counting logic during decode
    * cont : fix comments

c3ee46fab4  batch : remove logits_all flag (#14141)
    Georgi Gerganov, 2025-06-12 11:49:26 +03:00

9596506965  kv-cache : fix split_equal handling in unified implementation (#14130)
    Georgi Gerganov, 2025-06-12 10:02:15 +03:00

a20b2b05bc  context : round n_tokens to next multiple of n_seqs when reserving (#14140)
    compilade, 2025-06-12 02:56:04 -04:00
    This fixes RWKV inference, which otherwise failed when the worst-case
    ubatch.n_seq_tokens rounded to 0.
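A small sketch of the rounding this commit applies when reserving the worst-case graph (names are hypothetical): rounding n_tokens up to a multiple of n_seqs keeps the per-sequence token count from truncating to zero, e.g. n_tokens = 1 with n_seqs = 2 now reserves 2 tokens, giving 1 token per sequence instead of 0.

    #include <cstdint>

    // Hypothetical helper mirroring the reservation rounding described above.
    static uint32_t round_up_to_multiple(uint32_t n_tokens, uint32_t n_seqs) {
        return ((n_tokens + n_seqs - 1) / n_seqs) * n_seqs;
    }

    // e.g. round_up_to_multiple(1, 2) == 2, so n_seq_tokens = 2 / 2 = 1 instead of 0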
						 
				 
			
				
					
						
							
							
745aa5319b  llama : deprecate llama_kv_self_ API (#14030)
    Georgi Gerganov, 2025-06-06 14:11:15 +03:00
    * llama : deprecate llama_kv_self_ API
    * llama : allow llama_memory_(nullptr)
    * memory : add flag for optional data clear in llama_memory_clear
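A hedged migration sketch: the deprecated llama_kv_self_* calls map onto the llama_memory_* API reached through llama_get_memory(), and the boolean on llama_memory_clear() is the optional-data-clear flag this commit adds. Exact names per llama.h at the time of the change.

    #include "llama.h"

    // Sketch of the replacement pattern for the deprecated llama_kv_self_* calls.
    static void reset_memory(llama_context * ctx) {
        llama_memory_t mem = llama_get_memory(ctx);

        // clear the metadata and, optionally, the data buffers as well
        llama_memory_clear(mem, /*data =*/ true);

        // e.g. remove everything after position 128 in sequence 0
        llama_memory_seq_rm(mem, /*seq_id =*/ 0, /*p0 =*/ 128, /*p1 =*/ -1);
    }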
						 
				 
			
				
					
						
							
							
487a5e0401  context : fix SWA-related warning for multiple sequences (#14045)
    Georgi Gerganov, 2025-06-06 13:29:18 +03:00

d17a809ef0  llama : support multiple classifier outputs and labels (#13940)
    Sigbjørn Skjæret, 2025-06-06 09:03:25 +02:00

7f37b6cf1e  memory : migrate from llama_kv_cache to more generic llama_memory (#14006)
    Georgi Gerganov, 2025-06-05 15:29:22 +03:00
    * memory : merge llama_kv_cache into llama_memory + new `llama_memory` API
    * context : fix casts

9e31bec4fd  context : fix pos_min initialization upon error decode (#14008)
    Georgi Gerganov, 2025-06-05 09:06:29 +03:00

3e63a58ef7  kv-cache : refactor the update/defrag mechanism (#13988)
    Georgi Gerganov, 2025-06-04 18:58:20 +03:00
    * kv-cache : refactor update mechanism
    * memory : improve status handling
    * defrag : reset head + add comments
    * cont : minor fixes

803f8baf4f  llama : deprecate explicit kv_self defrag/update calls (#13921)
    Georgi Gerganov, 2025-05-31 15:58:33 +03:00

3600cc2886  llama : use n_swa + n_ubatch cells for SWA cache (#13833)
    Georgi Gerganov, 2025-05-31 15:57:44 +03:00
    * llama : use n_swa + n_ubatch cells for SWA cache
    * llama : add warning about multi-sequence SWA contexts

3f55f781f1  llama : auto-batch preparation (#13845)
    Georgi Gerganov, 2025-05-31 12:55:57 +03:00
    * llama : auto-batch
    * context : simplify if branching

12d0188c0d  kv-cache : refactor + add llama_memory_state_i (#13746)
    Georgi Gerganov, 2025-05-31 10:24:04 +03:00
    * kv-cache : simplify the "struct llama_kv_cache" interface
    * kv-cache : revert the (n_swa + n_ubatch) change (for next PR)
    * kv-cache : some comments
    * context : fix graph reserve for multiple sequences
    * kv-cache : fix typo [no ci]
    * kv-cache : fix find_slot() logic for free slots
    * llama : add TODO for deprecating the defrag API in the future
    * kv-cache : improve find_slot() using min/max seq pos info
    * llama : handle aborts and compute errors
    * memory : extract state into llama_memory_state
    * kv-cache : add comments
    * server : update batching logic to reset n_batch on successful decode
    * server : upon full re-processing, remove the sequence from the cache
    * kv-cache : add TODO for doing split_equal when split_simple fails

4f81b33e32  llama : validate seq id batch input (#13809)
    Georgi Gerganov, 2025-05-27 09:40:59 +03:00
    * llama : validate seq id batch input
    * cont : fix the fix

79c137f776  examples : allow extracting embeddings from decoder contexts (#13797)
    Georgi Gerganov, 2025-05-26 14:03:54 +03:00

de2ef53a4b  kv-cache : rework kv_cell (#13706)
    Georgi Gerganov, 2025-05-25 16:34:36 +03:00
    * kv-cache : rework kv_cell
    * kv-cells : use "shift" instead of "delta" consistently
    * llama : add llama_max_parallel_sequences()
    * kv-cells : update comments [no ci]
    * context : fail upon construction if sequences exceed max value
    * kv-cells : get_pos() -> pos_get() + comments
    * kv-cells : fix tracking of "used" cells

797f2ac062  kv-cache : simplify the interface (#13660)
    Georgi Gerganov, 2025-05-21 15:11:13 +03:00
    * kv-cache : simplify the interface
    * context : revert llama_batch_allocr position change

a4090d1174  llama : remove llama_kv_cache_view API + remove deprecated (#13653)
    Georgi Gerganov, 2025-05-20 16:13:16 +03:00

e298d2fbd0  kv-cache : add SWA support (#13194)
    Georgi Gerganov, 2025-05-20 08:05:46 +03:00
    * kv-cache : prepare for SWA
    * kv-cache : initial iSWA implementation
    * kv-cache : rework error recovery logic
    * models : fix Phi-3 SWA parameters
    * model : adjust Granite to rope factor changes
    * server : check if context can do shifts
    * iswa : for now, always enable shifts (experiment)
    * kv-cache : simplify SWA logic
    * kv-cache : apply defrag when we fail to find slots for the batch
    * llama : update docs about llama_decode
    * kv-cache : update warning logs when no space for the batch is available
    * llama : add llama_kv_self_seq_pos_min()
    * kv-cache : keep track of partial SWA computes and print warnings
    * server : disallow use cases involving partial SWA context
    * llama : add param to control SWA cache size
    * minor : clean-up

f5170c1d7a  editorconfig : fix trailing whitespace from #13542 (#13546)
    Sigbjørn Skjæret, 2025-05-14 21:22:49 +03:00

017f10b5fa  fix: crash when calling llama_state_get_size on a context without a KV cache (#13542)
    Gilad S., 2025-05-14 19:18:18 +03:00
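For reference, the call pattern involved in that crash, as a hedged example: the state size is queried first and then serialized into a caller-provided buffer; after this fix the same pattern also works for contexts created without a KV cache.

    #include "llama.h"

    #include <cstdint>
    #include <vector>

    // Hedged example: snapshot the full context state into a byte buffer.
    static std::vector<uint8_t> save_state(llama_context * ctx) {
        std::vector<uint8_t> buf(llama_state_get_size(ctx));
        llama_state_get_data(ctx, buf.data(), buf.size());
        return buf;
    }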
						 
				 
			
				
					
						
							
							
10d2af0eaa  llama/ggml: add LLM training support (#10544)
    Johannes Gäßler, 2025-05-12 14:44:49 +02:00
    * llama/ggml: add LLM training support
      more compact progress bar
      llama_save_model_to_file
      llama_opt_param_filter
      ggml_graph_dup force_grads
      refactor ggml_opt, fix test-opt
    * remove logits_all
    * refactor CUDA implementation for ACC
    * reset graph at beginning of opt period

064cc596ac  context : fix state io for memory-less contexts (#13470)
    Georgi Gerganov, 2025-05-12 15:12:27 +03:00

7f323a589f  Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B (#13386)
    David Huang, 2025-05-11 14:18:39 +02:00

6562e5a4d6  context : allow cache-less context for embeddings (#13108)
    Georgi Gerganov, 2025-05-08 14:28:33 +03:00
    * context : allow cache-less context for embeddings
    * context : enable reranking with encode()
    * context : encode() clears embd_seq
    * examples : use llama_encode() when appropriate
    * models : nomic bert moe does not require KV cache
    * llama : update comments for llama_decode/llama_encode
    * context : update warning log [no ci]
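A hedged sketch of the cache-less embedding path this change enables (parameter values are illustrative): the context is created with embeddings output enabled, the batch goes through llama_encode() rather than llama_decode(), and the pooled result is read per sequence.

    #include "llama.h"

    // Sketch only: encode a batch in an embeddings-only context and read the
    // pooled embedding of sequence 0. The context is intentionally kept alive
    // here so the returned pointer stays valid; free it with llama_free() later.
    static const float * embed_batch(llama_model * model, llama_batch batch) {
        llama_context_params cparams = llama_context_default_params();
        cparams.embeddings   = true;                    // produce embeddings
        cparams.pooling_type = LLAMA_POOLING_TYPE_MEAN; // pooled per-sequence output

        llama_context * ctx = llama_init_from_model(model, cparams);
        if (ctx == nullptr || llama_encode(ctx, batch) != 0) {
            return nullptr;
        }
        return llama_get_embeddings_seq(ctx, /*seq_id =*/ 0);
    }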
						 
				 
			
				
					
						
							
							
51fb96b1ff  context : remove logits_all flag (#13284)
    Georgi Gerganov, 2025-05-08 14:26:50 +03:00
    * context : remove logits_all flag
    * llama : remove logits_all flag + reorder llama_context_params
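With the flag gone, per-token output selection happens through the logits array of llama_batch, as in the common helpers; a hedged sketch:

    #include "llama.h"

    // Sketch: request logits only for the last prompt token instead of using the
    // removed logits_all flag. Assumes `batch` was sized with llama_batch_init().
    static void fill_prompt(llama_batch & batch, const llama_token * tokens, int32_t n_tokens) {
        for (int32_t i = 0; i < n_tokens; ++i) {
            batch.token   [i]    = tokens[i];
            batch.pos     [i]    = i;
            batch.n_seq_id[i]    = 1;
            batch.seq_id  [i][0] = 0;
            batch.logits  [i]    = (i == n_tokens - 1); // output only for the last token
        }
        batch.n_tokens = n_tokens;
    }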
						 
				 
			
				
					
						
							
							
a75cb30dc9  context : fix reorder logic (#13267)
    Georgi Gerganov, 2025-05-02 20:54:13 +03:00

c642bc014c  kv-cache : separate recurrent vs non-recurrent impl (#12799)
    Georgi Gerganov, 2025-05-02 17:48:36 +03:00
    * kv-cache : separate recurrent vs non-recurrent impl (wip)
    * kv-cache : init -> constructor + add llama_memory_params
    * kv-cache : fix callback reference
    * context : llama_kv_cache -> llama_memory_i
    * context : move memory creation logic to model
    * llama : remove reference of memory during encode
    * kv-cache : hide padding details in the implementation
    * kv-cache : add ubatch_next()
    * context : simplify sbatch logic
    * kv-cache : hide defrag logic in the implementation
    * context : hide kv cache details in implementation
    * build : fix
    * cont : another fix
    * kv-cache : simplify interface (wip)
    * kv-cache : use separate KV cell structs for unified/recurrent
    * kv-cache : clean-up
    * model : better llama_model::create_model() signature
    * kv-cache : fix recurrent seq_rm()
    * kv-cache : replace `struct callbacks` with `llama_model &`
    * kv-cache : replace `struct graph_params` with `llama_context &`
    * kv-cache : fix offload check
    * context : avoid passing unique_ptr
    * kv-cache : avoid using the backends from the llama_context (ref #13113)
    * kv-cache : more consistent debug logs [no ci]
    * kv-cache : do not pass the full llama_context for kv graphs
    * kv-cache : remove comment
    * kv-cache : ggml_rope_ext_inplace -> ggml_rope_ext
    * kv-cache : fix recurrent multi-user case
    * memory : remove comments [no ci]

16a457facd  fix typo: n_ctx_pre_seq -> n_ctx_per_seq (#13221)
    ddh0, 2025-04-30 21:28:43 +01:00

fb0471d175  context : do not clear output buffer on reserve (#13152)
    pockers21, 2025-04-28 16:45:40 +03:00
    Co-authored-by: pockers21 <liyang2@uniontech.com>

295354ea68  llama : fix K-shift with quantized K and BLAS backend (#13113)
    Diego Devesa, 2025-04-25 19:40:11 +02:00

2f74c354c0  graph : make FA compatible with MLA + add initial Metal kernels (#12953)
    Georgi Gerganov, 2025-04-17 18:16:36 +03:00
    * graph : make mla compatible with FA
    * metal : add exp FA kernels for DeepSeek models
    * llama : minor naming updates
    * ggml : disable FA for DS head sizes
    * tests : add FA tests for MLA shapes

daa422881a  llama : DeepSeek V2/V3 MLA implementation (#12801)
    Juk Armstrong, 2025-04-15 09:49:57 +03:00
    * Merged using squash to remove all noise commit messages
    * Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large
    * Removed 3 conts (2x RoPE and 1x RMS-norm)
    * Changed to use `<cmath>` instead of `<math.h>`
    * Reverted removal of the 3 conts
    * Used `reshape` in `llm_graph_context::build_attn_mha()`
    * Use `k_pe = ggml_reshape`
    * Removed the 3 conts again
    * Removed the 3D views of `wk_b` and `wv_b`, and just save as 3D in GGUF
    * Removed MQA optimisation from `build_attn_mha()` as no gains now
    * Simplified `is_mla` branch in `llm_build_deepseek2()`
    * Removed `build_attn_mla` and added `nullptr` to all `build_attn` calls
    * Fixed call to `build_attn` in `llm_build_t5_enc`