commit 803f8baf4f
Author: Georgi Gerganov
Date:   2025-05-31 15:58:33 +03:00

    llama : deprecate explicit kv_self defrag/update calls (#13921)

commit 3600cc2886
Author: Georgi Gerganov
Date:   2025-05-31 15:57:44 +03:00

    llama : use n_swa + n_ubatch cells for SWA cache (#13833)

    * llama : use n_swa + n_ubatch cells for SWA cache
    * llama : add warning about multi-sequence SWA contexts

commit 3f55f781f1
Author: Georgi Gerganov
Date:   2025-05-31 12:55:57 +03:00

    llama : auto-batch preparation (#13845)

    * llama : auto-batch
    * context : simplify if branching

commit 12d0188c0d
Author: Georgi Gerganov
Date:   2025-05-31 10:24:04 +03:00

    kv-cache : refactor + add llama_memory_state_i (#13746)

    * kv-cache : simplify the "struct llama_kv_cache" interface
    * kv-cache : revert the (n_swa + n_ubatch) change (for next PR)
    * kv-cache : some comments
    * context : fix graph reserve for multiple sequences
    * kv-cache : fix typo [no ci]
    * kv-cache : fix find_slot() logic for free slots
    * llama : add TODO for deprecating the defrag API in the future
    * kv-cache : improve find_slot() using min/max seq pos info
    * llama : handle aborts and compute errors
    * memory : extract state into llama_memory_state
    * kv-cache : add comments
    * server : update batching logic to reset n_batch on successful decode
    * server : upon full re-processing, remove the sequence from the cache
    * kv-cache : add TODO for doing split_equal when split_simple fails

commit 4f81b33e32
Author: Georgi Gerganov
Date:   2025-05-27 09:40:59 +03:00

    llama : validate seq id batch input (#13809)

    * llama : validate seq id batch input
    * cont : fix the fix

commit 79c137f776
Author: Georgi Gerganov
Date:   2025-05-26 14:03:54 +03:00

    examples : allow extracting embeddings from decoder contexts (#13797)

commit de2ef53a4b
Author: Georgi Gerganov
Date:   2025-05-25 16:34:36 +03:00

    kv-cache : rework kv_cell (#13706)

    * kv-cache : rework kv_cell
    * kv-cells : use "shift" instead of "delta" consistently
    * llama : add llama_max_parallel_sequences()
    * kv-cells : update comments [no ci]
    * context : fail upon construction if sequences exceed max value
    * kv-cells : get_pos() -> pos_get() + comments
    * kv-cells : fix tracking of "used" cells

commit 797f2ac062
Author: Georgi Gerganov
Date:   2025-05-21 15:11:13 +03:00

    kv-cache : simplify the interface (#13660)

    * kv-cache : simplify the interface
    * context : revert llama_batch_allocr position change

commit a4090d1174
Author: Georgi Gerganov
Date:   2025-05-20 16:13:16 +03:00

    llama : remove llama_kv_cache_view API + remove deprecated (#13653)

commit e298d2fbd0
Author: Georgi Gerganov
Date:   2025-05-20 08:05:46 +03:00

    kv-cache : add SWA support (#13194)

    * kv-cache : prepare for SWA
    * kv-cache : initial iSWA implementation
    * kv-cache : rework error recovery logic
    * models : fix Phi-3 SWA parameters
    * model : adjust Granite to rope factor changes
    * server : check if context can do shifts
    * iswa : for now, always enable shifts (experiment)
    * kv-cache : simplify SWA logic
    * kv-cache : apply defrag when we fail to find slots for the batch
    * llama : update docs about llama_decode
    * kv-cache : update warning logs when no space for the batch is available
    * llama : add llama_kv_self_seq_pos_min()
    * kv-cache : keep track of partial SWA computes and print warnings
    * server : disallow use cases involving partial SWA context
    * llama : add param to control SWA cache size
    * minor : clean-up

commit f5170c1d7a
Author: Sigbjørn Skjæret
Date:   2025-05-14 21:22:49 +03:00

    editorconfig : fix trailing whitespace from #13542 (#13546)

commit 017f10b5fa
Author: Gilad S.
Date:   2025-05-14 19:18:18 +03:00

    fix: crash when calling llama_state_get_size on a context without a KV cache (#13542)

commit 10d2af0eaa
Author: Johannes Gäßler
Date:   2025-05-12 14:44:49 +02:00

    llama/ggml: add LLM training support (#10544)

    * llama/ggml: add LLM training support
      more compact progress bar
      llama_save_model_to_file
      llama_opt_param_filter
      ggml_graph_dup force_grads
      refactor ggml_opt, fix test-opt
    * remove logits_all
    * refactor CUDA implementation for ACC
    * reset graph at beginning of opt period

commit 064cc596ac
Author: Georgi Gerganov
Date:   2025-05-12 15:12:27 +03:00

    context : fix state io for memory-less contexts (#13470)

commit 7f323a589f
Author: David Huang
Date:   2025-05-11 14:18:39 +02:00

    Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B (#13386)

commit 6562e5a4d6
Author: Georgi Gerganov
Date:   2025-05-08 14:28:33 +03:00

    context : allow cache-less context for embeddings (#13108)

    * context : allow cache-less context for embeddings
    * context : enable reranking with encode()
    * context : encode() clears embd_seq
    * examples : use llama_encode() when appropriate
    * models : nomic bert moe does not require KV cache
    * llama : update comments for llama_decode/llama_encode
    * context : update warning log [no ci]

commit 51fb96b1ff
Author: Georgi Gerganov
Date:   2025-05-08 14:26:50 +03:00

    context : remove logits_all flag (#13284)

    * context : remove logits_all flag
    * llama : remove logits_all flag + reorder llama_context_params

commit a75cb30dc9
Author: Georgi Gerganov
Date:   2025-05-02 20:54:13 +03:00

    context : fix reorder logic (#13267)

commit c642bc014c
Author: Georgi Gerganov
Date:   2025-05-02 17:48:36 +03:00

    kv-cache : separate recurrent vs non-recurrent impl (#12799)

    * kv-cache : separate recurrent vs non-recurrent impl (wip)
    * kv-cache : init -> constructor + add llama_memory_params
    * kv-cache : fix callback reference
    * context : llama_kv_cache -> llama_memory_i
    * context : move memory creation logic to model
    * llama : remove reference of memory during encode
    * kv-cache : hide padding details in the implementation
    * kv-cache : add ubatch_next()
    * context : simplify sbatch logic
    * kv-cache : hide defrag logic in the implementation
    * context : hide kv cache details in implementation
    * build : fix
    * cont : another fix
    * kv-cache : simplify interface (wip)
    * kv-cache : use separate KV cell structs for unified/recurrent
    * kv-cache : clean-up
    * model : better llama_model::create_model() signature
    * kv-cache : fix recurrent seq_rm()
    * kv-cache : replace `struct callbacks` with `llama_model &`
    * kv-cache : replace `struct graph_params` with `llama_context &`
    * kv-cache : fix offload check
    * context : avoid passing unique_ptr
    * kv-cache : avoid using the backends from the llama_context (ref #13113)
    * kv-cache : more consistent debug logs [no ci]
    * kv-cache : do not pass the full llama_context for kv graphs
    * kv-cache : remove comment
    * kv-cache : ggml_rope_ext_inplace -> ggml_rope_ext
    * kv-cache : fix recurrent multi-user case
    * memory : remove comments [no ci]

commit 16a457facd
Author: ddh0
Date:   2025-04-30 21:28:43 +01:00

    fix typo: n_ctx_pre_seq -> n_ctx_per_seq (#13221)

commit fb0471d175
Author: pockers21
Date:   2025-04-28 16:45:40 +03:00

    context : do not clear output buffer on reserve (#13152)

    Co-authored-by: pockers21 <liyang2@uniontech.com>

commit 295354ea68
Author: Diego Devesa
Date:   2025-04-25 19:40:11 +02:00

    llama : fix K-shift with quantized K and BLAS backend (#13113)

commit 2f74c354c0
Author: Georgi Gerganov
Date:   2025-04-17 18:16:36 +03:00

    graph : make FA compatible with MLA + add initial Metal kernels (#12953)

    * graph : make mla compatible with FA
    * metal : add exp FA kernels for DeepSeek models
    * llama : minor naming updates
    * ggml : disable FA for DS head sizes
    * tests : add FA tests for MLA shapes

commit daa422881a
Author: Juk Armstrong
Date:   2025-04-15 09:49:57 +03:00

    llama : DeepSeek V2/V3 MLA implementation (#12801)

    * Merged using squash to remove all noise commit messages
    * Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large
    * Removed 3 conts (2x RoPE and 1x RMS-norm)
    * Changed to use `<cmath>` instead of `<math.h>`
    * Reverted removal of the 3 conts
    * Used `reshape` in `llm_graph_context::build_attn_mha()`
    * Use `k_pe = ggml_reshape`
    * Removed the 3 conts again
    * Removed the 3D views of `wk_b` and `wv_b`, and just save as 3D in GGUF
    * Removed MQA optimisation from `build_attn_mha()` as no gains now
    * Simplified `is_mla` branch in `llm_build_deepseek2()`
    * Removed `build_attn_mla` and added `nullptr` to all `build_attn` calls
    * Fixed call to `build_attn` in `llm_build_t5_enc`

commit 3e1d29348b
Author: Georgi Gerganov
Date:   2025-04-04 21:48:10 +03:00

    kv-cache : simplify + fix warning for recurrent models (#12756)

commit e0e912f49b
Author: Diego Devesa
Date:   2025-04-02 14:52:01 +02:00

    llama : add option to override model tensor buffers (#11397)

    * llama : add option to override tensor buffers
    * ggml : fix possible underflow in ggml_nbytes

commit a10b36c91a
Author: Georgi Gerganov
Date:   2025-04-02 14:32:59 +03:00

    llama : refactor kv cache guard (#12695)

    * llama : refactor kv cache guard
    * cont : fix comment [no ci]
    * llama : fix kv_cache restore logic
    * context : simplify kv cache updates
    * cont : better name [no ci]
    * llama : fix llama_decode return code when could not find KV slot
    * context : change log err -> warn [no ci]
    * kv-cache : add comment + warning

commit af6ae1efb2
Author: Xuan-Son Nguyen
Date:   2025-03-30 00:07:37 +01:00

    llama : fix non-causal mask for gemma 3 (#12615)

commit b4ae50810e
Author: Georgi Gerganov
Date:   2025-03-28 20:21:59 +02:00

    metal : improve FA + improve MoE (#12612)

    * ggml : FA with different K, V head sizes (CPU)
    * metal : add FA with HS=192
    * metal : extend FA to support different K and V head sizes
    * metal : add FA vector kernels for heads K 192 and V 128
    * ggml : restrict op on other backends to equal head sizes
    * metal : optimize FA-vec kernel
    * metal : FA remove mq registers
    * metal : improve MoE mul_mat_id condition
    * metal : fix comments + remove unnecessary addition
    * metal : avoid too much shared memory usage with mul_mat_id

commit 2d77d88e70
Author: Georgi Gerganov
Date:   2025-03-25 09:19:23 +02:00

    context : fix worst-case reserve outputs (#12545)

commit 568013d0cd
Author: fairydreaming
Date:   2025-03-19 21:01:57 +01:00

    context : clear sets containing encoder output sequence ids before storing new values (#12470)

    Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

commit 75422e8bc4
Author: Georgi Gerganov
Date:   2025-03-18 21:35:19 +02:00

    graph : normalize Q, K, V shapes + sync cross attention (#12449)

    * graph : normalize Q, K, V shapes and add comments
    * context : synchronize before getting cross attention data
    * model : fix command-r attention norm check

commit 8551c44d84
Author: Georgi Gerganov
Date:   2025-03-18 13:05:49 +02:00

    context : always use non-causal attention for encoder graphs (#12447)

    * context : always use non-causal attention for encoder graphs
    * context : move the change to llama_context::encode()

commit dc079cfdff
Author: Georgi Gerganov
Date:   2025-03-16 19:29:36 +02:00

    context : fix init of n_outputs (#12397)

commit 8fcb563613
Author: fairydreaming
Date:   2025-03-14 13:47:05 +01:00

    Load all MoE experts during warmup (#11571)

    * llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup
    * common : use new API to enable warmup mode during model warmup

    Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

commit 081bee8c64
Author: Georgi Gerganov
Date:   2025-03-14 09:03:24 +02:00

    hparams : add SWA rope parameters (#12374)

commit 84d5475541
Author: Georgi Gerganov
Date:   2025-03-13 19:08:07 +02:00

    llama : fix Gemma3 SWA KV cache shift (#12373)

    * llama : fix Gemma3 SWA KV cache shift
    * hparams : add comment [no ci]

commit e0dbec0bc6
Author: Georgi Gerganov
Date:   2025-03-13 12:35:44 +02:00

    llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181)

    * llama : refactor llama_context, llama_kv_cache, llm_build_context
    * graph : don't mutate the KV cache during defrag
    * context : reduce virtuals + remove test function
    * context : move interface implementation to source file + factory
    * graph : move KV cache build functions to llama_context impl
    * graph : remove model reference from build_pooling
    * graph : remove llama_model reference
    * kv_cache : provide rope factors
    * graph : rework inputs to use only unique_ptr, remove attn input abstraction
    * context : remove llama_context_i abstraction
    * context : clean-up
    * graph : clean-up
    * llama : remove redundant keywords (struct, enum)
    * model : adapt gemma3
    * graph : restore same attention ops as on master
    * llama : remove TODO + fix indent

commit afa8a9ec9b
Author: Georgi Gerganov
Date:   2025-01-12 11:32:42 +02:00

    llama : add llama_vocab, functions -> methods, naming (#11110)

    * llama : functions -> methods (#11110)
    * llama : add struct llama_vocab to the API (#11156)
    * hparams : move vocab params to llama_vocab (#11159)
    * vocab : more pimpl (#11165)
    * vocab : minor tokenization optimizations (#11160)
    * lora : update API names (#11167)
    * llama : update API names to use correct prefix (#11174)
    * cont
    * cont
    * minor [no ci]
    * vocab : llama_vocab_add_[be]os -> llama_vocab_get_add_[be]os (#11174)
    * vocab : llama_vocab_n_vocab -> llama_vocab_n_tokens (#11174)

    Co-authored-by: Diego Devesa <slarengh@gmail.com>

commit f66f582927
Author: Georgi Gerganov
Date:   2025-01-03 10:18:53 +02:00

    llama : refactor src/llama.cpp (#10902)

    * llama : scatter llama.cpp into multiple modules (wip)
    * llama : control-vector -> adapter
    * llama : arch
    * llama : mmap
    * ci : remove BUILD_SHARED_LIBS=OFF
    * llama : arch (cont)
    * llama : chat
    * llama : model
    * llama : hparams
    * llama : adapter
    * examples : fix
    * rebase
    * minor
    * llama : kv cache
    * llama : impl
    * llama : batch
    * cont
    * llama : context
    * minor
    * llama : context (cont)
    * llama : model loader
    * common : update lora
    * llama : quant
    * llama : quant (cont)
    * minor [no ci]