commit a885dcff11
Author: Georgi Gerganov
Date:   2025-09-08 10:27:07 +03:00

    batched-bench : fix llama_synchronize usage during prompt processing (#15835)

    ggml-ci

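As context for this fix: llama_decode can return before the backend has finished its work, so a benchmark must synchronize before reading the clock, otherwise it measures only submission time. A minimal sketch of the correct pattern, assuming the public llama_synchronize and ggml_time_us APIs (the helper name is illustrative, not from the commit):

```cpp
#include "ggml.h"
#include "llama.h"

// Time one prompt-processing pass. llama_decode() may be asynchronous on some
// backends, so wait for completion before reading the clock.
static double time_pp_seconds(llama_context * ctx, llama_batch batch) {
    const int64_t t_start = ggml_time_us();

    llama_decode(ctx, batch);
    llama_synchronize(ctx); // block until all in-flight backend work is done

    const int64_t t_end = ggml_time_us();
    return (t_end - t_start) / 1e6;
}
```
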
commit e81b8e4b7f
Author: Johannes Gäßler
Date:   2025-08-30 16:32:10 +02:00

    llama: use FA + max. GPU layers by default (#15434)

    * llama: use max. GPU layers by default, auto -fa
    * ggml-backend: abort instead of segfault

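A hedged sketch of what these defaults mean at the C API level: model params now request full GPU offload and the context resolves Flash Attention automatically. The llama_flash_attn_type enum and field below are assumed to match what this change introduced; treat them as an approximation if your llama.h differs.

```cpp
#include "llama.h"

int main() {
    // New defaults: n_gpu_layers offloads everything, FA is resolved
    // automatically. Overriding both explicitly, e.g. to get the old behavior:
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 0; // keep all layers on the CPU

    llama_context_params cparams = llama_context_default_params();
    cparams.flash_attn_type = LLAMA_FLASH_ATTN_TYPE_DISABLED; // instead of AUTO

    llama_model * model = llama_model_load_from_file("model.gguf", mparams);
    if (!model) {
        return 1;
    }
    llama_context * ctx = llama_init_from_model(model, cparams);
    // ... run inference ...
    llama_free(ctx);
    llama_model_free(model);
    return 0;
}
```
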
commit b3964c1e89
Author: Georgi Gerganov
Date:   2025-08-26 14:22:14 +03:00

    metal : optimize FA vec for large sequences and BS <= 8 (#15566)

    * metal : optimize FA vec for large heads and sequences
    * metal : adjust small-batch mul mv kernels
    ggml-ci
    * batched-bench : fix total speed computation
    ggml-ci
    * cont : add comments
    ggml-ci

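On the total-speed fix: batched-bench reports total speed as all tokens over all wall-clock time, and the token count depends on whether the prompt is shared across the PL parallel sequences. A hedged sketch of that arithmetic, with variable names mirroring the tool's PP/TG/PL columns (the exact expression is an assumption, not a quote of the fix):

```cpp
// Total speed = all tokens / all time. With a shared prompt the PP tokens are
// processed once; otherwise each of the PL sequences pays for its own prompt.
static float total_speed(bool is_pp_shared, int pp, int tg, int pl,
                         double t_pp_s, double t_tg_s) {
    const int n_tokens = is_pp_shared ? pp + pl * tg
                                      : pl * (pp + tg);
    return (float) (n_tokens / (t_pp_s + t_tg_s));
}
```
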
commit 6b64f74b55
Author: Georgi Gerganov
Date:   2025-08-25 13:56:43 +03:00

    batched-bench : fix unified KV cache handling + pp timing (#15562)

    * batched-bench : fix unified KV cache handling + pp timing
    * cont : run dummy token only with split KV cache

commit f0d3c7405c
Author: Georgi Gerganov
Date:   2025-08-19 08:45:12 +03:00

    batched-bench : use rand tokens (#15398)

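Random token ids make the benchmark exercise varied embedding rows and attention patterns instead of one repeated token. A minimal sketch of how the prompt batch might be filled, assuming the common_batch_add helper from common.h (the commit's actual code may differ):

```cpp
#include <cstdlib>

#include "common.h"

// Fill the prompt with random tokens from the vocabulary rather than a single
// fixed token id; sequence 0, no logits requested for prompt tokens.
static void fill_random_prompt(llama_batch & batch, int n_pp, int n_vocab) {
    for (int i = 0; i < n_pp; ++i) {
        common_batch_add(batch, std::rand() % n_vocab, i, { 0 }, false);
    }
}
```
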
commit 225e7a1438
Author: Georgi Gerganov
Date:   2025-07-16 16:35:42 +03:00

    llama : add high-throughput mode (#14363)

    * kv-cache : prepare K/V buffers for separation
    ggml-ci
    * batched-bench : fix oob write
    ggml-ci
    * llama : add "virtual sequences"
    ggml-ci
    * llama : use "stream" vs "virtual sequence"
    ggml-ci
    * graph : fix stream splitting when KV cache is not used
    ggml-ci
    * kv-cache : add multi-stream save/load support
    ggml-ci
    * llama : add "--attn-streams" flag
    ggml-ci
    * kv-cache : fix handling when find_slot fails
    ggml-ci
    * kv-cache : restore find_slot impl
    ggml-ci
    * kv-cache : add comments
    * kv-cache : add bounds checks for sequence id
    ggml-ci
    * cont : add n_seq_max to batch allocr
    ggml-ci
    * kv-cache : perform stream copies lazily after llama_synchronize
    ggml-ci
    * kv-cache : avoid throwing exceptions across the C boundary
    ggml-ci
    * CUDA: 4D FlashAttention support (#14628)
    * CUDA: 4D FlashAttention support
    * CUDA: fix WMMA FA kernel
    * llama : rename attn_streams -> kv_unified
    ggml-ci
    * common : rename kv_split -> kv_unified
    ggml-ci
    ---------
    Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

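The end-user surface of this change is the unified-vs-split KV cache choice: with a split cache, each sequence gets its own KV stream, which is what the high-throughput path exploits. A hedged configuration sketch using the field names this work settled on (kv_unified, n_seq_max); defaults may differ between versions:

```cpp
#include "llama.h"

// Configure a context for high-throughput parallel decoding: many sequences,
// each with its own KV cache stream instead of one shared unified buffer.
static llama_context_params high_throughput_params() {
    llama_context_params cparams = llama_context_default_params();
    cparams.n_seq_max  = 32;    // parallel sequences, one KV stream each
    cparams.kv_unified = false; // split the KV cache per sequence
    return cparams;
}
```
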
commit 745aa5319b
Author: Georgi Gerganov
Date:   2025-06-06 14:11:15 +03:00

    llama : deprecate llama_kv_self_ API (#14030)

    * llama : deprecate llama_kv_self_ API
    ggml-ci
    * llama : allow llama_memory_(nullptr)
    ggml-ci
    * memory : add flag for optional data clear in llama_memory_clear
    ggml-ci

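For callers migrating off the deprecated functions, the pattern is to fetch the memory handle once and use the llama_memory_* equivalents; the bool on llama_memory_clear is the optional data-clear flag this commit mentions. A hedged sketch:

```cpp
#include "llama.h"

// Migration sketch: deprecated llama_kv_self_* calls -> llama_memory_* on a
// handle obtained from the context.
static void reset_sequence(llama_context * ctx, llama_seq_id seq_id,
                           llama_pos p0, llama_pos p1) {
    llama_memory_t mem = llama_get_memory(ctx);

    // old: llama_kv_self_seq_rm(ctx, seq_id, p0, p1);
    llama_memory_seq_rm(mem, seq_id, p0, p1);

    // old: llama_kv_self_clear(ctx);
    // the flag chooses whether data buffers are cleared or only the metadata
    llama_memory_clear(mem, /*data =*/ true);
}
```
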
commit b89d605a91
Author: Georgi Gerganov
Date:   2025-05-13 18:01:53 +03:00

    batched-bench : fix pp batch contents (#13492)

commit 1d36b3670b
Author: Diego Devesa
Date:   2025-05-02 20:27:13 +02:00

    llama : move end-user examples to tools directory (#13249)

    * llama : move end-user examples to tools directory
    ---------
    Co-authored-by: Xuan Son Nguyen <son@huggingface.co>