b9912ac570  batch : auto-gen positions + verify multi-sequence input (#14177)
Author: Georgi Gerganov | Date: 2025-06-15 09:18:37 +03:00

    * batch : verify multi-sequence input batches
    * cont : auto-gen positions + verify multi-seq input
    * cont : first print debug info, then perform validation
    * cont : fix position auto-gen + add comments

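With this change, batches submitted without explicit positions get them auto-generated and multi-sequence inputs are validated up front. A minimal sketch of relying on that via llama_batch_get_one(), which leaves the pos array unset (error handling beyond the return code omitted):

    #include "llama.h"
    #include <vector>

    // Decode a run of tokens without filling in positions explicitly;
    // the batch validation/auto-generation added here fills them in.
    static bool decode_tokens(llama_context * ctx, std::vector<llama_token> & toks) {
        llama_batch batch = llama_batch_get_one(toks.data(), (int32_t) toks.size());
        return llama_decode(ctx, batch) == 0; // non-zero means the batch was rejected
    }
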
745aa5319b  llama : deprecate llama_kv_self_ API (#14030)
Author: Georgi Gerganov | Date: 2025-06-06 14:11:15 +03:00

    * llama : deprecate llama_kv_self_ API
    * llama : allow llama_memory_(nullptr)
    * memory : add flag for optional data clear in llama_memory_clear

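The replacement for the deprecated llama_kv_self_* calls is the llama_memory_* API. A minimal sketch, assuming the llama_get_memory / llama_memory_seq_rm / llama_memory_clear entry points this series exposes:

    #include "llama.h"

    // Edit or clear the context's memory (KV cache) through the generic memory API
    // instead of the deprecated llama_kv_self_* calls.
    static void reset_memory(llama_context * ctx) {
        llama_memory_t mem = llama_get_memory(ctx);

        // remove sequence 0 entirely (p0 = -1, p1 = -1 means the whole range)
        llama_memory_seq_rm(mem, /*seq_id=*/0, /*p0=*/-1, /*p1=*/-1);

        // full clear; the second argument (added in this commit) controls whether
        // the underlying buffers are also zeroed, not just the metadata
        llama_memory_clear(mem, /*data=*/true);
    }
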
d17a809ef0  llama : support multiple classifier outputs and labels (#13940)
Author: Sigbjørn Skjæret | Date: 2025-06-06 09:03:25 +02:00

7f37b6cf1e  memory : migrate from llama_kv_cache to more generic llama_memory (#14006)
Author: Georgi Gerganov | Date: 2025-06-05 15:29:22 +03:00

    * memory : merge llama_kv_cache into llama_memory + new `llama_memory` API
    * context : fix casts

803f8baf4f  llama : deprecate explicit kv_self defrag/update calls (#13921)
Author: Georgi Gerganov | Date: 2025-05-31 15:58:33 +03:00

3600cc2886  llama : use n_swa + n_ubatch cells for SWA cache (#13833)
Author: Georgi Gerganov | Date: 2025-05-31 15:57:44 +03:00

    * llama : use n_swa + n_ubatch cells for SWA cache
    * llama : add warning about multi-sequence SWA contexts

12d0188c0d  kv-cache : refactor + add llama_memory_state_i (#13746)
Author: Georgi Gerganov | Date: 2025-05-31 10:24:04 +03:00

    * kv-cache : simplify the "struct llama_kv_cache" interface
    * kv-cache : revert the (n_swa + n_ubatch) change (for next PR)
    * kv-cache : some comments
    * context : fix graph reserve for multiple sequences
    * kv-cache : fix typo [no ci]
    * kv-cache : fix find_slot() logic for free slots
    * llama : add TODO for deprecating the defrag API in the future
    * kv-cache : improve find_slot() using min/max seq pos info
    * llama : handle aborts and compute errors
    * memory : extract state into llama_memory_state
    * kv-cache : add comments
    * server : update batching logic to reset n_batch on successful decode
    * server : upon full re-processing, remove the sequence from the cache
    * kv-cache : add TODO for doing split_equal when split_simple fails

22229314fc  llama : clarify deprecation message (#13794)
Author: Georgi Gerganov | Date: 2025-05-26 12:57:50 +03:00

de2ef53a4b  kv-cache : rework kv_cell (#13706)
Author: Georgi Gerganov | Date: 2025-05-25 16:34:36 +03:00

    * kv-cache : rework kv_cell
    * kv-cells : use "shift" instead of "delta" consistently
    * llama : add llama_max_parallel_sequences()
    * kv-cells : update comments [no ci]
    * context : fail upon construction if sequences exceed max value
    * kv-cells : get_pos() -> pos_get() + comments
    * kv-cells : fix tracking of "used" cells

797f2ac062  kv-cache : simplify the interface (#13660)
Author: Georgi Gerganov | Date: 2025-05-21 15:11:13 +03:00

    * kv-cache : simplify the interface
    * context : revert llama_batch_allocr position change

a4090d1174  llama : remove llama_kv_cache_view API + remove deprecated (#13653)
Author: Georgi Gerganov | Date: 2025-05-20 16:13:16 +03:00

e298d2fbd0  kv-cache : add SWA support (#13194)
Author: Georgi Gerganov | Date: 2025-05-20 08:05:46 +03:00

    * kv-cache : prepare for SWA
    * kv-cache : initial iSWA implementation
    * kv-cache : rework error recovery logic
    * models : fix Phi-3 SWA parameters
    * model : adjust Granite to rope factor changes
    * server : check if context can do shifts
    * iswa : for now, always enable shifts (experiment)
    * kv-cache : simplify SWA logic
    * kv-cache : apply defrag when we fail to find slots for the batch
    * llama : update docs about llama_decode
    * kv-cache : update warning logs when no space for the batch is available
    * llama : add llama_kv_self_seq_pos_min()
    * kv-cache : keep track of partial SWA computes and print warnings
    * server : disallow use cases involving partial SWA context
    * llama : add param to control SWA cache size
    * minor : clean-up

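The "param to control SWA cache size" is a context flag; the sketch below assumes it is the swa_full boolean in llama_context_params (the exact field name is an assumption, not confirmed by this log):

    #include "llama.h"

    // Context setup for a sliding-window-attention model, assuming the swa_full
    // flag added around this change: false keeps the SWA layers' cache at roughly
    // n_swa + n_ubatch cells, true keeps a full-size cache for those layers.
    static llama_context * make_ctx(llama_model * model) {
        llama_context_params cparams = llama_context_default_params();
        cparams.n_ctx    = 8192;
        cparams.swa_full = false; // assumption: smaller SWA cache, trades memory for no cache reuse on SWA layers
        return llama_init_from_model(model, cparams);
    }
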
cf0a43bb64  llama-bench : add defrag-thold, check for invalid ranges (#13487)
Author: Diego Devesa | Date: 2025-05-13 00:31:37 +02:00

10d2af0eaa  llama/ggml: add LLM training support (#10544)
Author: Johannes Gäßler | Date: 2025-05-12 14:44:49 +02:00

    * llama/ggml: add LLM training support
      - more compact progress bar
      - llama_save_model_to_file
      - llama_opt_param_filter
      - ggml_graph_dup force_grads
      - refactor ggml_opt, fix test-opt
    * remove logits_all
    * refactor CUDA implementation for ACC
    * reset graph at beginning of opt period

7f323a589f  Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B (#13386)
Author: David Huang | Date: 2025-05-11 14:18:39 +02:00

d2a4ef05c6  vocab : add ByteDance-Seed/Seed-Coder (#13423)
Author: Sigbjørn Skjæret | Date: 2025-05-10 22:08:07 +02:00

6562e5a4d6  context : allow cache-less context for embeddings (#13108)
Author: Georgi Gerganov | Date: 2025-05-08 14:28:33 +03:00

    * context : allow cache-less context for embeddings
    * context : enable reranking with encode()
    * context : encode() clears embd_seq
    * examples : use llama_encode() when appropriate
    * models : nomic bert moe does not require KV cache
    * llama : update comments for llama_decode/llama_encode
    * context : update warning log [no ci]

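A minimal sketch of an embeddings-only (cache-less) context as enabled here: the context is created with embeddings on, the batch goes through llama_encode() rather than llama_decode(), and the pooled embedding is read back per sequence (tokenization and error handling omitted):

    #include "llama.h"
    #include <vector>

    // Embed a pre-tokenized sequence with an encoder-style, cache-less context.
    static std::vector<float> embed(llama_model * model, std::vector<llama_token> toks) {
        llama_context_params cparams = llama_context_default_params();
        cparams.embeddings   = true;                     // output embeddings
        cparams.pooling_type = LLAMA_POOLING_TYPE_MEAN;  // one pooled vector per sequence
        llama_context * ctx = llama_init_from_model(model, cparams);

        llama_batch batch = llama_batch_get_one(toks.data(), (int32_t) toks.size());
        llama_encode(ctx, batch);                        // no KV cache needed on this path

        const int n_embd = llama_model_n_embd(model);
        const float * e  = llama_get_embeddings_seq(ctx, 0);
        std::vector<float> out(e, e + n_embd);

        llama_free(ctx);
        return out;
    }
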
51fb96b1ff  context : remove logits_all flag (#13284)
Author: Georgi Gerganov | Date: 2025-05-08 14:26:50 +03:00

    * context : remove logits_all flag
    * llama : remove logits_all flag + reorder llama_context_params

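With logits_all gone, callers that want logits for every position mark each token in the batch instead. A minimal sketch using the per-token output flags of llama_batch (error handling omitted):

    #include "llama.h"
    #include <vector>

    // Request logits for every token of the prompt by setting the per-token
    // output flag in the batch, instead of the removed context-wide logits_all.
    static void decode_with_all_logits(llama_context * ctx, const std::vector<llama_token> & toks) {
        llama_batch batch = llama_batch_init((int32_t) toks.size(), /*embd=*/0, /*n_seq_max=*/1);
        for (size_t i = 0; i < toks.size(); ++i) {
            batch.token   [i]    = toks[i];
            batch.pos     [i]    = (llama_pos) i;
            batch.n_seq_id[i]    = 1;
            batch.seq_id  [i][0] = 0;
            batch.logits  [i]    = true;   // output logits for this position
        }
        batch.n_tokens = (int32_t) toks.size();

        llama_decode(ctx, batch);
        const float * first = llama_get_logits_ith(ctx, 0); // logits for the first marked position
        (void) first;

        llama_batch_free(batch);
    }
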
d9d398f84f  sampling : when top-k <= 0 -> noop (#13173)
Author: Georgi Gerganov | Date: 2025-04-29 20:22:57 +03:00

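After this change a non-positive k simply disables the top-k filter rather than clamping to the vocabulary size. A small sampler-chain sketch using the standard llama.h sampler API:

    #include "llama.h"

    // Build a chain where top_k <= 0 acts as a no-op: all candidates pass through
    // to the final distribution sampler.
    static llama_sampler * make_chain(int32_t top_k /* e.g. 0 or -1 to disable */) {
        llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
        llama_sampler_chain_add(chain, llama_sampler_init_top_k(top_k)); // <= 0 -> pass-through
        llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
        return chain; // free with llama_sampler_free()
    }
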
ecda2ec4b3  mtmd : Support Pixtral 12B (#13065)
Author: Xuan-Son Nguyen | Date: 2025-04-23 20:21:59 +02:00

    * add pixtral text model (vision is wip)
    * cgraph ok, just missing 2D RoPE
    * fix bad rebase
    * first working version
    * fix problem with img_break token
    * support dynamic image size
    * update docs
    * update test script

71e90e8813  quantize: Handle user-defined quantization levels for additional tensors (#12511)
Author: Ed Addario | Date: 2025-04-13 21:29:28 +03:00

    * Add llama_model_quantize_params parameters
    * Add new quantize parameters parsing and validation
    * Update usage
    * Add new parameters defaults
    * Add new quantization parameters logic
    * Add llama_model_quantize_params parameters
    * Add new quantize parameters parsing and validation
    * Update usage
    * Add new parameters defaults
    * Add new quantization parameters logic
    * Minor refactoring as per the contributors' coding guidelines
    * Update descriptions to match existing style
    * Add llama_model_quantize_params parameters
    * Add new quantize parameters parsing and validation
    * Update usage
    * Add new parameters defaults
    * Add new quantization parameters logic
    * Minor refactoring as per the contributors' guidelines
    * Implement general --tensor-type instead of tensor-specific command option
    * Fix implied type bug
    * Restore missing #includes
    * Add regex capability for tensor selection
    * Refactor function name and update ALLOWED_TENSOR_TYPE
    * Add missing #include
    * Handle edge case when tensor name is cls.output
    * Minor logging improvement

1466621e73  llama : Support llama 4 text-only (#12791)
Author: Xuan-Son Nguyen | Date: 2025-04-07 23:06:44 +02:00

    * llama4 conversion
    * initial support, no chat template
    * clean up a bit
    * fix tokenizer conversion
    * correct hparams
    * try this
    * fix shexp
    * ffn_inp_normed
    * chat template
    * clean up model conversion
    * add_bos
    * add scale_before_ffn
    * fix order
    * weight_before_ffn
    * llm_graph_input_attn_temp
    * add chunk attn mask
    * build_inp_attn_scale()
    * add comment about ggml_repeat
    * clarify comments
    * fix build

e0e912f49b  llama : add option to override model tensor buffers (#11397)
Author: Diego Devesa | Date: 2025-04-02 14:52:01 +02:00

    * llama : add option to override tensor buffers
    * ggml : fix possible underflow in ggml_nbytes

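This is the feature behind the --override-tensor / -ot command-line option. A sketch of the programmatic side, assuming the tensor_buft_overrides field and llama_model_tensor_buft_override struct this change adds to llama_model_params (exact names and the null-terminated layout are assumptions):

    #include "llama.h"
    #include "ggml-cpu.h"

    // Keep MoE expert tensors in host memory while the rest of the model goes to
    // the usual backend buffers (the typical "-ot" use case for large MoE models).
    static llama_model * load_with_overrides(const char * path) {
        // assumption: override entries are regex pattern -> ggml buffer type,
        // and the array is terminated by an entry with a null pattern
        static const llama_model_tensor_buft_override overrides[] = {
            { "ffn_.*_exps", ggml_backend_cpu_buffer_type() },
            { nullptr,       nullptr                        },
        };

        llama_model_params mparams    = llama_model_default_params();
        mparams.tensor_buft_overrides = overrides;

        return llama_model_load_from_file(path, mparams);
    }
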
2c3f8b850a  llama : support BailingMoE (Ling) (#12634)
Author: Sigbjørn Skjæret | Date: 2025-03-30 22:21:03 +02:00

b3de7cac73  llama : add Trillion 7B model support (#12556)
Author: Juyoung Suk | Date: 2025-03-30 20:38:33 +02:00

    * Support Trillion 7B
    * Update llama.h
    * Update llama.h
    * Update llama-vocab.cpp for Trillion
    * Update llama-vocab.cpp

dd373dd3bf  llama: fix error on bad grammar (#12628)
Author: Johannes Gäßler | Date: 2025-03-28 18:08:52 +01:00

00d53800e0  llama-vocab : add SuperBPE pre-tokenizer (#12532)
Author: compilade | Date: 2025-03-24 11:47:24 +01:00

8fcb563613  Load all MoE experts during warmup (#11571)
Author: fairydreaming | Date: 2025-03-14 13:47:05 +01:00

    * llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup
    * common : use new API to enable warmup mode during model warmup

    Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

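A minimal sketch of the new call: warmup mode is toggled around a dummy decode so every expert's weights get touched once. The llama_set_warmup() signature is as introduced here; the dummy-batch details and the cache cleanup via the memory API are illustrative:

    #include "llama.h"
    #include <vector>

    // Warm up an MoE model: with warmup mode on, the graph routes through all
    // experts, so a single dummy decode pages in every expert's weights.
    static void warmup(llama_context * ctx, const llama_model * model) {
        llama_set_warmup(ctx, true);

        const llama_vocab * vocab = llama_model_get_vocab(model);
        std::vector<llama_token> tmp = { llama_vocab_bos(vocab) };   // illustrative one-token batch
        llama_decode(ctx, llama_batch_get_one(tmp.data(), (int32_t) tmp.size()));

        llama_memory_clear(llama_get_memory(ctx), true);             // drop the dummy entry from the cache
        llama_set_warmup(ctx, false);
    }
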
e0dbec0bc6  llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181)
Author: Georgi Gerganov | Date: 2025-03-13 12:35:44 +02:00

    * llama : refactor llama_context, llama_kv_cache, llm_build_context
    * graph : don't mutate the KV cache during defrag
    * context : reduce virtuals + remove test function
    * context : move interface implementation to source file + factory
    * graph : move KV cache build functions to llama_context impl
    * graph : remove model reference from build_pooling
    * graph : remove llama_model reference
    * kv_cache : provide rope factors
    * graph : rework inputs to use only unique_ptr, remove attn input abstraction
    * context : remove llama_context_i abstraction
    * context : clean-up
    * graph : clean-up
    * llama : remove redundant keywords (struct, enum)
    * model : adapt gemma3
    * graph : restore same attention ops as on master
    * llama : remove TODO + fix indent

669912d9a5  tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034)
Author: Olivier Chafik | Date: 2025-03-05 13:05:13 +00:00

    * sampler: turn lazy grammar trigger words to regexes
    * add scripts/tool_bench.sh & .py
    * constrain llama json output regardless of function name if matches at beginning
    * update relaxed newline space rule in grammar tests
    * support add_generation_prompt query parameter (useful for /apply_template)
    * Update src/llama-grammar.cpp
      Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

c43a3e7996  llama : add Phi-4-mini support (supersedes #12099) (#12108)
Author: Xuan-Son Nguyen | Date: 2025-02-28 12:44:11 +01:00

    * Added Phi-4-mini-instruct support
    * Update regex per ngxson
    * Change the vocab base to Xenova/gpt-4o
    * fix conversion update script
    * no need to check longrope
    * minor style fix
    * fix python style

    Co-authored-by: Nicholas Sparks <nisparks@microsoft.com>

3e9a2860e9  llama : expose llama_model_n_head_kv in the API (#11997)
Author: Vitali Lovich | Date: 2025-02-25 11:29:33 +02:00

    It's useful to be able to get this from the library layer, as it is a key
    parameter of the model (e.g. to figure out how much KV cache memory is needed).

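A sketch of the use case named in the commit: a rough KV-cache size estimate from the exposed model accessors. It assumes an f16 K/V cache and approximates the per-head dimension as n_embd / n_head, which holds for most architectures:

    #include "llama.h"
    #include <cstdint>

    // Rough upper bound for the KV cache size of a given context length,
    // using only accessors available on llama_model.
    static uint64_t estimate_kv_bytes(const llama_model * model, uint32_t n_ctx) {
        const uint64_t n_layer   = llama_model_n_layer(model);
        const uint64_t n_head_kv = llama_model_n_head_kv(model);
        const uint64_t head_dim  = llama_model_n_embd(model) / llama_model_n_head(model); // approximation

        // 2x for K and V, 2 bytes per element for an f16 cache (assumption)
        return 2ull * n_layer * n_ctx * n_head_kv * head_dim * 2ull;
    }
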
68ff663a04  repo : update links to new url (#11886)
Author: Georgi Gerganov | Date: 2025-02-15 16:40:57 +02:00

    * repo : update links to new url
    * cont : more urls

27e8a23300  sampling: add Top-nσ sampler (#11223)
Author: Vinesh Janarthanan | Date: 2025-02-13 08:45:57 +02:00

    * initial sampling changes:
    * completed top nsigma sampler implementation
    * apply parameter to only llama-cli
    * updated readme
    * added tests and fixed nsigma impl
    * cleaned up pr
    * format
    * format
    * format
    * removed commented tests
    * cleanup pr and remove explicit floats
    * added top-k sampler to improve performance
    * changed sigma to float
    * fixed string format to float
    * Update src/llama-sampling.cpp
      Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    * Update common/sampling.cpp
      Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    * Update src/llama-sampling.cpp
      Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    * Update src/llama-sampling.cpp
      Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    * Update src/llama-sampling.cpp
      Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    * Update src/llama-sampling.cpp
      Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    * added llama_sampler_init

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

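The sampler plugs into the usual chain API. A sketch assuming the llama_sampler_init_top_n_sigma() entry point added for this sampler (the exact name is an assumption); it keeps only candidates whose logits lie within n standard deviations of the maximum before the final sampling step:

    #include "llama.h"

    // Chain with Top-nσ filtering before the final distribution sampler.
    // Assumes llama_sampler_init_top_n_sigma(float n) from this change.
    static llama_sampler * make_top_nsigma_chain(float n_sigma /* e.g. 1.0f */) {
        llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
        llama_sampler_chain_add(chain, llama_sampler_init_top_n_sigma(n_sigma));
        llama_sampler_chain_add(chain, llama_sampler_init_temp(0.8f));
        llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
        return chain;
    }
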
7ee953a64a  llama : add llama_sampler_init for safe usage of llama_sampler_free (#11727)
Author: Christian Fillion | Date: 2025-02-07 11:33:27 +02:00

    The C API in llama.h claims users can implement `llama_sampler_i` to create
    custom `llama_sampler`. The sampler chain takes ownership and calls
    `llama_sampler_free` on them. However, `llama_sampler_free` is hard-coded to
    use `delete`. This is undefined behavior if the object wasn't also allocated
    via `new` from libllama's C++ runtime. Callers in C and C-compatible languages
    do not use C++'s `new` operator. C++ callers may not be sharing the same heap
    as libllama.

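A sketch of a custom sampler built the safe way: the callbacks live in a llama_sampler_i table and the object is constructed with llama_sampler_init(), so llama_sampler_free() stays inside libllama's own allocator. It assumes the unused callbacks may be left null for a stateless sampler:

    #include "llama.h"
    #include <cmath>

    // A toy stateless sampler that masks out token id 0 by pushing its logit to -inf.
    static void ban_token0_apply(llama_sampler * /*smpl*/, llama_token_data_array * cur_p) {
        for (size_t i = 0; i < cur_p->size; ++i) {
            if (cur_p->data[i].id == 0) {
                cur_p->data[i].logit = -INFINITY;
            }
        }
    }

    static const char * ban_token0_name(const llama_sampler * /*smpl*/) { return "ban-token0"; }

    static const llama_sampler_i ban_token0_iface = {
        /*.name   =*/ ban_token0_name,
        /*.accept =*/ nullptr,          // assumption: unused callbacks may be null
        /*.apply  =*/ ban_token0_apply,
        /*.reset  =*/ nullptr,
        /*.clone  =*/ nullptr,
        /*.free   =*/ nullptr,
    };

    // Constructed by libllama itself, so a sampler chain can llama_sampler_free() it safely.
    static llama_sampler * make_ban_token0() {
        return llama_sampler_init(&ban_token0_iface, /*ctx=*/nullptr);
    }
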
8b576b6c55  Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunction, DeepSeek) w/ lazy grammars (#9639)
Author: Olivier Chafik | Date: 2025-01-30 19:13:58 +00:00

    Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

6171c9d258  Add Jinja template support (#11016)
Author: Olivier Chafik | Date: 2025-01-21 13:18:51 +00:00

    * Copy minja from 58f0ca6dd7
    * Add --jinja and --chat-template-file flags
    * Add missing <optional> include
    * Avoid print in get_hf_chat_template.py
    * No designated initializers yet
    * Try and work around msvc++ non-macro max resolution quirk
    * Update test_chat_completion.py
    * Wire LLM_KV_TOKENIZER_CHAT_TEMPLATE_N in llama_model_chat_template
    * Refactor test-chat-template
    * Test templates w/ minja
    * Fix deprecation
    * Add --jinja to llama-run
    * Update common_chat_format_example to use minja template wrapper
    * Test chat_template in e2e test
    * Update utils.py
    * Update test_chat_completion.py
    * Update run.cpp
    * Update arg.cpp
    * Refactor common_chat_* functions to accept minja template + use_jinja option
    * Attempt to fix linkage of LLAMA_CHATML_TEMPLATE
    * Revert LLAMA_CHATML_TEMPLATE refactor
    * Normalize newlines in test-chat-templates for windows tests
    * Forward decl minja::chat_template to avoid eager json dep
    * Flush stdout in chat template before potential crash
    * Fix copy elision warning
    * Rm unused optional include
    * Add missing optional include to server.cpp
    * Disable jinja test that has a cryptic windows failure
    * minja: fix vigogne (https://github.com/google/minja/pull/22)
    * Apply suggestions from code review
      Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
      Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    * Finish suggested renamings
    * Move chat_templates inside server_context + remove mutex
    * Update --chat-template-file w/ recent change to --chat-template
    * Refactor chat template validation
    * Guard against missing eos/bos tokens (null token otherwise throws in llama_vocab::impl::token_get_attr)
    * Warn against missing eos / bos tokens when jinja template references them
    * rename: common_chat_template[s]
    * reinstate assert on chat_templates.template_default
    * Update minja to b8437df626
    * Update minja to https://github.com/google/minja/pull/25
    * Update minja from https://github.com/google/minja/pull/27
    * rm unused optional header

    Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

667d72846c  rpc : early register backend devices (#11262)
Author: Radoslav Gerganov | Date: 2025-01-17 10:57:09 +02:00

    Early register RPC devices and do not propagate RPC specifics in the
    llama model structures.

    ref: #10609

960ec65273  llama : fix deprecation message: vocabable -> vocab (#11269)
Author: David Renshaw | Date: 2025-01-17 08:12:01 +01:00

681149ced2  llama : add llama_model_load_from_splits (#11255)
Author: Xuan Son Nguyen | Date: 2025-01-16 13:54:08 +01:00

    * llama : add `llama_model_load_from_splits`
    * update

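A minimal sketch of loading a model that was split into multiple GGUF shards with the new entry point (the file names are illustrative):

    #include "llama.h"

    // Load a sharded model by listing its split files explicitly,
    // instead of relying on the "-00001-of-0000N" naming convention.
    static llama_model * load_split_model() {
        const char * paths[] = {
            "model-00001-of-00002.gguf",   // illustrative file names
            "model-00002-of-00002.gguf",
        };
        llama_model_params mparams = llama_model_default_params();
        return llama_model_load_from_splits(paths, /*n_paths=*/2, mparams);
    }
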
08f10f69c3  llama : remove notion of CLS token (#11064)
Author: Georgi Gerganov | Date: 2025-01-12 12:15:53 +02:00

afa8a9ec9b  llama : add llama_vocab, functions -> methods, naming (#11110)
Author: Georgi Gerganov | Date: 2025-01-12 11:32:42 +02:00

    * llama : functions -> methods (#11110)
    * llama : add struct llama_vocab to the API (#11156)
    * hparams : move vocab params to llama_vocab (#11159)
    * vocab : more pimpl (#11165)
    * vocab : minor tokenization optimizations (#11160)
      Co-authored-by: Diego Devesa <slarengh@gmail.com>
    * lora : update API names (#11167)
    * llama : update API names to use correct prefix (#11174)
    * llama : update API names to use correct prefix
    * cont
    * cont
    * minor [no ci]
    * vocab : llama_vocab_add_[be]os -> llama_vocab_get_add_[be]os (#11174)
    * vocab : llama_vocab_n_vocab -> llama_vocab_n_tokens (#11174)

    Co-authored-by: Diego Devesa <slarengh@gmail.com>

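A short sketch of the resulting vocab-centric API: vocab queries and tokenization now go through a llama_vocab handle obtained from the model (error handling kept minimal):

    #include "llama.h"
    #include <string>
    #include <vector>

    // Tokenize a prompt through the llama_vocab handle introduced by this change.
    static std::vector<llama_token> tokenize(const llama_model * model, const std::string & text) {
        const llama_vocab * vocab = llama_model_get_vocab(model);

        const bool add_bos = llama_vocab_get_add_bos(vocab);
        std::vector<llama_token> toks(text.size() + 8); // rough upper bound for the sketch

        const int32_t n = llama_tokenize(vocab, text.c_str(), (int32_t) text.size(),
                                         toks.data(), (int32_t) toks.size(),
                                         /*add_special=*/add_bos, /*parse_special=*/true);
        toks.resize(n > 0 ? n : 0);
        return toks;
    }
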
47182dd03f  llama : update llama_model API names (#11063)
Author: Georgi Gerganov | Date: 2025-01-06 10:55:18 +02:00

    * llama : deprecate llama_free_model, add llama_model_free
    * llama : change `llama_load_model_from_file` -> `llama_model_load_from_file`

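A minimal sketch of the renamed pair introduced here (the old names remain as deprecated wrappers):

    #include "llama.h"

    // Load and free a model with the new names; llama_load_model_from_file /
    // llama_free_model still exist but are deprecated after this change.
    static void demo_load(const char * path) {
        llama_model_params mparams = llama_model_default_params();

        llama_model * model = llama_model_load_from_file(path, mparams);
        if (model == nullptr) {
            return; // load failed
        }
        // ... use the model ...
        llama_model_free(model);
    }
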
727368c60f  llama : use LLAMA_TOKEN_NULL (#11062)
Author: Georgi Gerganov | Date: 2025-01-06 10:52:15 +02:00

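The change replaces scattered hard-coded -1 sentinels with the LLAMA_TOKEN_NULL macro. A small usage sketch (that llama_vocab_eos returns the sentinel for models without an EOS token is an assumption used for illustration):

    #include "llama.h"
    #include <cstdio>

    // Check for "no such token" with the named sentinel instead of a bare -1.
    static void report_eos(const llama_vocab * vocab) {
        const llama_token eos = llama_vocab_eos(vocab);
        if (eos == LLAMA_TOKEN_NULL) {
            std::printf("model defines no EOS token\n");
        } else {
            std::printf("EOS token id: %d\n", eos);
        }
    }
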
9394bbd484  llama : Add support for DeepSeek V3 (#11049)
Author: fairydreaming | Date: 2025-01-04 21:06:11 +01:00

    * convert : extend DEEPSEEK2 model architecture to support DeepseekV3ForCausalLM by adding EXPERT_WEIGHTS_NORM and EXPERT_GATING_FUNC model parameters and FFN_EXP_PROBS_B tensor type
    * vocab : add DeepSeek V3 pre-tokenizer regexes
    * unicode : handle ACCENT_MARK and SYMBOL categories in regex
    * llama : add DeepSeek V3 chat template, handle new model parameters and tensor types

    Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

f66f582927  llama : refactor src/llama.cpp (#10902)
Author: Georgi Gerganov | Date: 2025-01-03 10:18:53 +02:00

    * llama : scatter llama.cpp into multiple modules (wip)
    * llama : control-vector -> adapter
    * llama : arch
    * llama : mmap
    * ci : remove BUILD_SHARED_LIBS=OFF
    * llama : arch (cont)
    * llama : chat
    * llama : model
    * llama : hparams
    * llama : adapter
    * examples : fix
    * rebase
    * minor
    * llama : kv cache
    * llama : impl
    * llama : batch
    * cont
    * llama : context
    * minor
    * llama : context (cont)
    * llama : model loader
    * common : update lora
    * llama : quant
    * llama : quant (cont)
    * minor [no ci]

0bf2d10c55  tts : add OuteTTS support (#10784)
Author: Georgi Gerganov | Date: 2024-12-18 19:27:21 +02:00

    * server : add "tokens" output
    * server : output embeddings for all tokens when pooling = none
    * server : be explicit about the pooling type in the tests
    * server : do not normalize embeddings when there is no pooling
    * llama : add OuteTTS support (wip)
    * wip
    * extract features
    * first conv
    * group norm
    * resnet conv
    * resnet
    * attn
    * pos net
    * layer norm
    * convnext
    * head
    * hann window
    * fix n_embd + remove llama.cpp hacks
    * compute hann window
    * fft
    * spectrum processing
    * clean-up
    * tts : receive input text and generate codes
    * clip : fix new conv name
    * tts : minor fix
    * tts : add header + minor fixes
    * tts : add mathematical constant
    * tts : fix sampling + cut initial noise
    * tts : fixes
    * tts : update default samplers
    * tts : text pre-processing
    * tts : outetts-voc -> wavtokenizer-dec
    * tts : remove hardcoded constants
    * tts : fix tensor shapes
    * llama : refactor wavtokenizer tensors
    * cont
    * cont [no ci]
    * llama : update WavTokenizer to non-causal attn
    * llama : handle no-vocab detokenization
    * tts : add Python example for OuteTTS (wip)
    * tts : extend python example to generate spectrogram
    * server : fix rebase artifacts
    * tts : enable "return_tokens" in Python example
    * tts : minor fixes
    * common : support HF download for vocoder

644fd71b44  sampling : refactor + optimize penalties sampler (#10803)
Author: Georgi Gerganov | Date: 2024-12-16 12:31:14 +02:00

    * sampling : refactor + optimize penalties sampler
    * common : apply ignore_eos as logit bias
    * batched : remove penalties sampler
    * params : allow penalty_last_n == -1 to be equal to context size
    * common : by default, move the penalties at the end of the sampling chain
    * common : ignore all EOG tokens
      Co-authored-by: Diego Devesa <slarengh@gmail.com>
    * common : move back the penalties at the front of the sampling chain
    * readme : restore hint about --ignore-eos flag [no ci]
    * llama : minor
    * webui : update

    Co-authored-by: Diego Devesa <slarengh@gmail.com>

ba1cb19cdd  llama : add Qwen2VL support + multimodal RoPE (#10361)
Author: HimariO | Date: 2024-12-14 14:43:46 +02:00

    * Barebone Qwen2VL LLM converter
    * Add Qwen2VL cli entrypoint
    * [WIP] add qwen2vl arch
    * Verify m-rope output
    * Add vl-rope/2d-rope support for qwen2vl ViT
    * update qwen2vl cli tool
    * update 5D tensor op workaround
    * [WIP] qwen2vl vision model
    * make batch and clip utils compatible with qwen2vl
    * [WIP] create inference workflow, gguf convert script but fix
    * correcting vision-rope behavior, add the missing last layer back to ViT
    * add arg parser to qwen2vl_surgery
    * replace variable size array with vector
    * cuda-gdb cmake preset
    * add fp32 mrope, vision rope kernel
    * add fp16 support for qwen2vl and m-rope
    * add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION`
    * fix rope op mode switching, outdated func args
    * update `llama_hparams`
    * update to keep up with upstream changes
    * resolve linter, test errors
    * add makefile entry, update special image padding token
    * add mrope unit test, fix few compiler warnings
    * rename `mrope` related function, params
    * minor updates on debug util, bug fixes
    * add `m-rope` testcase to `test-backend-ops`
    * Apply suggestions from code review
      Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    * fix trailing whitespace
    * store `llama_hparams.rope_sections` with fixed size array
    * update position id tensor size check in GGML_OP_ROPE
    * minor updates
    * update `ggml_backend_*_supports_op` of unsupported backends
    * remove old `rope_section` compare operator

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

484d2f31ae  bug-fix: snprintf prints NULL in place of the last character (#10419)
Author: kallewoof | Date: 2024-12-11 14:48:04 +01:00

    * bug-fix: snprintf prints NULL in place of the last character
      We need to give snprintf enough space to print the last character and the null
      terminator, so we allocate one extra byte and then ignore it when converting to
      std::string.
    * add comment about extra null-term byte requirement
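
The general pattern behind the fix, as a standalone sketch: snprintf writes at most size - 1 visible characters plus the terminating NUL, so a buffer that should hold n characters must be n + 1 bytes, and the extra byte is dropped when building the std::string:

    #include <cstdio>
    #include <string>
    #include <vector>

    // Format an int into a std::string without losing the last character.
    static std::string format_int(int value) {
        const int n = std::snprintf(nullptr, 0, "%d", value);   // length without the terminator
        std::vector<char> buf(n + 1);                           // +1 so the last character is not cut off
        std::snprintf(buf.data(), buf.size(), "%d", value);
        return std::string(buf.data(), n);                      // ignore the extra terminator byte
    }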