Georgi Gerganov | de2ef53a4b | 2025-05-25 16:34:36 +03:00
kv-cache : rework kv_cell (#13706)

* kv-cache : rework kv_cell
ggml-ci
* kv-cells : use "shift" instead of "delta" consistently
ggml-ci
* llama : add llama_max_parallel_sequences()
ggml-ci
* kv-cells : update comments [no ci]
* context : fail upon construction if sequences exceed max value
ggml-ci
* kv-cells : get_pos() -> pos_get() + comments
ggml-ci
* kv-cells : fix tracking of "used" cells
ggml-ci

Piotr Jasiukajtis | 4032ca4066 | 2025-05-25 10:29:43 +02:00
llama : add support for Qwen3 MoE tied word embeddings (#13768)

Olivier Chafik | f5cd27b71d | 2025-05-25 01:48:08 +01:00
server: streaming of tool calls and thoughts when --jinja is on (#12379)

* add common_json w/ support for truncated json healing
* add common_chat_msg_diff
* partial common_chat_parse
* refactor parser w/ optionals
* server: wire chat diffs in stream mode
* fix trigger of thinking models (must happen after thoughts are closed)
* fix functionary v3.2 raw python!
* rename: common_chat_syntax (now contains format)
* rm common_regex.at_start
* don't return empty <think></think>
* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)
* fix QwQ 32B tool call parsing after thoughts (hermes2)
* better logs for grammar triggers
* consume spaces after parse_json_tool_calls
* fix required tool calls w/ thinking models that have pre-opened thinking tags
* fix thinking model's initial trigger + test qwq's template
* run most test_tool_call tests in stream + non-stream modes
* make functionary v3.2 parsing more strict (differentiate first match from others)
* send final diff from server, to close off raw python arguments
* support partial content streaming in Generic mode
* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)
* Update function-calling.md
* Update tool_bench.py
* chat-parser: remove input from exception (llm output may contain PII)
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>

0cc4m | 259469c4b5 | 2025-05-24 16:49:12 +02:00
Move GLM4 f32 attention fix to the correct function (#13750)

Sigbjørn Skjæret | c3a2624339 | 2025-05-24 12:29:09 +02:00
vocab : fix ugm tokenizer precision (#13743)

Georgi Gerganov | d13d0f6135 | 2025-05-23 20:16:13 +03:00
hparams : initialize arrays (#13728)

ggml-ci

Xuan-Son Nguyen | 8a2afb7520 | 2025-05-23 17:07:04 +02:00
llama : allow custom list of swa_layers (#13726)

Georgi Gerganov | 8a1d206f1d | 2025-05-22 22:21:07 +03:00
tts : fix n_ubatch + make WavTokenizer cache-less (#13713)

ggml-ci

Georgi Gerganov | 8e186ef0e7 | 2025-05-21 20:00:49 +03:00
hparams : support models for which all layers use SWA (#13682)

ggml-ci

Georgi Gerganov | 797f2ac062 | 2025-05-21 15:11:13 +03:00
kv-cache : simplify the interface (#13660)

* kv-cache : simplify the interface
ggml-ci
* context : revert llama_batch_allocr position change
ggml-ci

Georgi Gerganov | b44890df2e | 2025-05-21 13:09:21 +03:00
model : disable SWA for Phi models (#13676)

* model : disable SWA for Phi models
ggml-ci
* model : update warning message
* model : print warning only if n_swa > 0
* model : fix typo

Georgi Gerganov | be0239693c | 2025-05-20 19:21:04 +03:00
model : fix llama4 graph (#13663)

ggml-ci

Georgi Gerganov | a4090d1174 | 2025-05-20 16:13:16 +03:00
llama : remove llama_kv_cache_view API + remove deprecated (#13653)

ggml-ci

0cc4m | c9c64dee57 | 2025-05-20 10:11:56 +02:00
Set GLM4 blk.*.attn_output.weight, kqv_out-* matmul to GGML_PREC_F32 to fix infinity values in output (#13639)

Georgi Gerganov | e298d2fbd0 | 2025-05-20 08:05:46 +03:00
kv-cache : add SWA support (#13194)

* kv-cache : prepare for SWA
ggml-ci
* kv-cache : initial iSWA implementation
ggml-ci
* kv-cache : rework error recovery logic
ggml-ci
* models : fix Phi-3 SWA parameters
ggml-ci
* model : adjust Granite to rope factor changes
ggml-ci
* server : check if context can do shifts
ggml-ci
* iswa : for now, always enable shifts (experiment)
ggml-ci
* kv-cache : simplify SWA logic
ggml-ci
* kv-cache : apply defrag when we fail to find slots for the batch
ggml-ci
* llama : update docs about llama_decode
ggml-ci
* kv-cache : update warning logs when no space for the batch is available
ggml-ci
* llama : add llama_kv_self_seq_pos_min()
* kv-cache : keep track of partial SWA computes and print warnings
* server : disallow use cases involving partial SWA context
ggml-ci
* llama : add param to control SWA cache size
ggml-ci
* minor : clean-up
ggml-ci

Diego Devesa | 5364ae4ba5 | 2025-05-16 16:38:07 +02:00
llama : print hint when loading a model when no backends are loaded (#13589)

Diego Devesa | c6a2c9e741 | 2025-05-15 19:13:11 +02:00
gguf : use ggml log system (#13571)

* gguf : use ggml log system
* llama : remove unnecessary new lines in exception messages

Georgi Gerganov | e3a9421b78 | 2025-05-14 23:15:15 +03:00
kv-cache : fix out-of-bounds view during reserve graph (#13547)

* kv-cache : fix reserve graph out-of-bounds access
ggml-ci
* cont : add comment
* cont : fix comments [no ci]
* cont : more correct comment [no ci]

Sigbjørn Skjæret | f5170c1d7a | 2025-05-14 21:22:49 +03:00
editorconfig : fix trailing whitespace from #13542 (#13546)

Gilad S. | 017f10b5fa | 2025-05-14 19:18:18 +03:00
fix: crash when calling llama_state_get_size on a context without a KV cache (#13542)

Diego Devesa | b7d2672082 | 2025-05-14 16:12:36 +02:00
llama : fix quantize with dl backends (#13539)

Gabe Goodhart | 5e7d95e22e | 2025-05-14 15:53:59 +03:00
fix: Move build_inp_pos to the top of the graph section for build_granite (#13538)

This matches how others do it, but will still avoid the extra
initialization when rope is disabled.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Ed Addario | e5c834f718 | 2025-05-13 19:12:31 +02:00
quantize : improve tensor-type pattern matching (#13033)

Gabe Goodhart | d590cd4c24 | 2025-05-13 15:12:01 +02:00
model : Granite MoE shared (#13269)

* feat: Add GGUF conversion for granitemoeshared
* feat: hparam and arch plumbing for granitemoeshared
* fix: Split MoE fused tensors for shared experts in conversion
* feat: First WIP cut at model arch in cpp
  The hparam and architecture plumbing should be correct, but the
  implementation of the shared experts seems to still be broken.
* fix: Cleaner (maybe more correct?) splitting for gate/up
* fix: Fix the input to the shared experts
  I had misread that the shared experts take the inputs _before_ the standard
  MoE layer and was feeding the output of the MoE to the shared experts.
* fix: Avoid architecture-specific checks for Granite MoE Shared
  This is a cleaner way that will allow more flexibility in architecture
  strings going forward.
* refactor: Split granite architectures out of llm_build_llama
  This helps de-clutter the llama-family graph construction and allows
  granite to diverge further (in preparation for Granite 4).
  NOTE: I removed the granite scale factors from llm_build_deci because they
  appear to only be there as copy-paste from llm_build_llama. The HF config
  does not seem to set those values:
  https://huggingface.co/Deci/DeciLM-7B/blob/main/config.json
* fix: Fix compiler warning about uninitialized inp_pos
  This should not have been reachable, but it warns on some compilers
* fix: Consolidate GraniteMoEShared into GraniteMoE for conversion
* fix: Consolidate GraniteMoEShared into GraniteMoE on the c++ side
---------
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

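The key fix in this commit is where the shared experts read their input: they operate on the same hidden state as the routed experts, not on the routed experts' output. The sketch below is a conceptual illustration of that data flow only, in plain C++ with placeholder helpers; the real implementation builds ggml graph nodes in the granite graph builder instead.

```cpp
// Conceptual sketch, NOT the real ggml graph code. It illustrates the data
// flow fixed above: the routed MoE block and the shared experts both read the
// same layer input, and their outputs are summed. The earlier bug fed the
// routed MoE *output* into the shared experts instead.
#include <cstdio>
#include <vector>

using Tensor = std::vector<float>;   // stand-in for a ggml tensor

// Placeholder helpers (hypothetical names); in llama.cpp these would be
// built as graph nodes for the top-k routed experts and the always-active
// shared-expert FFN.
static Tensor routed_moe_forward   (const Tensor & x) { return x; }
static Tensor shared_expert_forward(const Tensor & x) { return x; }

static Tensor granite_moe_shared_block(const Tensor & h_in) {
    const Tensor routed = routed_moe_forward   (h_in);  // routed experts see h_in
    const Tensor shared = shared_expert_forward(h_in);  // shared experts ALSO see h_in, not `routed`

    Tensor out(h_in.size());
    for (size_t i = 0; i < out.size(); ++i) {
        out[i] = routed[i] + shared[i];                  // contributions are summed
    }
    return out;                                          // residual add happens outside this block
}

int main() {
    const Tensor h(8, 1.0f);
    const Tensor o = granite_moe_shared_block(h);
    printf("out[0] = %f\n", o[0]);
    return 0;
}
```
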
						 
				 
			
				
					
						
							
							
Johannes Gäßler | 10d2af0eaa | 2025-05-12 14:44:49 +02:00
llama/ggml: add LLM training support (#10544)

* llama/ggml: add LLM training support
  more compact progress bar
  llama_save_model_to_file
  llama_opt_param_filter
  ggml_graph_dup force_grads
  refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period

Georgi Gerganov | 064cc596ac | 2025-05-12 15:12:27 +03:00
context : fix state io for memory-less contexts (#13470)

ggml-ci

David Huang | 7f323a589f | 2025-05-11 14:18:39 +02:00
Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B (#13386)

Sigbjørn Skjæret | d2a4ef05c6 | 2025-05-10 22:08:07 +02:00
vocab : add ByteDance-Seed/Seed-Coder (#13423)

Johannes Gäßler | 0cf6725e9f | 2025-05-09 13:34:58 +02:00
CUDA: FA support for Deepseek (Ampere or newer) (#13306)

* CUDA: FA support for Deepseek (Ampere or newer)
* do loop unrolling via C++ template

Diego Devesa | 27ebfcacba | 2025-05-09 13:02:07 +02:00
llama : do not crash if there is no CPU backend (#13395)

* llama : do not crash if there is no CPU backend
* add checks to examples

Xuan-Son Nguyen | 3f96aeff39 | 2025-05-09 11:17:51 +02:00
llama : one-off chat template fix for Mistral-Small-2503 (#13398)

* llama : one-off chat template fix for Mistral-Small-2503
* update readme
* add mistral-v7-tekken

Georgi Gerganov | 6562e5a4d6 | 2025-05-08 14:28:33 +03:00
context : allow cache-less context for embeddings (#13108)

* context : allow cache-less context for embeddings
ggml-ci
* context : enable reranking with encode()
ggml-ci
* context : encode() clears embd_seq
ggml-ci
* examples : use llama_encode() when appropriate
ggml-ci
* models : nomic bert moe does not require KV cache
* llama : update comments for llama_decode/llama_encode
ggml-ci
* context : update warning log [no ci]

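In practice this means an embeddings workload can run on a context created without a KV cache, driven through llama_encode(). A minimal sketch of that usage follows (C++ over the llama.h C API); it assumes `tokens` already holds the tokenized prompt, the model path is illustrative, and the parameter choices are not prescriptive.

```cpp
// Sketch: embeddings-only (cache-less) context via llama_encode().
#include <cstdio>
#include <vector>
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("model.gguf", mparams); // path is illustrative
    if (!model) { return 1; }

    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings   = true;                     // extract embeddings
    cparams.pooling_type = LLAMA_POOLING_TYPE_MEAN;  // pooled, per-sequence embedding

    llama_context * ctx = llama_init_from_model(model, cparams);

    std::vector<llama_token> tokens = { /* tokenized prompt goes here */ };
    llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());
    // A production caller would typically build the batch explicitly with
    // per-token output flags, as the embedding example in the repo does.

    if (llama_encode(ctx, batch) != 0) {             // no KV cache involved for embedding models
        fprintf(stderr, "encode failed\n");
        return 1;
    }

    const float * emb = llama_get_embeddings_seq(ctx, 0);   // pooled embedding for seq 0
    const int n_embd  = llama_model_n_embd(model);
    printf("embedding dim: %d, first value: %f\n", n_embd, emb ? emb[0] : 0.0f);

    llama_free(ctx);
    llama_model_free(model);
    return 0;
}
```
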
						 
				 
			
				
					
						
							
							
Georgi Gerganov | 51fb96b1ff | 2025-05-08 14:26:50 +03:00
context : remove logits_all flag (#13284)

* context : remove logits_all flag
ggml-ci
* llama : remove logits_all flag + reorder llama_context_params
ggml-ci

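With logits_all removed from llama_context_params, callers request outputs per token through the batch instead. A short sketch of marking only the last prompt token for logits (llama.h C API from C++); treat the exact setup as illustrative rather than canonical.

```cpp
// Sketch: request logits only for the last token of a prompt, instead of the
// removed logits_all flag. Assumes `ctx` is an initialized llama_context and
// `tokens` holds the prompt.
#include <vector>
#include "llama.h"

static int decode_prompt(llama_context * ctx, const std::vector<llama_token> & tokens) {
    const int32_t n = (int32_t) tokens.size();
    llama_batch batch = llama_batch_init(n, /*embd=*/0, /*n_seq_max=*/1);

    for (int32_t i = 0; i < n; ++i) {
        batch.token   [i]    = tokens[i];
        batch.pos     [i]    = i;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = 0;
        batch.logits  [i]    = (i == n - 1);   // output only for the last token
    }
    batch.n_tokens = n;

    const int ret = llama_decode(ctx, batch);
    if (ret == 0) {
        // logits for the last token; earlier positions were not requested
        const float * logits = llama_get_logits_ith(ctx, n - 1);
        (void) logits;
    }

    llama_batch_free(batch);
    return ret;
}
```
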
						 
				 
			
				
					
						
							
							
Diego Devesa | f061021206 | 2025-05-08 13:15:15 +02:00
llama : print size and type of overridden tensors (#13364)

Sigbjørn Skjæret | bc4e1128f7 | 2025-05-07 12:49:27 +02:00
llama : deci : support ffn-free with attention (#13296)

piDack | 6c7fd67b64 | 2025-05-07 09:23:11 +02:00
llama : support tie embedding for chatglm models (#13328)

DocShotgun | ffc727203a | 2025-05-06 22:36:24 +02:00
sampling : make top_n_sigma no-op at <=0 or a single candidate (#13345)

oobabooga | 91a86a6f35 | 2025-05-06 20:24:15 +02:00
sampling : don't consider -infinity values in top_n_sigma (#13344)

Xuan-Son Nguyen | 2f54e348ad | 2025-05-06 14:25:40 +02:00
llama : fix build_ffn without gate (#13336)

* llama : fix build_ffn without gate
* fix build on windows
* Revert "fix build on windows"
  This reverts commit fc420d3c7e

oobabooga | 233461f812 | 2025-05-05 22:12:19 +02:00
sampling : Integrate Top-nσ into main sampling chain (and add it to the server) (#13264)

* sampling: add Top-nσ sampler to `llama-server` and sampler ordering
* revert: sampler ordering
* revert: VS' crappy auto-formatting
* revert: VS' crappy auto-formatting pt.2
* revert: my crappy eye sight...
* sampling: add XTC to Top-nσ sampler chain
* sampling: add Dyna. Temp. to Top-nσ sampler chain
* sampling: actually remove Top-nσ from sampler(oops)
* Integrate top_n_sigma into main sampler chain
* Define COMMON_SAMPLER_TYPE_TOP_N_SIGMA
* Formatting
* Lint
* Exit early in the sampler if nsigma < 0
---------
Co-authored-by: CasualAutopsy <casual_autopsy@outlook.com>

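With Top-nσ now part of the regular sampling chain, it can be combined with the other samplers through the public sampler API. A small sketch, assuming the llama_sampler_init_top_n_sigma() constructor exposed in llama.h; the chain order and values here are only illustrative.

```cpp
// Sketch: build a sampler chain that includes Top-n-sigma, reflecting its
// integration into the main sampling chain. Values are illustrative.
#include "llama.h"

static llama_sampler * make_chain() {
    llama_sampler_chain_params sparams = llama_sampler_chain_default_params();
    llama_sampler * chain = llama_sampler_chain_init(sparams);

    llama_sampler_chain_add(chain, llama_sampler_init_top_n_sigma(1.0f));        // keep tokens within n*sigma of the max logit
    llama_sampler_chain_add(chain, llama_sampler_init_temp(0.8f));               // temperature
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED)); // final sampling step

    return chain; // caller frees with llama_sampler_free()
}
```

The sampled token would then come from llama_sampler_sample(chain, ctx, -1), as with any other sampler chain.
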
						 
				 
			
				
					
						
							
							
ymcki | 3bf785f3ef | 2025-05-03 17:39:51 +02:00
llama : Llama-3_1-Nemotron-Ultra-253B-v1 support (#12843)

Georgi Gerganov | a75cb30dc9 | 2025-05-02 20:54:13 +03:00
context : fix reorder logic (#13267)

ggml-ci

Jared Van Bortel | 2f567611c0 | 2025-05-02 11:42:30 -04:00
llama-model : support Qwen2 embedding models and pooling_mode_lasttoken (#13245)

Georgi Gerganov | c642bc014c | 2025-05-02 17:48:36 +03:00
kv-cache : separate recurrent vs non-recurrent impl (#12799)

* kv-cache : separate recurrent vs non-recurrent impl (wip)
ggml-ci
* kv-cache : init -> constructor + add llama_memory_params
ggml-ci
* kv-cache : fix callback reference
ggml-ci
* context : llama_kv_cache -> llama_memory_i
ggml-ci
* context : move memory creation logic to model
ggml-ci
* llama : remove reference of memory during encode
ggml-ci
* kv-cache : hide padding details in the implementation
ggml-ci
* kv-cache : add ubatch_next()
ggml-ci
* context : simplify sbatch logic
ggml-ci
* kv-cache : hide defrag logic in the implementation
ggml-ci
* context : hide kv cache details in implementation
ggml-ci
* build : fix
ggml-ci
* cont : another fix
ggml-ci
* kv-cache : simplify interface (wip)
ggml-ci
* kv-cache : use separate KV cell structs for unified/recurrent
ggml-ci
* kv-cache : clean-up
ggml-ci
* model : better llama_model::create_model() signature
ggml-ci
* kv-cache : fix recurrent seq_rm()
ggml-ci
* kv-cache : replace `struct callbacks` with `llama_model &`
ggml-ci
* kv-cache : replace `struct graph_params` with `llama_context &`
ggml-ci
* kv-cache : fix offload check
ggml-ci
* context : avoid passing unique_ptr
ggml-ci
* kv-cache : avoid using the backends from the llama_context
  ref #13113
ggml-ci
* kv-cache : more consistent debug logs [no ci]
* kv-cache : do not pass the full llama_context for kv graphs
ggml-ci
* kv-cache : remove comment
* kv-cache : ggml_rope_ext_inplace -> ggml_rope_ext
ggml-ci
* kv-cache : fix recurrent multi-user case
ggml-ci
* memory : remove comments [no ci]

Sigbjørn Skjæret | cb06a3c363 | 2025-05-02 12:44:24 +02:00
llama : orion rope type is neox (#13261)

Sigbjørn Skjæret | 626083faf7 | 2025-05-02 12:40:56 +02:00
llama : plamo rope type is neox (#13260)

piDack | 2af6880178 | 2025-05-02 11:06:09 +02:00
llama-chat : reset glmedge chat template (#13253)

* reset glmedge chat template
* fix glmedge chat template

matteo | e0f572c846 | 2025-05-01 21:16:38 +02:00
llama-chat : update GLM4 chat template (#13238)

* update GLM4 chat template
* Update chat template
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

Jared Van Bortel | a70183eb00 | 2025-05-01 10:09:41 +03:00
llama-model : fix the reported size class for nomic-embed-text-v2-moe (#13223)

ddh0 | 16a457facd | 2025-04-30 21:28:43 +01:00
fix typo: n_ctx_pre_seq -> n_ctx_per_seq (#13221)