833d03c25d  convert : for FP8, use scale type to decide auto type
Author: Francis Couture-Harpin
Date:   2025-09-09 14:36:34 -04:00
						 
				 
			
				
					
						
							
							
2499e47cfd  gguf-py : allow previewing reflinked size on non-Linux platforms
Author: Francis Couture-Harpin
Date:   2025-09-09 14:36:34 -04:00
						 
				 
			
				
					
						
							
							
8ef4136b20  convert : remove unused field ModelTensorInfo.src_qtype
Author: Francis Couture-Harpin
Date:   2025-09-09 14:36:34 -04:00
						 
				 
			
				
					
						
							
							
be600e2622  convert : more robust default ftype detection
Author: Francis Couture-Harpin
Date:   2025-09-09 14:36:34 -04:00
						 
				 
			
				
					
						
							
							
fb879b40c0  convert : use F32 operations on Mamba A_log
Author: Francis Couture-Harpin
Date:   2025-09-09 14:36:34 -04:00

This matches the previous behavior for BF16 tensors.
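
For context: Mamba-style conversions derive the state-space decay matrix from A_log (typically as -exp(A_log)), so doing the exponentiation in F32 gives the same result whether the source tensor is BF16, F16, or F32. A minimal sketch of that upcast, assuming a NumPy array named a_log (illustrative, not the converter's actual code):

```python
import numpy as np

def a_from_a_log(a_log: np.ndarray) -> np.ndarray:
    # Upcast to F32 before exponentiating so BF16/F16 sources match F32 sources.
    return -np.exp(a_log.astype(np.float32))
```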
						 
				 
			
				
					
						
							
							
6792f66a93  convert : detect filesystem block size for reflinks
Author: Francis Couture-Harpin
Date:   2025-09-09 14:36:34 -04:00

* convert : use direct copies when possible, using os.copy_file_range where
  available and falling back to shutil.copyfileobj otherwise
* gguf : handle misaligned offset more cleanly
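
A rough sketch of the direct-copy fallback chain described above, for the whole-file case (the helper name is illustrative; the converter also copies sub-ranges so that regions which cannot be reflinked, such as misaligned ones, are still handled):

```python
import os
import shutil

def copy_data(src_path: str, dst_path: str) -> None:
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        if hasattr(os, "copy_file_range"):
            # os.copy_file_range (Linux, Python 3.8+) copies inside the kernel,
            # so the data never round-trips through a userspace buffer.
            size = os.fstat(src.fileno()).st_size
            copied = 0
            while copied < size:
                n = os.copy_file_range(src.fileno(), dst.fileno(), size - copied)
                if n == 0:
                    break
                copied += n
        else:
            # Portable fallback: buffered copy in userspace.
            shutil.copyfileobj(src, dst)
```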
						 
				 
			
				
					
						
							
							
7724bf9e4f  convert : fix reflinks for stacked MoE tensors
Author: Francis Couture-Harpin
Date:   2025-09-09 14:36:34 -04:00
						 
				 
			
				
					
						
							
							
f7394cdaf4  convert : use reflinks for faster conversion
Author: Francis Couture-Harpin
Date:   2025-09-09 14:36:32 -04:00
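
For context on reflinks: on copy-on-write filesystems (e.g. Btrfs, XFS) a file can be cloned so that source and destination share the same on-disk extents until one side is modified, which makes writing tensors that are copied unchanged nearly free. A minimal whole-file sketch using the Linux FICLONE ioctl; the conversion path clones byte ranges instead (FICLONERANGE), which is why block alignment matters in the commits above:

```python
import fcntl

FICLONE = 0x40049409  # _IOW(0x94, 9, int), see ioctl_ficlone(2)

def reflink(src_path: str, dst_path: str) -> None:
    # Clone src into dst; raises OSError on filesystems without reflink support.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
```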
						 
				 
			
				
					
						
							
							
e582f1ac63  convert : fix no-lazy dtypes from direct safetensors
Author: Francis Couture-Harpin
Date:   2025-09-09 14:33:01 -04:00
						 
				 
			
				
					
						
							
							
0edc189842  gguf-py : order safetensors tensors by name
Author: Francis Couture-Harpin
Date:   2025-09-09 14:33:01 -04:00

Applies to both local and remote safetensors custom parsing.
This matches the behavior of the official safetensors implementation.
* convert : rename from_safetensors_meta to from_local_tensor, for consistency
  with from_remote_tensor
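
For context, the safetensors container is simple enough to parse without the official library: the file starts with a little-endian uint64 giving the size of a JSON header, and the header maps tensor names to their dtype, shape, and byte offsets into the data that follows. A rough sketch of reading that header and walking tensors in name order, as described above (the helper name is illustrative):

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    with open(path, "rb") as f:
        (header_size,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_size))
    header.pop("__metadata__", None)  # optional metadata block, not a tensor
    return header

# Iterate tensors sorted by name, matching the official implementation's order.
for name, info in sorted(read_safetensors_header("model.safetensors").items()):
    begin, end = info["data_offsets"]  # relative to the end of the header
    print(name, info["dtype"], info["shape"], end - begin, "bytes")
```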
						 
				 
			
				
					
						
							
							
ca8f736fe4  convert : parse safetensors directly
Author: Francis Couture-Harpin
Date:   2025-09-09 14:33:01 -04:00
						 
				 
			
				
					
						
							
							
0d5cfed596  Merge branch 'master' into compilade/convert-prequant
Author: Francis Couture-Harpin
Date:   2025-09-09 14:23:06 -04:00
						 
				 
			
				
					
						
							
							
233d773d02  convert : force setting sliding_window from original config (#15867)
Author: Daniel Bevenius
Date:   2025-09-08 09:44:34 +02:00

This commit modifies the set_gguf_parameters method for EmbeddingGemma so that
it reads the sliding_window parameter from the original model config.json and
uses that value.

The motivation for this change is that the Gemma3TextConfig constructor adjusts
the sliding_window value, which can lead to inconsistencies when converting
models, as we expect this value to match the original model's configuration.

Refs: bb45d3631e/src/transformers/models/gemma3/configuration_gemma3.py (L230)
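
A sketch of the idea, assuming the conversion has the model directory at hand (names are illustrative): read sliding_window from the checkpoint's own config.json rather than from a transformers config object, since Gemma3TextConfig rewrites the value in its constructor:

```python
import json
from pathlib import Path

def original_sliding_window(model_dir: str) -> int | None:
    # Use the raw value shipped with the checkpoint, bypassing any adjustment
    # made by the transformers config class.
    with open(Path(model_dir) / "config.json", encoding="utf-8") as f:
        return json.load(f).get("sliding_window")
```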
						 
				 
			
				
					
						
							
							
fb15d649ed  llama : add support for EmbeddingGemma 300m (#15798)
Author: Daniel Bevenius
Date:   2025-09-04 18:10:29 +02:00

This commit adds support for EmbeddingGemma 300m. This model supports sliding
window attention (SWA), and a new swq_type is introduced to support symmetric
SWA masking.

This commit also extracts the code from the function llama_is_masked_swa in
llama-impl.h, so that the logic can be shared by both
llm_graph_input_attn_no_cache::set_input and llama_kv_cache::set_input_kq_mask.

With this commit the EmbeddingGemma 300m model can be converted to GGUF and
used with llama.cpp. Once the model has been uploaded to HuggingFace it can be
used like this:

```console
./build/bin/llama-cli -hf ggml-org/embeddinggemma-300m-GGUF:Q8_0
```
						 
				 
			
				
					
						
							
							
4b20d8b7e3  convert : remove redundant code (#15708)
Author: Jie Fu (傅杰)
Date:   2025-09-01 23:53:31 +08:00

Signed-off-by: Jie Fu <jiefu@tencent.com>
						 
				 
			
				
					
						
							
							
adec43d774  Merge branch 'master' into compilade/convert-prequant
Author: Francis Couture-Harpin
Date:   2025-09-01 10:13:29 -04:00
						 
				 
			
				
					
						
							
							
e8d99dd0b6  nvidia nemotron nano v2 (nemotronh) (#15507)
Author: Gabe Goodhart
Date:   2025-08-28 18:39:31 -06:00

* feat: Add NEMOTRONH to python arch enum
* feat: Add NEMOTRONH to c++ arch enum
* feat: Add NEMOTRONH to llama-arch layer map
* feat: First pass at conversion for nemotronh
* feat: Add a verbose log for each tensor loaded
  This is really helpful for diagnosing mismatches between the expected and
  received tensors.
* feat: First (broken) pass at nemotronh model architecture
  It generates tokens, just not valid ones!
* fix: Explicitly enable add_bos_token during conversion
  The `tokenizer.json`/`tokenizer_config.json` in the model are a bit
  contradictory. In the config, add_bos_token is set to False, but the
  tokenizer model itself has a post_processor that adds the BOS token via
  type: TemplateProcessing.
* fix: Use relu2 (LLM_FFN_RELU_SQR) for activation in FFN layers
* fix: Only allocate attention cache for attention layers (not non-recurrent)
* fix: Move residual add to after every block
* fix: Use the correct norm tensor for the MLP blocks
* Nemotron-H: MLP gate cleanup (pass NULL for unused gate)
  This model does not use a gate in MLP blocks; pass NULLs for gate tensors to
  make intent clear and avoid unused-pointer noise.
* SSM: respect ssm_dt_rank for dt_dim when provided
  Use GGUF-provided time_step_rank (ssm_dt_rank) to set dt_dim when > 0;
  fallback to max(64, n_embd/16).
* fix: plamo2 - revert dt_dim to default (remove ssm_dt_rank usage)
* Rename nemotronh to nemotron_h for consistency
  - Update architecture name from NEMOTRONH to NEMOTRON_H in constants.py
  - Change architecture string from 'nemotronh' to 'nemotron_h' in all files
  - Update enum LLM_ARCH_NEMOTRONH to LLM_ARCH_NEMOTRON_H
  - Update class name llm_build_nemotronh to llm_build_nemotron_h
  - Consistent naming with underscore convention (nemotron_h vs nemotronh)
* feat: Support conversion for older NemotronH models

Refs: https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Maicon Domingues <dominguesm@outlook.com>
Co-authored-by: weatherman <fxdstudios@gmail.com>
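
The relu2 activation mentioned above (LLM_FFN_RELU_SQR in llama.cpp) is simply ReLU followed by squaring; for reference, assuming NumPy:

```python
import numpy as np

def relu_squared(x: np.ndarray) -> np.ndarray:
    # relu2(x) = max(x, 0)^2
    return np.square(np.maximum(x, 0.0))
```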
						 
				 
			
				
					
						
							
							
84ab83cc0b  model : jina-embeddings-v3 support (#13693)
Author: Sigbjørn Skjæret
Date:   2025-08-28 15:49:50 +02:00

* initial jina-embeddings-v3 support
* fix vocab parsing with only tokenizer.json
* set mask token lstrip attribute
* additional unk_token_id fallback just in case [no ci]
* revert vocab_size() change [no ci]
* merge tensor loading into general bert
* rope
* add lora embedding and loading (non-functional)
* export separate lora ggufs instead
* add adapter metadata api
* use std::string
* convert_hf_to_lora compatibility
* fix assert
* apply suggestions from review
						 
				 
			
				
					
						
							
							
79a546220c  mtmd : support Kimi VL model (#15458)
Author: Xuan-Son Nguyen
Date:   2025-08-26 12:54:19 +02:00

* convert : fix tensor naming conflict for llama 4 vision
* convert ok
* support kimi vision model
* clean up
* fix style
* fix calc number of output tokens
* refactor resize_position_embeddings
* add test case
* rename build fn
* correct a small bug
						 
				 
			
				
					
						
							
							
0d5a470223  convert : update Ernie 4.5 dense architecture name (#15555)
Author: Weizhao Ouyang
Date:   2025-08-25 11:15:06 +02:00

Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
						 
				 
			
				
					
						
							
							
7da9fed0d6  convert : support interns1-mini (#15412)
Author: RunningLeon
Date:   2025-08-25 08:32:16 +02:00

* support interns1-mini
* fix comment
* update
						 
				 
			
				
					
						
							
							
b1afcab804  model : add support for Seed-OSS (#15490)
Author: Piotr Wilkin (ilintar)
Date:   2025-08-23 15:21:52 +02:00

* First draft
* Fix linter errors
* Added missing sinks nullptr
* Don't forget the llama-arch!
* We're through to the generation stage.
* Fix post-attention norm
* Apply suggestions from code review
* Fix RoPE type
* Fix tensor name and reorder llm_types
* Update gguf-py/gguf/constants.py: remove nonexistent FFN_POST_NORM tensor
* Update src/llama-model.h
* Add basic chat template
* Add chat template tests
* Remake chat template test
* Update src/llama-chat.cpp
* Reorder llm type descriptions
* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
						 
				 
			
				
					
						
							
							
b2caf67db1  convert : make Mistral community chat templates optional via parameter (#15420)
Author: Julien Denize
Date:   2025-08-21 11:19:50 +02:00

* Make Mistral community chat templates optional
* Change the flag arg to disable instead of enable community chat templates
* Improve error message
* Improve help message
* Tone down the logger messages
						 
				 
			
				
					
						
							
							
899398277d  convert : fix conversion from FP8 for Deepseek-V3.1-Base
Author: Francis Couture-Harpin
Date:   2025-08-19 17:27:59 -04:00
						 
				 
			
				
					
						
							
							
4d196981d4  convert : force patch_embd weights to F16 or F32 to avoid broken GGUFs (#15367)
Author: Sigbjørn Skjæret
Date:   2025-08-17 14:47:42 +02:00

* force patch_embd weights to f32
* use MmprojModel base tensor_force_quant instead
						 
				 
			
				
					
						
							
							
65349f26f2  model : support vision LiquidAI LFM2-VL family (#15347)
Author: Tarek Dakhran
Date:   2025-08-16 23:33:54 +02:00

* wip lfm2 vision model
* Fix conv weight
* Implement dynamic resolution
* Fix cuda
* support LFM2-VL-450M
* happy CI
* Remove extra `ggml_conv` and put others into the right place

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
						 
				 
			
				
					
						
							
							
1ae6ab7601  Merge branch 'master' into compilade/convert-prequant
Author: Francis Couture-Harpin
Date:   2025-08-14 17:05:21 -04:00
						 
				 
			
				
					
						
							
							
50e81bdf5d  convert : fix merge conflicts (#15229)
Author: Sigbjørn Skjæret
Date:   2025-08-11 11:15:44 +02:00
						 
				 
			
				
					
						
							
							
a3a7874272  convert : improve Mistral models integration (#14737)
Author: Julien Denize
Date:   2025-08-11 10:07:49 +02:00

* Improve Mistral models integration with llama.cpp
* Revert changes and fix gguf
* Revert change
* refactor convert_mistral_to_gguf.py in convert_hf_to_gguf.py
* Revert collateral
* Rename model name
* refactor
* revert
* remove duplicate
* Remove duplication code
* Fixes
* Fix flake issues
* Apply comments
* Fix remote
* add default chat template
* Revert
* nit
						 
				 
			
				
					
						
							
							
50aa938901  convert : support non-mxfp4 HF model (#15153)
Author: Xuan-Son Nguyen
Date:   2025-08-07 23:26:03 +02:00

* convert : support non-mxfp4 HF model
* rm redundant check
* disable debug check
						 
				 
			
				
					
						
							
							
99acbc9921  llama : Support intern-s1 (#14875)
Author: RunningLeon
Date:   2025-08-07 18:20:40 +02:00

* support internvl
* support interns1
* resolve comments
* put interns1 in tensor mapping
* resolve comment
* move tokenizer changes to sub class
						 
				 
			
				
					
						
							
							
fd1234cb46  llama : add gpt-oss (#15091)
Author: Georgi Gerganov
Date:   2025-08-05 22:10:36 +03:00

* oai moe
* compat with new checkpoint
* add attn sink impl
* add rope scaling yarn
* logits match with latest transformers code
* wip chat template
* rm trailing space
* use ggml_scale_bias
* rm redundant is_swa_all
* convert interleaved gate_up
* graph : fix activation function to match reference (#7)
* vocab : handle o200k_harmony special tokens
* ggml : add attention sinks support (#1)
* llama : add attn sinks
* ggml : add attn sinks
* cuda : add attn sinks
* vulkan : add support for sinks in softmax, remove unnecessary return
* ggml : add fused swiglu_oai op (#11)
* Update ggml/src/ggml-cpu/ops.cpp
* update CUDA impl
* cont : metal impl
* add vulkan impl
* test-backend-ops : more test cases, clean up
* llama : remove unfused impl
* remove extra lines
* repack mxfp4 upon conversion
* clean up a bit
* enable thinking
* add quick hack to render only some special tokens
* fix bf16 conversion
* remove vocab hack
* webui ok
* support chat parsing for gpt-oss
* fix webui
* direct mapping mxfp4, FINALLY
* force using mxfp4
* properly use lazy tensor
* ggml : add mxfp4
  ggml : use e8m0 conversion instead of powf
  change kvalues_mxfp4 table to match e2m1 (#6)
  metal : remove quantization for now (not used)
  cuda : fix disabled CUDA graphs due to ffn moe bias
  vulkan : add support for mxfp4
  cont : add cm2 dequant
* ggml : add ggml_add_id (#13)
* add cuda impl
* llama : add weight support check for add_id
* perf opt
* add vulkan impl
* rename cuda files
* add metal impl
* allow in-place ggml_add_id
* llama : keep biases on CPU with --cpu-moe
* llama : fix compile error
* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw
* cleanup
* sycl : fix supports_op for MXFP4
* fix Unknown reasoning format
* ggml-cpu : fix AVX build
* fix hip build
* cuda : add mxfp4 dequantization support for cuBLAS
* ggml-cpu : fix mxfp4 fallback definitions for some architectures
* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
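
On the "e8m0 conversion instead of powf" item above: an E8M0 scale stores only a biased exponent (value = 2^(e - 127), no sign or mantissa bits), so for a normal positive float the code can be taken straight from the float32 exponent field rather than computed with log/pow. A small illustration of that bit-level view (not the ggml implementation itself):

```python
import struct

def e8m0_from_float(scale: float) -> int:
    # Biased exponent field of the float32 representation (normal, positive scales).
    (bits,) = struct.unpack("<I", struct.pack("<f", scale))
    return (bits >> 23) & 0xFF

def float_from_e8m0(e: int) -> float:
    # Rebuild a float32 with that exponent and a zero mantissa, i.e. 2**(e - 127).
    (value,) = struct.unpack("<f", struct.pack("<I", e << 23))
    return value

assert float_from_e8m0(e8m0_from_float(0.25)) == 0.25  # 0.25 == 2**-2, so e == 125
```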
						 
				 
			
				
					
						
							
							
ef0144c087  model: support GLM 4.5 family of models (#14939)
Author: Sam
Date:   2025-08-04 20:29:25 +02:00

* model: Add GLM 4.5 (#14921)
* Merge in PR suggestions
* model: Add GLM 4.5 family of models (#14921)
  1. Updated tensor_mapping.py with NextN tensor mappings
     - Added proper tensor mappings for all NextN/MTP tensors in /Users/samm/git/llama.cpp/gguf-py/gguf/tensor_mapping.py
     - Added mappings for: eh_proj, embed_tokens, enorm, hnorm, shared_head.head, shared_head.norm
  2. Added num_nextn_predict_layers configuration
     - Added LLM_KV_NUM_NEXTN_PREDICT_LAYERS constant to llama-arch.h and llama-arch.cpp
     - Added num_nextn_predict_layers field to llama_hparams struct
     - Updated GLM4_MOE parameter loading in llama-model.cpp to read this parameter
     - Modified tensor loading logic to conditionally load NextN tensors based on num_nextn_predict_layers
     - Added GGUF writer support in gguf_writer.py with add_num_nextn_predict_layers() method
     - Updated conversion script to extract and write this parameter from HuggingFace config
  3. Added FIM tokens for GLM4_MOE
     - Added GLM-4.5's FIM tokens to llama-vocab.cpp:
       - <|code_prefix|> for FIM_PRE
       - <|code_suffix|> for FIM_SUF
       - <|code_middle|> for FIM_MID
  4. Removed manual NextN tensor handling
     - Removed the special-case handling in convert_hf_to_gguf.py that manually mapped NextN tensors
     - NextN tensors are now handled automatically through the proper tensor mapping system
* glm 4.5 update tensors names
* model: glm 4.5 apply suggestions from code review
* Update src/llama-model.cpp
* Apply suggestions from code review
* patch broken chat template
* typings fix
* add TENSOR_SKIP flag
* Update src/llama-model-loader.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
						 
				 
			
				
					
						
							
							
97366dc6ab  vocab : JetBrains Mellum pre-tokenizer (#15045)
Author: Csaba Kecskemeti
Date:   2025-08-03 21:38:18 +02:00
						 
				 
			
				
					
						
							
							
83bc2f288c  model : add text-only support for Kimi-VL (and find special tokens in text_config) (#15051)
Author: Gabriel Larson
Date:   2025-08-03 16:56:25 +02:00

* basic kimi-vl textmodel conversion
* check config["text_config"] for special tokens
						 
				 
			
				
					
						
							
							
711d5e6fe6  convert : fix Qwen3-Embedding pre-tokenizer hash (#15030)
Author: Douglas Hanley
Date:   2025-08-02 12:51:02 +02:00
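
For context on these pre-tokenizer hashes: convert_hf_to_gguf.py runs a fixed check string through the model's tokenizer and hashes the resulting token IDs, then looks the digest up in a table to select the matching pre-tokenizer. A simplified sketch of the fingerprinting step, assuming a Hugging Face tokenizer (the real check string and hash table live in the convert script):

```python
import hashlib

from transformers import AutoTokenizer

def pre_tokenizer_fingerprint(model_dir: str, check_text: str) -> str:
    # The SHA-256 of the token IDs acts as a fingerprint of the tokenizer's
    # pre-tokenization behaviour for the given check string.
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    token_ids = tokenizer.encode(check_text)
    return hashlib.sha256(str(token_ids).encode()).hexdigest()
```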
						 
				 
			
				
					
						
							
							
339bd0268c  model : support Qwen3-Embedding (#15023)
Author: Douglas Hanley
Date:   2025-08-02 10:44:50 +02:00
						 
				 
			
				
					
						
							
							
0f5ccd6fd1  model : add hunyuan dense (#14878)
Author: stevenkuang
Date:   2025-08-01 15:31:12 +02:00

* support hunyuan_v1_dense
* update hunyuan_moe to hunyuan_v1_moe
* fix rope alpha assert and bos token
* add blank line
* Revert "update hunyuan_moe to hunyuan_v1_moe"
  This reverts commit aa973ca219.
* fix hunyuan_moe chat template
* remove leftover code
* update hunyuan dense chat template
* fix hunyuan dense vocab and chat template

Signed-off-by: stevenkuang <stevenkuang@tencent.com>
						 
				 
			
				
					
						
							
							
8a4a856277  Add LLaDA 8b Diffusion model (#14771)
Author: Aman Gupta
Date:   2025-07-31 19:49:09 +08:00

* Add support for Llada-8b: diffusion model
* Add README
* Fix README and convert_hf_to_gguf
* convert_hf_to_gguf.py: address review comments
* Make everything in a single example
* Remove model-specific sampling
* Remove unused argmax
* Remove braced initializers, improve README.md a bit
* Add diffusion specific gguf params in set_vocab, remove setting rope_theta and rms_norm_eps
* Remove adding the mask token
* Move add_add_bos_token to set_vocab
* use add_bool in gguf_writer.py
						 
				 
			
				
					
						
							
							
00fa15fedc  mtmd : add support for Voxtral (#14862)
Author: Xuan-Son Nguyen
Date:   2025-07-28 15:01:48 +02:00

* mtmd : add support for Voxtral
* clean up
* fix python requirements
* add [BEGIN_AUDIO] token
* also support Devstral conversion
* add docs and tests
* fix regression for ultravox
* minor coding style improvement
* correct project activation fn
* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
						 
				 
			
				
					
						
							
							
6c6e397aff  model : add support for SmallThinker series (#14898)
Author: Dongliang Wei
Date:   2025-07-28 13:47:00 +02:00

* support smallthinker
* support 20b softmax, 4b no sliding window
* new build_moe_ffn_from_probs, and can run 4b
* fix 4b rope bug
* fix python type check
* remove is_moe judge
* remove set_dense_start_swa_pattern function and modify set_swa_pattern function
* trim trailing whitespace
* remove get_vocab_base of SmallThinkerModel in convert_hf_to_gguf.py
* better whitespace
* use GGML_ASSERT for expert count validation
* Improve null pointer check for probs
* use template parameter for SWA attention logic
* move the creation of inp_out_ids before the layer loop
* remove redundant judge for probs

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
						 
				 
			
				
					
						
							
							
1dc9614e06  llama : fix kq_scale for the attention layers of PLaMo2 (#14892)
Author: Shunta Saito
Date:   2025-07-27 09:38:44 +02:00

* Fix dimensions for expand
* Change dimensions to copy states to cache
* Fix the default value for plamo2 conversion
* Fix scale given to build_attn
* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
						 
				 
			
				
					
						
							
							
a12363bbf0  convert : text-only support for GLM-4.1V-9B-Thinking (#14823)
Author: jacekpoplawski
Date:   2025-07-23 23:23:57 +02:00

* use language_model part only, ignore visual layers
* fix rope_dim calculation
						 
				 
			
				
					
						
							
							
de12f8ac50  convert : begin handling pre-quantized models
Author: Francis Couture-Harpin
Date:   2025-07-22 04:11:34 -04:00
						 
				 
			
				
					
						
							
							
e0cb5c5cb8  model : add EXAONE 4.0 support (#14630)
Author: lgai-exaone
Date:   2025-07-18 10:45:49 +02:00
						 
				 
			
				
					
						
							
							
670e1360cd  convert : fix Ernie4.5 MoE without shared experts (#14746)
Author: Piotr Wilkin (ilintar)
Date:   2025-07-18 01:17:16 +02:00
						 
				 
			
				
					
						
							
							
cb887f1bc1  model: add Ernie 4.5 MoE support (#14658)
Author: Piotr Wilkin (ilintar)
Date:   2025-07-17 23:15:32 +02:00

* Add Ernie4.5 MoE
* Fix Flake errors.
* Properly encode/decode MoE layer step
* Correct tensor mappings (.weight)
* Pass and read n_ff_exp
* n_ff_shexp calculation and further minor changes
* Rope fixes.
* .gitignore fix
* Add unit32 cast for Linux builds
* Apply suggestions from code review
* Further fixes from code review
* Fix trailing whitespace
* Reenable missing experts error
* Code style from code review
* Fix non-MoE regression

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
						 
				 
			
				
					
						
							
							
ab14019821  Support diffusion models: Add Dream 7B (#14644)
Author: Aman Gupta
Date:   2025-07-16 20:03:51 +08:00

* Support diffusion models: Add Dream 7B
* Move diffusion to examples
* Move stuff to examples. Add patch to not use kv-cache
* Address review comments
* Make sampling fast
* llama: remove diffusion functions
* Add basic timings + cleanup
* More cleanup
* Review comments: better formatting, use LOG instead std::cerr, re-use batch, use ubatch instead of max_length
* fixup!
* Review: move everything to diffusion-cli for now
						 
				 
			
				
					
						
							
							
cf91f217f1  convert : add pre-computed hashes first to prevent order mishaps (#14701)
Author: Sigbjørn Skjæret
Date:   2025-07-16 08:51:12 +02:00
						 
				 
			
				
					
						
							
							
4a4f426944  model : add Kimi-K2 support (#14654)
Author: Gabriel Larson
Date:   2025-07-15 21:54:22 +02:00

* Kimi-K2 conversion
* add Kimi_K2 pre type
* Kimi-K2 unicode
* LLAMA_MAX_EXPERTS 384
* fix vocab iteration
* regex space fix
* add kimi-k2 to pre_computed_hashes
* Updated with kimi-k2 get_vocab_base_pre hash
* fix whitespaces
* fix flake errors
* remove more unicode.cpp whitespaces
* change set_vocab() flow
* add moonshotai-Kimi-K2.jinja to /models/templates/
* update moonshotai-Kimi-K2.jinja
* add kimi-k2 chat template
* update NotImplementedError
* except Exception
* LLM_CHAT_TEMPLATE_KIMI_K2 if(add_ass){}

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>