model: support GLM 4.5 family of models (#14939)

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-11-05 09:36:52 +00:00

* model: Add GLM 4.5 (#14921)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Merge in PR suggestions

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model: Add GLM 4.5 family of models (#14921)

1. Updated tensor_mapping.py with NextN tensor mappings

- Added proper tensor mappings for all NextN/MTP tensors in gguf-py/gguf/tensor_mapping.py
- Added mappings for: eh_proj, embed_tokens, enorm, hnorm, shared_head.head, shared_head.norm

2. Added num_nextn_predict_layers configuration

- Added LLM_KV_NUM_NEXTN_PREDICT_LAYERS constant to llama-arch.h and llama-arch.cpp
- Added num_nextn_predict_layers field to llama_hparams struct
- Updated GLM4_MOE parameter loading in llama-model.cpp to read this parameter
- Modified tensor loading logic to conditionally load NextN tensors based on num_nextn_predict_layers
- Added GGUF writer support in gguf_writer.py with add_num_nextn_predict_layers() method
- Updated conversion script to extract and write this parameter from HuggingFace config
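
A minimal sketch of the "read the parameter, then load conditionally" idea from item 2, not the actual llama.cpp loader code. The `kv_metadata` type, the `KEY_NEXTN` key string, and both helper functions are hypothetical names introduced for illustration; the sketch also assumes that a missing key should be treated as zero NextN layers.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for GGUF key/value metadata; the real loader has its own API.
using kv_metadata = std::map<std::string, uint32_t>;

// Hypothetical key name, used for illustration only.
static const char * const KEY_NEXTN = "glm4moe.nextn_predict_layers";

// Read an optional u32 value, falling back to a default when the key is absent.
inline uint32_t get_u32_or(const kv_metadata & kv, const std::string & key, uint32_t fallback) {
    const auto it = kv.find(key);
    return it == kv.end() ? fallback : it->second;
}

// Assumption: NextN tensors only need loading when the model declares
// at least one NextN prediction layer.
inline bool should_load_nextn(const kv_metadata & kv) {
    return get_u32_or(kv, KEY_NEXTN, 0) > 0;
}
```

Treating the parameter as optional with a default of zero keeps files that lack the key loadable under this sketch; whether the real loader does the same is not stated in the commit message.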

3. Added FIM tokens for GLM4_MOE

- Added GLM-4.5's FIM tokens to llama-vocab.cpp:
  - <|code_prefix|> for FIM_PRE
  - <|code_suffix|> for FIM_SUF
  - <|code_middle|> for FIM_MID
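
The token-to-role mapping listed in item 3 can be sketched as a small lookup. The `fim_role` enum and `fim_role_of` function are hypothetical names for illustration and do not mirror how llama-vocab.cpp stores its special tokens; only the three token strings come from the commit message.

```cpp
#include <string>

// Hypothetical roles mirroring FIM_PRE / FIM_SUF / FIM_MID from the commit message.
enum class fim_role { none, pre, suf, mid };

// Map a token's text to its fill-in-the-middle role, or `none` for any other token.
inline fim_role fim_role_of(const std::string & tok) {
    if (tok == "<|code_prefix|>") return fim_role::pre;
    if (tok == "<|code_suffix|>") return fim_role::suf;
    if (tok == "<|code_middle|>") return fim_role::mid;
    return fim_role::none;
}
```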

4. Removed manual NextN tensor handling

- Removed the special-case handling in convert_hf_to_gguf.py that manually mapped NextN tensors
- NextN tensors are now handled automatically through the proper tensor mapping system

* glm 4.5: update tensor names

* model: glm 4.5 apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model: glm 4.5 apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model: glm 4.5 apply suggestions from code review

* Apply suggestions from code review

* patch broken chat template

* typings fix

* add TENSOR_SKIP flag


Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Update src/llama-model-loader.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>

Author: Sam
Date: 2025-08-05 04:29:25 +10:00
Committed by: GitHub
Parent: 2721257e3e
Commit: ef0144c087

15 changed files with 594 additions and 8 deletions

									
src/llama-kv-cache-unified.cpp (4 additions)
@@ -39,6 +39,10 @@ llama_kv_cache_unified::llama_kv_cache_unified(
     if (model.arch == LLM_ARCH_GEMMA3N) {
         n_layer_cache = 20;
     }
+    if (model.arch == LLM_ARCH_GLM4_MOE) {
+        // GLM-4.5: Only process up to last layer, skip final NextN layer
+        n_layer_cache = hparams.n_layer - hparams.nextn_predict_layers;
+    }

     // create a context for each buffer type
     std::map<ggml_backend_buffer_type_t, ggml_context *> ctx_map;
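
The arithmetic in this hunk can be checked in isolation with a small sketch. The `hparams_sketch` struct and `kv_cache_layers` function are hypothetical stand-ins, not the real `llama_hparams` type; the sketch assumes, as the diff comment suggests, that the NextN layers sit at the end of the layer stack and need no KV cache entries.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reduced view of the hyperparameters used by the hunk above.
struct hparams_sketch {
    uint32_t n_layer;               // total layers, including NextN layers
    uint32_t nextn_predict_layers;  // trailing NextN/MTP layers
};

// Number of layers that get KV cache storage: every layer except the NextN ones.
inline uint32_t kv_cache_layers(const hparams_sketch & hp) {
    // Guard against unsigned underflow if the metadata is inconsistent.
    assert(hp.nextn_predict_layers <= hp.n_layer);
    return hp.n_layer - hp.nextn_predict_layers;
}
```

For example, with illustrative values of 47 total layers and 1 NextN layer, `kv_cache_layers` returns 46. The underflow guard is an addition of this sketch; the hunk itself performs the subtraction directly.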
