36f2215e4c  add test_pad_ext to test-backend-ops.cpp  (leejet, 2025-08-31 18:32:24 +08:00)

21e933806f  cuda: get_rows: dfloat2 -> float2  (leejet, 2025-08-31 12:10:01 +08:00)

8f5e7b0ce6  remove unused variables  (leejet, 2025-08-31 12:02:23 +08:00)

b4c50bec23  fix test_im2col_3d  (leejet, 2025-08-31 12:01:23 +08:00)

e66bf6e503  cpu: im2col_3d support non continuous src  (leejet, 2025-08-31 11:58:32 +08:00)
    Co-authored-by: Jeff Bolz <jbolz@nvidia.com>

3f901e316b  test-backend-ops.cpp: remove trailing whitespace  (leejet, 2025-08-31 00:55:34 +08:00)

aafa79ae03  add test_im2col_3d to test-backend-ops  (leejet, 2025-08-31 00:51:05 +08:00)

0d5eb51252  cuda: use simpler loop in get_rows  (leejet, 2025-08-31 00:21:24 +08:00)

131ae2d585  adjust the code style  (leejet, 2025-08-31 00:04:27 +08:00)

c9b9fabe08  fix cpu im2col_3d  (leejet, 2025-08-30 11:25:07 +08:00)

f6278c832f  cuda: remove unnecessary MIN define  (leejet, 2025-08-30 04:14:19 +08:00)

f6a874c04a  avoid build failure on MacOS  (leejet, 2025-08-30 03:53:03 +08:00)

d11a729898  avoid build failure  (leejet, 2025-08-30 03:48:47 +08:00)

9d035c4c4a  correct GGML_OP_COUNT assertion  (leejet, 2025-08-30 03:36:59 +08:00)

df05913bc4  avoid ggml_conv_3d conflict  (leejet, 2025-08-30 03:28:07 +08:00)

d30e07dbb3  fix cuda get_rows  (leejet, 2025-08-30 03:13:57 +08:00)

d8377a0a37  gguf: support loading tensors which n_dims > GGML_MAX_DIMS  (leejet, 2025-08-30 03:11:09 +08:00)
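Note: ggml tensors carry at most GGML_MAX_DIMS (4) dimensions, so loading a GGUF tensor with more dimensions means folding the surplus away while preserving the element count. A minimal sketch of one plausible folding scheme (an illustration of the idea, not necessarily what this commit does):

```cpp
// Sketch: collapse a GGUF shape with n_dims > 4 into a 4-D ggml shape by
// folding the extra trailing dimensions into the last one. Illustrative only.
#include <algorithm>
#include <cstdint>
#include <vector>

static std::vector<int64_t> fold_dims(const std::vector<int64_t> & ne) {
    const size_t max_dims = 4; // GGML_MAX_DIMS
    std::vector<int64_t> out(ne.begin(), ne.begin() + std::min(ne.size(), max_dims));
    for (size_t i = max_dims; i < ne.size(); ++i) {
        out.back() *= ne[i]; // surplus dims merge into the last ggml dim
    }
    return out; // same total element count, at most 4 dims
}
```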
dd745ba31f  make im2col_3d faster  (leejet, 2025-08-30 03:11:09 +08:00)

ae47caca70  fix cuda pad/scale/im2col3d  (leejet, 2025-08-30 03:11:08 +08:00)

85c8e1e519  cuda: make im2col a little faster  (leejet, 2025-08-30 03:11:08 +08:00)

f7a12f9e69  cuda/cpu: add im2col_3d support  (leejet, 2025-08-30 03:11:08 +08:00)
93c7e775b8  add ggml_pad_ext for cpu & cuda backend  (leejet, 2025-08-30 02:56:56 +08:00)
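Note: plain ggml_pad only appends padding at the end of each dimension; the _ext variant adds explicit leading and trailing pad counts. A minimal usage sketch, assuming a (leading, trailing) pair per dimension in the argument list; check ggml.h for the authoritative signature:

```cpp
// Sketch: zero-pad a tensor on both sides of dims 0 and 1, assuming
// ggml_pad_ext(ctx, a, lp0, rp0, lp1, rp1, lp2, rp2, lp3, rp3), where
// lpN/rpN are leading/trailing pad counts for dimension N.
#include "ggml.h"

static struct ggml_tensor * pad_both_sides(struct ggml_context * ctx,
                                           struct ggml_tensor  * a) {
    return ggml_pad_ext(ctx, a,
                        /*lp0*/ 1, /*rp0*/ 1,  // 1 element before and after dim 0
                        /*lp1*/ 2, /*rp1*/ 2,  // 2 elements before and after dim 1
                        /*lp2*/ 0, /*rp2*/ 0,
                        /*lp3*/ 0, /*rp3*/ 0);
}
```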
c92f9b4a68  add conv3d support  (leejet, 2025-08-30 02:56:56 +08:00)
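Note: as a reference for the semantics the new op provides, here is a scalar single-channel, stride-1, unpadded 3-D convolution; in ggml the op is presumably lowered to im2col_3d plus a matrix multiplication, mirroring how ggml_conv_2d has traditionally worked (this sketch is not the backend code):

```cpp
// dst[z][y][x] = sum over the kernel of src[z+kz][y+ky][x+kx] * ker[kz][ky][kx]
#include <cstdint>

static void conv3d_ref(const float * src, int64_t D, int64_t H, int64_t W,
                       const float * ker, int64_t KD, int64_t KH, int64_t KW,
                       float * dst) {
    const int64_t OD = D - KD + 1, OH = H - KH + 1, OW = W - KW + 1;
    for (int64_t z = 0; z < OD; ++z)
    for (int64_t y = 0; y < OH; ++y)
    for (int64_t x = 0; x < OW; ++x) {
        float acc = 0.0f;
        for (int64_t kz = 0; kz < KD; ++kz)
        for (int64_t ky = 0; ky < KH; ++ky)
        for (int64_t kx = 0; kx < KW; ++kx) {
            acc += src[((z + kz)*H + (y + ky))*W + (x + kx)]
                 * ker[(kz*KH + ky)*KW + kx];
        }
        dst[(z*OH + y)*OW + x] = acc;
    }
}
```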
792b44f2ed  server : add documentation for parallel_tool_calls param (#15647)  (ExtReMLapin, 2025-08-29 20:25:40 +03:00)
    Co-authored-by: Pierre F <no@p.e>

81017865ee  CUDA: fix bug in rms_norm fusion (#15660)  (Aman Gupta, 2025-08-29 21:30:06 +08:00) [b6318]
    * CUDA: fix bug in rms_norm fusion
    * Fix bug for OP_REPEAT
    * Fix index for add

60e5eee31f  chat : Seed OSS thinking + tool call support (#15552)  (Piotr Wilkin (ilintar), 2025-08-29 14:53:41 +02:00) [b6317]
    * Reasoning and tool-calling support for Seed OSS
    * Fix grammar and partial parsing
    * Whitespace
    * New chat template
    * Update common/chat.cpp (review suggestions)
    * Remove unused 'purge_healing_marker' helper
    Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

009b709d6e  CUDA: fuse adds, fuse add with rms norm (#15631)  (Aman Gupta, 2025-08-29 11:35:58 +08:00) [b6316]
    * CUDA: fused add with rms_norm_mul
    * Non-broadcast fuse works
    * Add fused adds
    * format
    * Remove n_fuse from template params
    * Address review comments
    * Move template inside binbcast
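Note: what the fusion computes, written out as an unfused scalar reference: rms_norm(a + b) * w over each row, done in one pass over the data instead of three separate ops (the eps value and names here are illustrative):

```cpp
// Reference for rms_norm(a + b) * w over one row of n floats; a fused kernel
// produces the same result in a single pass.
#include <cmath>

static void add_rms_norm_mul_row(const float * a, const float * b,
                                 const float * w, float * dst,
                                 int n, float eps) {
    float sumsq = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float s = a[i] + b[i]; // the fused "add"
        dst[i] = s;                  // stash the sum
        sumsq += s * s;
    }
    const float scale = 1.0f / std::sqrt(sumsq / n + eps); // RMS normalizer
    for (int i = 0; i < n; ++i) {
        dst[i] *= scale * w[i];      // normalize, then apply the weight
    }
}
```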
e8d99dd0b6  nvidia nemotron nano v2 (nemotronh) (#15507)  (Gabe Goodhart, 2025-08-28 18:39:31 -06:00) [b6315]
    * feat: Add NEMOTRONH to python arch enum
    * feat: Add NEMOTRONH to c++ arch enum
    * feat: Add NEMOTRONH to llama-arch layer map
    * feat: First pass at conversion for nemotronh
    * feat: Add a verbose log for each tensor loaded; really helpful for diagnosing mismatches between the expected and received tensors
    * feat: First (broken) pass at nemotronh model architecture; it generates tokens, just not valid ones!
    * fix: Explicitly enable add_bos_token during conversion. The tokenizer.json/tokenizer_config.json in the model are a bit contradictory: in the config, add_bos_token is set to False, but the tokenizer model itself has a post_processor that adds the BOS token via type: TemplateProcessing
    * fix: Use relu2 (LLM_FFN_RELU_SQR) for activation in FFN layers
    * fix: Only allocate attention cache for attention layers (not non-recurrent)
    * fix: Move residual add to after every block
    * fix: Use the correct norm tensor for the MLP blocks
    * Nemotron-H: MLP gate cleanup (pass NULL for unused gate); this model does not use a gate in MLP blocks, so pass NULLs for gate tensors to make intent clear and avoid unused-pointer noise
    * SSM: respect ssm_dt_rank for dt_dim when provided; use GGUF-provided time_step_rank (ssm_dt_rank) to set dt_dim when > 0, falling back to max(64, n_embd/16)
    * fix: plamo2 - revert dt_dim to default (remove ssm_dt_rank usage)
    * Rename nemotronh to nemotron_h for consistency: NEMOTRONH -> NEMOTRON_H in constants.py, architecture string 'nemotronh' -> 'nemotron_h' in all files, enum LLM_ARCH_NEMOTRONH -> LLM_ARCH_NEMOTRON_H, class llm_build_nemotronh -> llm_build_nemotron_h (underscore convention)
    * feat: Support conversion for older NemotronH models
    Refs: https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
    Branch: gabe-l-hart/nvidia-nemotron-nano-15409
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
    Co-authored-by: Maicon Domingues <dominguesm@outlook.com>
    Co-authored-by: weatherman <fxdstudios@gmail.com>
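Note: the ssm_dt_rank handling described above reduces to a one-line selection; a sketch with assumed variable names:

```cpp
#include <algorithm>
#include <cstdint>

// Use the GGUF-provided time_step_rank (ssm_dt_rank) when set (> 0),
// otherwise fall back to the max(64, n_embd/16) heuristic.
static int32_t pick_dt_dim(int32_t ssm_dt_rank, int32_t n_embd) {
    return ssm_dt_rank > 0 ? ssm_dt_rank : std::max(64, n_embd / 16);
}
```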
a8bca68f72  fix: Compute the full sum in llama-eval-callback, not just the sum of printed values (#15637)  (Gabe Goodhart, 2025-08-28 15:27:36 -05:00) [b6314]
    This makes it much easier to compare between llama.cpp and transformers!
    Refs: https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
    Branch: gabe-l-hart/nvidia-nemotron-nano-15409
    Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
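Note: the distinction is that the callback previously only accumulated the few elements it printed. A sketch of summing a whole tensor, assuming a contiguous GGML_TYPE_F32 tensor (the real callback also handles other types and layouts):

```cpp
#include "ggml.h"

// Sum every element of a tensor, not just the ones that get printed.
static double tensor_full_sum_f32(const struct ggml_tensor * t) {
    const float * data = (const float *) t->data;
    const int64_t n    = ggml_nelements(t);
    double sum = 0.0;
    for (int64_t i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}
```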
c97dc09391  CUDA: add conv2d (#15635)  (mnehete32, 2025-08-28 20:33:03 +02:00) [b6313]
    * CUDA: add conv2d
    * CUDA: conv2d - correct formatting and added const
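Note: the graph-level op here is the long-standing ggml_conv_2d; this commit gives the CUDA backend its own kernel for it. A minimal sketch of building the node:

```cpp
#include "ggml.h"

// Build a 2-D convolution node with stride 1, dilation 1, and "same"-style
// padding for odd kernel sizes. Kernel k is [KW, KH, IC, OC]; input x is
// [W, H, IC, N].
static struct ggml_tensor * conv2d_stride1_same(struct ggml_context * ctx,
                                                struct ggml_tensor  * k,
                                                struct ggml_tensor  * x) {
    return ggml_conv_2d(ctx, k, x,
                        /*s0*/ 1, /*s1*/ 1,
                        /*p0*/ (int) k->ne[0] / 2, /*p1*/ (int) k->ne[1] / 2,
                        /*d0*/ 1, /*d1*/ 1);
}
```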
6c442f42ff  ggml-cpu: fix invalid hsum build in debug s390x (#15634)  (Aaron Teo, 2025-08-28 22:39:27 +08:00) [b6312]
    Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

73804145ab  ggml : fix SSM_SCAN for n_groups > 1 (#15625)  (compilade, 2025-08-28 10:11:36 -04:00) [b6311]

c8d0d14e77  kv-cache : fix find_slot to not search for continuous slot (#15638)  (Georgi Gerganov, 2025-08-28 17:09:05 +03:00) [b6310]

84ab83cc0b  model : jina-embeddings-v3 support (#13693)  (Sigbjørn Skjæret, 2025-08-28 15:49:50 +02:00) [b6309]
    * initial jina-embeddings-v3 support
    * fix vocab parsing with only tokenizer.json
    * set mask token lstrip attribute
    * additional unk_token_id fallback just in case [no ci]
    * revert vocab_size() change [no ci]
    * merge tensor loading into general bert
    * rope
    * add lora embedding and loading (non-functional)
    * export separate lora ggufs instead
    * add adapter metadata api
    * use std::string
    * convert_hf_to_lora compatibility
    * fix assert
    * apply suggestions from review

55042b3692  scripts: add sqlite3 check for compare-commits.sh (#15633)  (Aman Gupta, 2025-08-28 19:23:22 +08:00)

8a4280ce43  kv-cache : remove LLAMA_SET_ROWS checks (#15505)  (Georgi Gerganov, 2025-08-28 12:27:02 +03:00) [b6307]
64387f6e95  gguf-py: byteswapping improvements (#12851)  (Aleksei Nikiforov, 2025-08-28 16:56:41 +08:00)
    * gguf-py: implement byteswapping for Q4_0; this is needed to byteswap the Mistral model. Also restore original shapes after byteswapping tensors; not needed at the moment, but done in case they are used in the future.
    * Rework byteswapping code in gguf-py; move the details out of the tensor-block byteswapping code.
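Note: for context on why Q4_0 needs dedicated byteswapping, a C-style sketch of the block layout and the swap (the actual change is Python code in gguf-py; this struct mirrors ggml's block_q4_0 for illustration):

```cpp
#include <cstddef>
#include <cstdint>

// A Q4_0 block: one f16 scale plus 32 4-bit quants packed into 16 bytes.
// Only the scale is multi-byte; the packed nibbles are endianness-neutral.
struct block_q4_0_sketch {
    uint16_t d;      // f16 scale, stored here as raw bits
    uint8_t  qs[16]; // QK4_0/2 packed 4-bit values
};

static void byteswap_q4_0(block_q4_0_sketch * blocks, size_t n_blocks) {
    for (size_t i = 0; i < n_blocks; ++i) {
        const uint16_t d = blocks[i].d;
        blocks[i].d = (uint16_t) ((d >> 8) | (d << 8)); // swap the scale bytes
    }
}
```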
d35a1e8c41  cli : change log to warning to explain reason for stopping (#15604)  (Joshua Cogliati, 2025-08-28 10:48:20 +03:00) [b6305]
    * Change to warn instead of debug, to explain reason for stopping.
    * Update tools/main/main.cpp: fix printing --2
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

46d9caa27a  model-conversion : add mmproj conversion target (#15628)  (Daniel Bevenius, 2025-08-28 09:26:48 +02:00)
    This commit adds a new target to the Makefile for converting models that are multimodal. The target converts the original model and also creates the mmproj GGUF model. The motivation is that for multimodal models, for example those that contain a vision encoder, we often want to upload both the quantized model and the vision encoder model to HuggingFace.
    Example usage:
    ```console
    $ make causal-convert-mm-model MODEL_PATH=~/work/ai/models/gemma-3-4b-it-qat-q4_0-unquantized/
    ...
    The environment variable CONVERTED_MODEL can be set to this path using:
    export CONVERTED_MODEL=/home/danbev/work/ai/llama.cpp/models/gemma-3-4b-it-qat-q4_0-unquantized.gguf
    The mmproj model was created in /home/danbev/work/ai/llama.cpp/models/mmproj-gemma-3-4b-it-qat-q4_0-unquantized.gguf
    ```
    The converted original model can then be quantized, and after that both the quantized model and the mmproj file can be uploaded to HuggingFace.
    Refs: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF/tree/main

5a0e3ef6f0  cuda: Add cublasLt_static linking when GGML_STATIC is enabled (#15622)  (matiaslin, 2025-08-28 02:32:36 +02:00) [b6303]
    Prior to this change, we faced undefined cublasLt references when attempting to compile 'llama-cli' with GGML_STATIC=ON on Linux. We now link CUDA::cublasLt_static when the CUDA version is greater than 10.1.

fbef0fad7a  server: higher timeout for tests (#15621)  (Johannes Gäßler, 2025-08-27 20:58:09 +02:00)

da54f9f1a2  presets : add qwen3-30B-a3b FIM (#15616)  (Georgi Gerganov, 2025-08-27 15:48:07 +03:00) [b6301]

47373271f9  HIP: Enable support for ggml_backend_cuda_register_host_buffer (#15615)  (uvos, 2025-08-27 13:58:54 +02:00) [b6300]

1bded5a3b3  kv-cache : better estimate of n_kv for multi-sequence batches (#15610)  (Georgi Gerganov, 2025-08-27 13:55:12 +03:00) [b6299]

1e7489745a  CANN: refactor mask handling and improve performance in FA (#15561)  (Chenguang Li, 2025-08-27 17:21:41 +08:00) [b6298]
    * CANN (flash-attn): refactor mask handling and improve performance:
      1. Refactored the mask computation in Flash Attention, unifying the logic without separating prefill and decode.
      2. Optimized performance in non-alibi scenarios by reducing one repeat operation.
      3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16.
    * CANN: fix review
    * CANN: optimize FA BNSD to BSND
    Signed-off-by: noemotiovon <757486878@qq.com>

1cf123a343  ggml-cpu : add basic RVV support for vector f32 ops (#15057)  (xctan, 2025-08-27 16:44:22 +08:00) [b6297]
    * ggml-cpu : add basic RVV support for vector f32 ops
    * ggml-cpu : add RVV support for f32 softmax

fcca2182a1  common : add -m to bash completion for --model [no ci] (#15591)  (Daniel Bevenius, 2025-08-27 10:28:53 +02:00)
    This commit updates the bash completion script to include the -m short option for the --model argument. Currently tab completion works only for the full --model option; it is nice to have it work for the short option as well.

86076f92de  OpenCL: add fused group_norm/norm, mul, add (#15314)  (rmatif, 2025-08-26 23:36:05 -07:00) [b6295]
    * add fused group_norm/norm, mul, add
    * fix spacing
    * revert rms_norm logic
    * fix trailing whitespace

bcbddcd54f  tests : fix test-opt with GGML_BACKEND_DL (#15599)  (Diego Devesa, 2025-08-26 22:14:38 +02:00) [b6294]
8b69686136  SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (#15592)  (Akarshan Biswas, 2025-08-27 00:27:49 +05:30) [b6293]
    The original implementation unconditionally returned true for this operation, leading to a failure when the tensor's first dimension (ne[0]) was not a multiple of WARP_SIZE. This caused a GGML_ASSERT(ncols % WARP_SIZE == 0) failure in ggml-sycl/norm.cpp. This change updates the ggml_backend_sycl_device_supports_op check to return true for GGML_OP_RMS_NORM only when the first dimension of the tensor is a multiple of WARP_SIZE, ensuring the operation can be performed without error.
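Note: the guard described above amounts to a shape check in the SYCL backend's supports_op hook; a simplified sketch (the real check lives in ggml-sycl and covers more operations):

```cpp
#include "ggml.h"

// Only claim support for RMS_NORM when the row size is a multiple of the
// sub-group size, since the kernel asserts ncols % WARP_SIZE == 0.
static bool supports_rms_norm(const struct ggml_tensor * op, int64_t warp_size) {
    return op->op == GGML_OP_RMS_NORM && op->ne[0] % warp_size == 0;
}
```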