1e8659e65a  CANN: Add SOC TYPE printing in cmake configuration (#13837)  (leo-pony, 2025-05-28 11:54:20 +08:00)
a3c30846e4  opencl: add new ops - argsort, div, sub, addrows, sigmoid, group_norm (#13787)  (lhez, 2025-05-27 12:56:08 -07:00)
    * opencl: add `argsort`
    * opencl: add `div`
    * opencl: add `add_rows`
    * opencl: add `sub`
    * opencl: add `sigmoid`, both `f16` and `f32`
    * opencl: add `group_norm`
1701d4c54f  opencl: mark mul_mat f32f32 as supporting non-contiguous tensors (#13790)  (lhez, 2025-05-27 12:53:14 -07:00)
bef8176387  vulkan: use timestamp queries for GGML_VULKAN_PERF (#13817)  (Jeff Bolz, 2025-05-27 18:39:07 +02:00)
    Also change it to be controlled by an env var rather than a cmake flag.
34b7c0439e  cmake : add llama-cparams.cpp to build (#13832)  (Georgi Gerganov, 2025-05-27 19:08:44 +03:00)
f3101a8cc6  SYCL: add gelu_erf kernel (#13749)  (Akarshan Biswas, 2025-05-27 20:52:59 +05:30)
    * SYCL: add gelu_erf kernel
    * refactor code
    * Use scope_op_debug_print
    Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>
1c49c70d07  sync : ggml  (Georgi Gerganov, 2025-05-27 18:05:33 +03:00)
a8ea03d8ad  ggml : add ggml_repeat_4d (#13824)  (Xuan-Son Nguyen, 2025-05-27 15:53:55 +02:00)
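A minimal sketch of how such a helper might be used, assuming it takes the target dimensions directly (the existing `ggml_repeat` instead takes a second tensor whose shape serves as the target); illustrative only, not code from the PR:

```cpp
#include "ggml.h"

int main(void) {
    // small scratch context that also holds the tensor data (illustrative size)
    struct ggml_init_params params = { 16 * 1024 * 1024, NULL, false };
    struct ggml_context * ctx = ggml_init(params);

    // repeat a 4x3 f32 tensor out to an 8x6x2x1 target shape
    struct ggml_tensor * a   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3);
    struct ggml_tensor * out = ggml_repeat_4d(ctx, a, 8, 6, 2, 1); // assumed signature

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, out);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/1);

    ggml_free(ctx);
    return 0;
}
```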
05f6ac6283  ggml : riscv: add xtheadvector support (#13720)  (xctan, 2025-05-27 16:21:36 +03:00)
    * ggml : riscv: add xtheadvector support
    * ggml : clean up some macro usage
bc583e3c63  mtmd : support Qwen 2.5 Omni (input audio+vision, no audio output) (#13784)  (Xuan-Son Nguyen, 2025-05-27 14:06:10 +02:00)
    * mtmd : allow multiple modalities at the same time
    * refactor mtmd tokenizer
    * fix compile
    * ok, missing SinusoidsPositionEmbedding
    * first working version
    * fix style
    * more strict validate of n_embd
    * refactor if..else to switch
    * fix regression
    * add test for 3B
    * update docs
    * fix tokenizing with add_special
    * add more tests
    * fix test case "huge"
    * rm redundant code
    * set_position_mrope_1d rm n_tokens
72b090da2c  docs: remove link for llama-cli function calling (#13810)  (bandoti, 2025-05-27 08:52:40 -03:00)
7fe03e7446  ggml-cpu: x86 feature detection is specific to x86 (#13811)  (Christian Kastner, 2025-05-27 13:18:39 +02:00)
952f3953c1  ggml : allow CUDA graphs when using pipeline parallelism (#13814)  (Diego Devesa, 2025-05-27 13:05:18 +02:00)
81713121ee  kv-cells : track min/max used cells and per-sequence positions (#13808)  (Georgi Gerganov, 2025-05-27 13:49:41 +03:00)
    * kv-cells : track min/max used cells and per-sequence positions
    * kv-cells : fix pos-modification updates for seq_pos
    * kv-cells : add comments
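The gist is to keep per-sequence position bounds (and min/max used cell indices) updated incrementally instead of scanning all cells on each query. A hypothetical sketch of that bookkeeping, not the actual llama.cpp data structure:

```cpp
#include <cstdint>
#include <limits>
#include <map>

// Track the minimum and maximum position stored for each sequence id.
struct seq_pos_range {
    int64_t pos_min = std::numeric_limits<int64_t>::max();
    int64_t pos_max = std::numeric_limits<int64_t>::min();
};

struct kv_seq_tracker {
    std::map<int32_t, seq_pos_range> ranges;

    // call whenever a cell holding position `pos` is assigned to `seq_id`
    void on_assign(int32_t seq_id, int64_t pos) {
        seq_pos_range & r = ranges[seq_id];
        if (pos < r.pos_min) { r.pos_min = pos; }
        if (pos > r.pos_max) { r.pos_max = pos; }
    }
};
```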
f9cd68398b  sampling : make sure samplers return at least 1 token (#13822)  (Georgi Gerganov, 2025-05-27 12:07:52 +03:00)
    * sampling : min-p should always return at least one token
    * sampling : same for typical sampling
    * tests : sampling tests use min_keep == 0
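A minimal sketch of the guarantee described here: a min-p style cutoff that clamps `min_keep` so at least one candidate always survives (illustrative, not the actual llama.cpp sampler code):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct token_prob { int id; float p; };

// `cand` must be sorted by descending probability
static void min_p_keep_at_least_one(std::vector<token_prob> & cand, float p_min, size_t min_keep) {
    if (cand.empty()) {
        return;
    }
    // even min_keep == 0 must leave one token in the pool
    min_keep = std::max<size_t>(min_keep, 1);
    const float threshold = p_min * cand.front().p; // cutoff relative to the top token
    size_t keep = 0;
    while (keep < cand.size() && (keep < min_keep || cand[keep].p >= threshold)) {
        ++keep;
    }
    cand.resize(keep);
}
```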
4f81b33e32  llama : validate seq id batch input (#13809)  (Georgi Gerganov, 2025-05-27 09:40:59 +03:00)
    * llama : validate seq id batch input
    * cont : fix the fix
cdf94a1802  server: --offline mode (#13804)  (Olivier Chafik, 2025-05-26 22:34:27 +01:00)
    * server: --offline mode (env: LLAMA_OFFLINE)
    Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
a26c4cc11e  scripts : add option to compare commits in Debug (#13806)  (Georgi Gerganov, 2025-05-26 22:24:01 +03:00)
    * scripts : add option to compare commits in Debug
    * cont : reuse existing CMAKE_OPTS
4265a87b59  cuda : avoid cuGetErrorString (#13791)  (Georgi Gerganov, 2025-05-26 22:14:52 +03:00)
6f180b915c  SYCL: Add non contiguous support in RMS_NORM and NORM kernels (#13611)  (Akarshan Biswas, 2025-05-26 21:10:36 +05:30)
    * SYCL: Add non contiguous input support to norm kernel
    * refactor and add RMS_NORM non contiguous input support
    * restore subgroup reduction for multi-subgroup thread blocks in norm kernels
    * Swap grid dims of nsamples and nrows
    * Revert "Swap grid dims of nsamples and nrows" (reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf)
    * restore not required changes
    * address review comments: change it to more like SYCL
    * Use a common function to calculate offset
    * remove wrap around logic for handling broadcasts
    * remove static from calculate_offset fn and use ceil_div
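Supporting non-contiguous inputs essentially means addressing elements through the tensor's per-dimension byte strides rather than assuming a packed layout. A hypothetical illustration using ggml-style `nb` strides (not the actual kernel code):

```cpp
#include <cstddef>
#include <cstdint>

// For a contiguous f32 tensor, nb[0] == sizeof(float), nb[1] == ne[0]*nb[0], and so on.
// A non-contiguous view breaks that assumption, so the kernel must index via the strides.
static inline const float * element_ptr(const void * data, const size_t nb[4],
                                        int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    const char * base = static_cast<const char *>(data);
    return reinterpret_cast<const float *>(
        base + (size_t) i0*nb[0] + (size_t) i1*nb[1] + (size_t) i2*nb[2] + (size_t) i3*nb[3]);
}
```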
03f582ae8f  server: fix streaming crashes (#13786)  (Olivier Chafik, 2025-05-26 16:03:57 +01:00)
    * add preludes to content on partial regex match
    * allow all parsers to parse non-tool-call content
    * tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format; still not ideal but hopefully less prone to crash
88c125f2ac  examples/training: Fix file name in README (#13803)  (standby24x7, 2025-05-26 16:55:24 +02:00)
    This patch fixes binary file names in README.md.
    Signed-off-by: Masanari Iida <standby24x7@gmail.com>
d74e94c1b3  server: fix format of streamed tool call deltas (diff name, fix id location) (#13800)  (Olivier Chafik, 2025-05-26 14:56:49 +01:00)
    * fix deltas of tool_call.function.name
    * fix tool_call.id (was in tool_call.function.id!) + add function type
    * add tool_call.type
    * populate empty tool_call.function.arguments on first delta
f13847cfb5  server: fix regression on streamed non-chat completion w/ stops (#13785)  (Olivier Chafik, 2025-05-26 14:16:37 +01:00)
    * more forgiving message diffs: partial stop words aren't erased, full stops are
    * add (slow) server test for completion + stream + stop
79c137f776  examples : allow extracting embeddings from decoder contexts (#13797)  (Georgi Gerganov, 2025-05-26 14:03:54 +03:00)
22229314fc  llama : clarify deprecation message (#13794)  (Georgi Gerganov, 2025-05-26 12:57:50 +03:00)
9012eb9b45  sycl: Add more debug prints (#13640)  (Romain Biessy, 2025-05-26 10:28:53 +02:00)
fef693dc6b  vulkan: mark IM2COL as supporting non-contig (#13783)  (Jeff Bolz, 2025-05-26 06:02:07 +02:00)
2d38b6e400  CANN: Add the basic supports of Flash Attention kernel (#13627)  (Bizhao Shi, 2025-05-26 10:20:18 +08:00)
    * cann: add the basic FA support
    * cann: update the readme
    * cann: update the FlashAttention with PSEShift
    * cann: update the input parameters in FA
    * cann: update the alibi with max_bias
    * cann: add the constraints of softcap
    * cann: update the docs CANN.md
    * cann: fix typo of CANN.md
    * cann: add some comments and update the CANN.md
    * cann: update the inner precise for fusedInferAttention
    * cann: update the constraints of flash_attn_ext on ggml-cann.cpp
    * cann: clean the whitespace
    * cann: add a new endline
e121edc432  server: add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771)  (Olivier Chafik, 2025-05-26 00:30:51 +01:00)
    Co-authored-by: ochafik <ochafik@google.com>
    Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2f099b510f  webui : bump max upload file size to 500MB (#13779)  (Xuan-Son Nguyen, 2025-05-25 18:02:18 +01:00)
aa50ba462f  tests : improve UGM tokenizer test coverage (#13773)  (Sigbjørn Skjæret, 2025-05-25 16:22:29 +02:00)
de2ef53a4b  kv-cache : rework kv_cell (#13706)  (Georgi Gerganov, 2025-05-25 16:34:36 +03:00)
    * kv-cache : rework kv_cell
    * kv-cells : use "shift" instead of "delta" consistently
    * llama : add llama_max_parallel_sequences()
    * kv-cells : update comments [no ci]
    * context : fail upon construction if sequences exceed max value
    * kv-cells : get_pos() -> pos_get() + comments
    * kv-cells : fix tracking of "used" cells
c508256db2  rpc : Fix build on OpenBSD (#13541)  (Percy Piper, 2025-05-25 15:35:53 +03:00)
40aaa8a403  mtmd : add support for Qwen2-Audio and SeaLLM-Audio (#13760)  (Xuan-Son Nguyen, 2025-05-25 14:06:32 +02:00)
    * mtmd : add Qwen2-Audio support
    * small clean up
    * update discussion link
    * clarify mtmd_get_output_embd
    * clarification in multimodal.md
    * fix ultravox bug
    * ggml_cont
a08c1d2845  docs : add Moondream2 pre-quantized link (#13745)  (ddpasa, 2025-05-25 14:04:49 +02:00)
    * Multimodal: Added Moondream2 model and fixed ggml.org link
    * Apply suggestions from code review
    Co-authored-by: name <none@none.com>
    Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
d785f9c1fd  server: fix/test add_generation_prompt (#13770)  (Olivier Chafik, 2025-05-25 10:45:49 +01:00)
    Co-authored-by: ochafik <ochafik@google.com>
4032ca4066  llama : add support for Qwen3 MoE tied word embeddings (#13768)  (Piotr Jasiukajtis, 2025-05-25 10:29:43 +02:00)
515fdbf7ed  SYCL: revert "sycl: simplify bin_bcast_kernel (#13383)" (#13752)  (Akarshan Biswas, 2025-05-25 10:08:37 +03:00)
    Temporarily reverted due to a failing fp16 DIV operation.
    This reverts commit 02cdd2d8b0.
f5cd27b71d  server: streaming of tool calls and thoughts when --jinja is on (#12379)  (Olivier Chafik, 2025-05-25 01:48:08 +01:00)
    * add common_json w/ support for truncated json healing
    * add common_chat_msg_diff
    * partial common_chat_parse
    * refactor parser w/ optionals
    * server: wire chat diffs in stream mode
    * fix trigger of thinking models (must happen after thoughts are closed)
    * fix functionary v3.2 raw python!
    * rename: common_chat_syntax (now contains format)
    * rm common_regex.at_start
    * don't return empty <think></think>
    * accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)
    * fix QwQ 32B tool call parsing after thoughts (hermes2)
    * better logs for grammar triggers
    * consume spaces after parse_json_tool_calls
    * fix required tool calls w/ thinking models that have pre-opened thinking tags
    * fix thinking model's initial trigger + test qwq's template
    * run most test_tool_call tests in stream + non-stream modes
    * make functionary v3.2 parsing more strict (differentiate first match from others)
    * send final diff from server, to close off raw python arguments
    * support partial content streaming in Generic mode
    * tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)
    * Update function-calling.md
    * Update tool_bench.py
    * chat-parser: remove input from exception (llm output may contain PII)
    Co-authored-by: ochafik <ochafik@google.com>
    Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>
a2d02d5793  releases : bundle llvm omp library in windows release (#13763)  (Diego Devesa, 2025-05-25 00:55:16 +02:00)
17fc817b58  releases : enable openmp in windows cpu backend build (#13756)  (Diego Devesa, 2025-05-24 22:27:03 +02:00)
2bd1b30f69  ggml-cpu : set openmp wait time if not set (#13758)  (Diego Devesa, 2025-05-24 22:26:47 +02:00)
259469c4b5  Move GLM4 f32 attention fix to the correct function (#13750)  (0cc4m, 2025-05-24 16:49:12 +02:00)
4c32832c59  ggml : add ggml_gelu_erf() CUDA kernel (#13719)  (Xuan-Son Nguyen, 2025-05-24 13:06:47 +02:00)
    * ggml : add ggml_gelu_erf() CUDA kernel
    * missing semicolon
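For context, the "erf" variant of GELU is the exact formulation, as opposed to the common tanh approximation. A scalar reference in plain C++ (illustrative; not the CUDA kernel from the PR):

```cpp
#include <cmath>

// gelu_erf(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
static inline float gelu_erf_ref(float x) {
    const float inv_sqrt2 = 0.70710678118654752440f; // 1/sqrt(2)
    return 0.5f * x * (1.0f + erff(x * inv_sqrt2));
}
```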
c3a2624339  vocab : fix ugm tokenizer precision (#13743)  (Sigbjørn Skjæret, 2025-05-24 12:29:09 +02:00)
ffd0eae60b  CUDA: fix race condition in FA vector kernels (#13742)  (Johannes Gäßler, 2025-05-24 11:46:19 +02:00)
b775345d78  ci : enable winget package updates (#13734)  (Diego Devesa, 2025-05-23 23:14:00 +03:00)
a70a8a69c2  ci : add winget package updater (#13732)  (Diego Devesa, 2025-05-23 22:09:38 +02:00)
d13d0f6135  hparams : initialize arrays (#13728)  (Georgi Gerganov, 2025-05-23 20:16:13 +03:00)